
In this section we compare two recent algorithms, namely the works of Tagliasacchi et al. [2015] and Taylor et al. [2016], along with a quantitative analysis.

Tagliasacchi et al. [2015] presented an algorithm for tracking the hand in real-time using a single depth camera. It fits a 3D model of the hand to a stream of depth images by minimizing an objective composed of six different energy functions, which are described in detail below.

An overall view of the algorithm is shown in Figure 5. It begins by retrieving the tracking data from the user by means of depth images from RGBD cameras such as the Intel RealSense or the Kinect. The 2D and 3D correspondences of the images are then sent to the solver, which fits a 3D articulated model of the hand to the data. This process is fast, so the system can run in real-time to track the hands as they move.

The input is a video from a single 60 fps (frames per second) RGBD depth camera. Tagliasacchi et al. [2015] used two different sensors to study the different types of output that sensors can produce, namely the Creative Interactive Gesture Camera and the PrimeSense Carmine. Their outputs are shown in Figure 6: the top row is from the former sensor and the bottom row from the latter. The output from the Creative Interactive Gesture Camera shows a clear 2D projection, but the point cloud is very noisy. Conversely, the output from the PrimeSense camera has a smooth point cloud, but the 2D projection has some gaps.

The model used in this article is a 26-DOF model, as seen in Figure 7. In order, the first image is the cylindrical model used for tracking. The second image is the skeletal structure showing the kinematic parameters; here the separate DOF of each joint can be seen. The total of 26 is lower than that of the actual human hand, with the thumb modelled at 2 DOF instead of the conventional 3 DOF. The third image is the BVH skeleton exported to Maya to drive the rendering, and the last image is the hand model rendered in Maya.

In this stage, the 2D projection from the RGBD sensor is computed. First, the region of interest (ROI) of the hand is extracted with the help of PCA: a wristband is worn on the hand for this purpose, its colour is segmented from the frame, and a principal axis is computed from the segmented pixels with PCA. This axis is then used to obtain an ROI of the hand. The pixels of the point cloud in this region are then used to create a silhouette image S_{s}.
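The wristband principal-axis step can be sketched with a few lines of NumPy; the point values below are hypothetical stand-ins for colour-segmented wristband points, not data from the paper:

```python
import numpy as np

def principal_axis(points):
    """Centroid and unit direction of the dominant PCA axis of an
    (N, 3) array of (hypothetical) wristband points."""
    centroid = points.mean(axis=0)
    # The first right singular vector of the centred data is the
    # direction of maximum variance.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[0]

# Hypothetical wristband points scattered roughly along the x-axis.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.1, 0.0],
                [2.0, -0.1, 0.0], [3.0, 0.0, 0.1]])
centre, axis = principal_axis(pts)
```

The returned axis can then be extended from the wristband towards the fingers to crop the hand ROI.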

The optimization problem at hand is to minimize six different energy functions, which fall into two major categories: fitting terms and prior terms.

Let *F* be the input data coming from the sensor, which consists of a 3D point cloud χ_{s} and a 2D silhouette S_{s}. Given a 3D hand model *M* with joint parameters θ = {θ_{1}, θ_{2}, ... , θ_{26}}, we aim to recover the pose θ of the user's hand that matches the sensor input data *F*.

Let

E(θ) = E_{3D} + E_{2D} + E_{wrist} + E_{pose} + E_{kinematic} + E_{temporal}

Where E_{3D}, E_{2D} and E_{wrist} are the fitting terms that align the model to the sensor data, and E_{pose}, E_{kinematic} and E_{temporal} are the prior terms that keep the estimated pose plausible. The pose is recovered by minimizing E(θ) over θ.
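To make the structure of this weighted-sum minimization concrete, here is a minimal NumPy sketch with toy quadratic stand-ins for the energies; the target values, weights, step size and iteration count are all illustrative, not from the paper:

```python
import numpy as np

# Toy stand-ins for the fitting and prior energies; the real terms
# (E_3D, E_2D, E_wrist, E_pose, E_kinematic, E_temporal) are far more
# involved -- this only illustrates the weighted-sum structure.
target = np.array([0.3, -0.2])            # pretend data-fitting optimum

def grad_total(theta, w_fit=1.0, w_prior=0.1):
    grad_fit = 2.0 * (theta - target)     # gradient of ||theta - target||^2
    grad_prior = 2.0 * theta              # gradient of ||theta||^2
    return w_fit * grad_fit + w_prior * grad_prior

theta = np.zeros(2)
for _ in range(500):                      # plain gradient descent
    theta = theta - 0.1 * grad_total(theta)
```

With these quadratics the minimizer has the closed form `target / 1.1`, so the loop's result can be checked directly; the actual system uses a much faster dedicated solver.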

The 3D energy term E_{3D} aligns the sensor point cloud with the model by penalizing the distance from each data point to its closest point on the model:

E_{3D}(θ) = Σ_{x ∈ χ_{s}} ||x − Π_{M}(x, θ)||^{2}

where x represents a 3D point of χ_{s} and Π_{M}(x, θ) is its closest-point projection onto the model *M* in pose θ.
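A brute-force sketch of such a closest-point energy, on small hypothetical point sets (a real-time system would use an accelerated search structure rather than all-pairs distances):

```python
import numpy as np

def e_3d(cloud, model_pts):
    """Sum of squared distances from each sensor point to its closest
    model point -- a brute-force stand-in for the ICP-style
    closest-point search."""
    # (N, M) matrix of pairwise squared distances.
    d2 = ((cloud[:, None, :] - model_pts[None, :, :]) ** 2).sum(-1)
    return float(d2.min(axis=1).sum())

# Hypothetical sensor points and sampled model points.
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
model = np.array([[0.0, 0.0, 0.5], [1.0, 0.0, 0.0]])
energy = e_3d(cloud, model)
```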

The 2D silhouette energy term E_{2D} penalizes parts of the rendered model that fall outside the sensor silhouette:

E_{2D}(θ) = Σ_{p ∈ S_{r}} ||p − Π_{S_{s}}(p)||^{2}

Here p is a 2D point of the rendered silhouette S_{r} lying outside the sensor silhouette S_{s}, and Π_{S_{s}}(p) is the closest point of S_{s} to p.
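One convenient way to evaluate a silhouette penalty of this kind is an image-space distance transform; this sketch uses SciPy's `distance_transform_edt` on a toy 5×5 silhouette, and the exact penalty form is illustrative rather than the paper's implementation:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

# Toy sensor silhouette S_s and a rendered model silhouette with one
# pixel sticking out of it.
sensor = np.zeros((5, 5), dtype=bool)
sensor[1:4, 1:4] = True
rendered = sensor.copy()
rendered[2, 4] = True

# For every pixel, the distance to the nearest sensor-silhouette pixel;
# it is zero inside S_s, so only protruding rendered pixels are penalized.
dt = distance_transform_edt(~sensor)
energy = float((dt[rendered] ** 2).sum())
```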

Minimizing the wrist energy term E_{wrist} ensures that the hand does not twist and turn around the wrist joint during tracking:

E_{wrist}(θ) = ||k_{0}(θ) − Π_{l}(k_{0}(θ))||^{2}

Here k_{0}(θ) is the 3D position of the wrist joint, *l* is the line extracted by computing the PCA of the wristband (see Figure 8), and Π_{l} denotes projection onto that line.

Using only the data provided by the sensors would be inaccurate, due to noise created by the sensor itself or by other external factors. The output can therefore even contain hand poses that are not actually achievable by a real human, such as a finger bent in the wrong direction. These kinds of problems are tackled with pose datasets; an example is the public dataset made by Schröder et al. [2014], which Tagliasacchi et al. [2015] use for their algorithm. The energy functions are

E_{pose}(θ, θ^{~}) = ||θ − (μ + π_{P} θ^{~})||^{2}

where μ is the PCA mean. The matrix π_{P}, i.e. the PCA basis, reconstructs the hand posture from the low-dimensional space. To avoid unlikely hand poses in the subspace, the PCA weights θ^{~} are regularized by another function,

E_{θ^{~}} = ||Σ θ^{~}||^{2}

where Σ is a diagonal matrix containing the inverses of the standard deviations of the PCA basis.
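The subspace reconstruction and its regularizer can be sketched as follows; the dimensions, random matrices and standard deviations are hypothetical placeholders for a learned PCA model:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: 26 joint angles, a 10-dimensional PCA subspace.
mu = rng.normal(size=26)                              # PCA mean pose
basis = np.linalg.qr(rng.normal(size=(26, 10)))[0]    # orthonormal basis
inv_std = np.diag(1.0 / np.linspace(1.0, 0.1, 10))    # Sigma (inverse stds)

theta_tilde = rng.normal(size=10)      # low-dimensional pose weights
theta = mu + basis @ theta_tilde       # reconstructed 26-DOF pose
e_reg = float(np.sum((inv_std @ theta_tilde) ** 2))   # regularizer
```

Because the basis is orthonormal, the low-dimensional weights can be recovered from the full pose by projecting `theta - mu` back onto the basis.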

The PCA model alone is not enough for finding proper poses of the human hand, since the output of PCA is symmetric around the mean, which can result in fingers bending in the opposite direction. In addition, the model cannot handle self-occluded fingers. Hence the kinematic prior is used:

E_{kinematic}(θ) = Σ_{i,j} Χ(i,j) (r − d(c_{i}, c_{j}))^{2}

where the function d(c_{i}, c_{j}) is the Euclidean distance between the two cylinders c_{i} and c_{j}, and *r* is the sum of the radii of the cylinders. Χ(i,j) is an indicator function that evaluates to one if the cylinders *i* and *j* are colliding, and to zero otherwise.

To prevent the hand from reaching an impossible posture by overbending the joints, the joint angles of the hand model are limited:

E_{limits}(θ) = Σ_{i} Χ(θ_{i} < θ_{i}^{min}) (θ_{i}^{min} − θ_{i})^{2} + Χ(θ_{i} > θ_{i}^{max}) (θ_{i} − θ_{i}^{max})^{2}

where each hand joint *i* is associated with conservative bounds θ_{i}^{min} and θ_{i}^{max}, and the indicator functions Χ(·) activate the corresponding penalty only when a bound is violated.
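A minimal sketch of such a joint-limit penalty, with hypothetical bounds; the clamping plays the role of the indicator functions:

```python
import numpy as np

def e_limits(theta, lo, hi):
    """Quadratic penalty, active only when a joint angle leaves
    its [lo, hi] range."""
    below = np.maximum(lo - theta, 0.0)   # violation of the lower bound
    above = np.maximum(theta - hi, 0.0)   # violation of the upper bound
    return float(np.sum(below ** 2 + above ** 2))

# Hypothetical bounds for two joints (radians).
lo = np.array([0.0, -1.0])
hi = np.array([1.5, 1.0])
inside = e_limits(np.array([0.5, 0.0]), lo, hi)    # within bounds
outside = e_limits(np.array([2.0, 0.0]), lo, hi)   # first joint too far
```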

The temporal energy term addresses the jitter problem by enforcing smoothness on the transition of hand poses:

E_{temporal}(θ) = Σ_{i} ||k̇_{i}||^{2} + ||k̈_{i}||^{2}

where k̇_{i} and k̈_{i} are the velocity and acceleration of the joint positions k_{i} of the individual fingers and thumb (see k_{1}, k_{2}, k_{3} in Figure 7), estimated over consecutive frames.
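A sketch of a finite-difference smoothness penalty over the three latest frames; the paper's exact weighting of the velocity and acceleration terms may differ:

```python
import numpy as np

def e_temporal(k_t, k_t1, k_t2):
    """Squared velocity and acceleration of a joint position,
    estimated by finite differences over three consecutive frames."""
    vel = k_t - k_t1                   # first-order difference
    acc = k_t - 2.0 * k_t1 + k_t2      # second-order difference
    return float(np.sum(vel ** 2) + np.sum(acc ** 2))

# A joint moving at constant velocity: only the velocity term fires.
smooth = e_temporal(np.array([0.2, 0.0, 0.0]),
                    np.array([0.1, 0.0, 0.0]),
                    np.zeros(3))
```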

The evaluation was done on the Dexter-1 dataset, which shows that the algorithm performs better than the other evaluated algorithms, namely Tang et al. [2015] and Sridhar et al. [2013]. The results are shown in Figure 10.

Although the algorithm runs well with positive results, it has a flaw: it relies heavily on features of the hand for tracking the 3D model. This means that poses with few features, such as a clenched fist, are not tracked well by the system. Another flaw is that the algorithm is also dependent on the sensor: when a lower-end sensor is used, the 3D tracking output degrades and contains errors.

Taylor et al. [2016] worked on a tracking model in a similar spirit to Tagliasacchi et al. [2015], but made several changes to the model fitting, namely an objective function with more discriminative properties, a model with a smooth surface that provides the gradients required for non-linear optimization, and a continuous parameterization of the correspondences between the observed data and the model surface, hence achieving a joint optimization over pose and correspondences.


Given a sensor which provides a stream of depth images, the frames are preprocessed using hand segmentation and fingertip detection to create the inputs to the tracker.

In overview, the frames from the input sensor are first preprocessed by means of colour segmentation and fingertip detection. Starting points are then generated by reinitialization and temporal prediction. The energies are then minimized, and the pose with the least energy is reported as the best pose.

Taylor et al. [2016] defined a smooth energy function, optimizable with the Gauss-Newton method, as a weighted sum of different terms that together ensure an efficient and plausible fit. These terms are:

data: The data from the point cloud x_{n} must match the rendered surface points S_{θ}.

bg: Points from the model must not extend into the background.

pose: The retrieved pose θ must be a plausible pose.

limit: The retrieved pose θ must follow the bounds set by an actual human finger joint.

temp: The movement of the hand must not be jittery and temporal data must be smooth.

int: The hand model must not collide with itself.

tips: The fingertip of the model hand must match the fingertips derived from the input data.

Hence the total function is

E(θ) = Σ_{term} λ_{term} E_{term}(θ)

where θ is the pose of the hand, *term* ranges over each of the terms discussed above (data, bg, pose, limit, temp, int and tips), and each λ_{term} is a scalar weight.


The data term compares the data points x_{n} of the input cloud, together with their normals n_{n}, to corresponding surface points s_{n}(θ) and normals n̂_{n}(θ) on the model surface S_{θ}:

E_{data}(θ) = Σ_{n} (||x_{n} − s_{n}(θ)||^{2} / σ_{x}^{2} + ||n_{n} − n̂_{n}(θ)||^{2} / σ_{n}^{2})

where σ_{x}^{2} and σ_{n}^{2} are estimates of the noise variance on points and normals. The normal term allows the energy to use surface orientation to select better locations on the model even when they are far from the data. See Figure 11.

The data term does not prevent the model from entering the background; hence the bg term is used.

where π is the projection to the image plane and *D* is the image-space distance transform.

The pose term is used so that the joints take plausible values when occlusion occurs in the data. It uses a multivariate Gaussian distribution with mean pose μ and covariance matrix Σ.

In order to prevent the fingers from bending beyond the capabilities of an actual human hand, the limit term is used. It uses vectors of joint-angle minima *a* and maxima *b*:

E_{limit}(θ) = Σ_{i} ε(a_{i}, θ_{i}, b_{i})^{2}

where *ε(a, x, b) = max(0, a-x) + max(x-b, 0)*
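The ε operator above can be written directly:

```python
def eps(a, x, b):
    """Penalty that is zero inside [a, b] and grows linearly outside."""
    return max(0.0, a - x) + max(x - b, 0.0)

inside = eps(0.0, 0.5, 1.0)    # within bounds -> no penalty
below = eps(0.0, -0.3, 1.0)    # 0.3 below the lower bound
above = eps(0.0, 1.4, 1.0)     # 0.4 above the upper bound
```

Squaring this piecewise-linear penalty, as the limit term does, keeps the energy differentiable at the bounds.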

The temp term ensures temporal consistency, and hence smooth movement of the hand: the pose estimated for one frame should be near the pose of the previous frame.

In order to discourage finger self-intersection, the int term is used:

E_{int}(θ) = Σ_{(s,t) ∈ P} h_{st}(θ)^{2}

where h_{st}(θ) measures the amount of penetration between spheres s and t. *P* contains the pairs of spheres {S_{1}, S_{2}, ...} that are not directly adjacent to each other, i.e. not in each other's neighbourhood, and c_{i} is the centre of sphere S_{i}.
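A sketch of a sphere-penetration measure in the spirit of h_{st}, with hypothetical centres and radii:

```python
import numpy as np

def penetration(c_s, r_s, c_t, r_t):
    """Overlap depth of two spheres: positive when they intersect,
    zero once they are separated."""
    gap = np.linalg.norm(c_s - c_t) - (r_s + r_t)
    return max(0.0, -gap)

apart = penetration(np.zeros(3), 0.5, np.array([2.0, 0.0, 0.0]), 0.5)
overlap = penetration(np.zeros(3), 0.5, np.array([0.8, 0.0, 0.0]), 0.5)
```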

The fingertips are matched with the following function

where s_{f} is the vector of candidate values for fingertip *f*, and softmin is a differentiable approximation of the minimum operator.
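A common differentiable stand-in for the minimum is an exponentially weighted average; this sketch shows one plausible form (the paper's exact operator may differ):

```python
import numpy as np

def softmin(values, alpha=10.0):
    """Differentiable approximation of min(): an exponentially
    weighted average that concentrates on the smallest entry as
    alpha grows."""
    v = np.asarray(values, dtype=float)
    w = np.exp(-alpha * (v - v.min()))   # shift for numerical stability
    return float(np.sum(w * v) / np.sum(w))

approx = softmin([3.0, 1.0, 2.0], alpha=50.0)   # close to min = 1.0
```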

The algorithm was compared directly with Tagliasacchi et al. [2015], Tang et al. [2015] and Sridhar et al. [2013], as well as with other state-of-the-art algorithms. Accuracy is measured by reporting the proportion of frames for which either the average or the maximum marker position error is below a given threshold.
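This accuracy metric can be sketched as a simple thresholding over per-frame errors; the error values below are hypothetical, not results from either paper:

```python
import numpy as np

def fraction_within(errors, threshold):
    """Fraction of frames whose per-frame error (average or maximum
    marker position error) falls below the threshold."""
    return float(np.mean(np.asarray(errors, dtype=float) < threshold))

# Hypothetical per-frame maximum marker errors, in millimetres.
max_err = [5.0, 12.0, 30.0, 8.0]
score = fraction_within(max_err, threshold=15.0)
```

Sweeping the threshold and plotting this fraction yields the accuracy curves typically shown in such comparisons.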