Recent Methods
In this section, we compare two recent algorithms, namely the works of Tagliasacchi et al. [2015] and Taylor et al. [2016], along with a quantitative analysis.
Robust Articulated ICP
Tagliasacchi et al. [2015] presented an algorithm for tracking the hand in real time using a single depth camera. It fits a 3D model of the hand to a series of depth images by solving an optimization problem involving six different energy functions, which are described in detail below.
Overview
 
Figure 5: Overview of the algorithm used by Tagliasacchi et al. [2015]. For each frame, a 3D point cloud is extracted and a 2D silhouette is then made from it. From this data, a cylindrical model of the hand is then aligned to fit it.
 

An overall view of the algorithm is shown in Figure 5. It begins by acquiring depth images from an RGBD camera such as the Intel RealSense or the Kinect. The 2D and 3D correspondences of the images are then sent to the solver, which fits a 3D articulated model of the hand to the data. This process is fast, so the system can run in real time and track the hands as they move.
Data acquisition
The input is a video stream from a single RGBD camera running at 60 frames per second (fps). Tagliasacchi et al. [2015] experimented with two different sensors, the Creative Interactive Gesture Camera and the PrimeSense Carmine, to study the different types of output that sensors can produce. The outputs are shown in Figure 6; the top row is from the former sensor and the bottom row from the latter. The output from the Creative Interactive Gesture Camera shows a clean 2D projection, but its point cloud is very noisy. The output from the PrimeSense camera, on the other hand, has a smooth point cloud but a 2D projection with gaps.
 
Figure 6: Outputs from the Creative Interactive Gesture Camera and the PrimeSense Carmine. The point cloud and the silhouette are denoted as χs and Ss, respectively.
Tracking model
The model used in this work has 26 degrees of freedom (DOF), as seen in Figure 7. From left to right: the first image is the cylindrical model used for tracking; the second is the skeletal structure showing the kinematic parameters; the third is the BVH skeleton exported to Maya to drive the rendering, in which the DOF of each joint can be seen. The total of 26 DOF is lower than that of the actual human hand, with the thumb modelled at 2 DOF instead of the conventional 3. The last image is the hand model rendered in Maya.
Figure 7: The template hand model used by Tagliasacchi et al. [2015].
 
Preprocessing
In this stage, the 2D projection from the RGBD sensor is computed. The region of interest (ROI) of the hand is first extracted with the help of a wristband worn by the user. The colour of the wristband is segmented from the frame, and a principal axis is estimated from the segmented pixels using PCA. This axis is then used, together with an offset, to obtain the ROI of the hand. The pixels of the point cloud within this region are then used to create a silhouette image Ss.
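As an illustration, a minimal NumPy sketch of this wristband-PCA step (the function names, the assumed 15 cm offset, and the prior colour segmentation are assumptions, not the authors' published code):

```python
import numpy as np

def forearm_axis(wristband_points):
    """Estimate the forearm axis from the 3D points of the
    colour-segmented wristband via PCA (eigen-decomposition
    of the point covariance)."""
    centroid = wristband_points.mean(axis=0)
    centered = wristband_points - centroid
    # The eigenvector with the largest eigenvalue is the
    # principal axis of the forearm. Its sign is ambiguous
    # and must be disambiguated, e.g. towards the hand.
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    axis = eigvecs[:, np.argmax(eigvals)]
    return centroid, axis / np.linalg.norm(axis)

def hand_roi_center(wristband_points, offset=0.15):
    """Offset along the forearm axis (15 cm is an assumed
    value) to reach the approximate hand centre."""
    centroid, axis = forearm_axis(wristband_points)
    return centroid + offset * axis
```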
 
Figure 8: Identification of the ROI of the hand by Tagliasacchi et al. [2015]. First, the position of the wristband is found by colour segmentation; then the orientation of the forearm in 3D space is found using PCA. From this orientation, the position of the hand is found by applying an offset. The 2D silhouette and point cloud data computed from this ROI are shown as well.
 
Optimization
The optimization problem is to minimize six different energy functions, which fall into two major categories: fitting terms and prior terms.

Let F be the input data from the sensor, consisting of a 3D point cloud χs and a 2D silhouette Ss. Given a 3D hand model M with joint parameters θ = {θ1, θ2, ..., θ26}, we aim to recover the pose θ of the user's hand that matches the sensor input F by solving

θ* = argminθ ( E3D + E2D + Ewrist + Epose + Ekinematic + Etemporal )

where E3D, E2D and Ewrist are the fitting terms and Epose, Ekinematic and Etemporal are the prior terms.
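To make the structure concrete, here is a minimal sketch of such a weighted sum of energies, with a generic optimizer standing in for the authors' real-time solver (all names here are placeholders):

```python
from scipy.optimize import minimize

def total_energy(theta, frame, weights, terms):
    """Weighted sum of the fitting and prior energies.
    `terms` maps a name to a function E(theta, frame) -> float."""
    return sum(weights[name] * E(theta, frame) for name, E in terms.items())

# Hypothetical usage: each E_* would implement one of the six
# energies described in this section.
# terms = {"3D": E_3d, "2D": E_2d, "wrist": E_wrist,
#          "pose": E_pose, "kinematic": E_kin, "temporal": E_temp}
# result = minimize(total_energy, theta0,
#                   args=(frame, weights, terms), method="L-BFGS-B")
```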

1. Point cloud alignment
    
    The 3D energy term E3D is calculated in the spirit of ICP as minimizing

    E3D = ω1 Σx∈χs | x − πM(x, θ) |2

    where x represents a 3D point of χs, |.|2 denotes the l2 norm, πM(x, θ) is the projection of x onto the hand model M with hand pose θ, and ω1 is the weight given to this energy (this ωi notation applies to the further energy functions as well). The correspondences are devised such that they are the closest points on the front-facing part of M, which prevents local optima, as seen in Figure 9; a code sketch follows the figure.
    
 
Figure 9: Example of the correspondence computation. The circles represent the fingers of the human hand and the black dots are the depth map output. Part 9a shows the conventional method, which works in this scenario. In part 9b, however, the method fails since the closest point converges to a local optimum. This problem is resolved by the new method, part 9c, which takes only the front-facing points into account.
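    As promised above, a simplified sketch of the 3D term. It uses a KD-tree nearest-neighbour query against points sampled on the model surface as a stand-in for the projection πM; the front-facing correspondence selection of Figure 9c is not reproduced here:

```python
import numpy as np
from scipy.spatial import cKDTree

def e_3d(cloud_points, model_surface_points, w1=1.0):
    """ICP-style 3D alignment energy: for every sensor point x,
    find the closest model surface point and sum the l2 distances."""
    tree = cKDTree(model_surface_points)
    dists, _ = tree.query(cloud_points)   # nearest-neighbour distances
    return w1 * np.sum(dists)
```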
   
2. Silhouette alignment
    
    The 2D silhouette energy term E2D is calculated from the sensor data as

    E2D = ω2 Σp∈Sr | p − πSs(p, θ) |2

    where p is a 2D point of the rendered silhouette Sr and πSs(p, θ) is the projection of p onto the sensor silhouette Ss. The use of this energy function is shown in Figure 10; if it is not used, errors appear in the positions of occluded fingers. A sketch follows the figure.
 
 
 
Figure 10: Illustration of the effect of the 2D energy function.
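    As noted above, one common way to evaluate such a silhouette term is with a distance transform of the sensor silhouette, so that every rendered-silhouette pixel can look up its distance to the nearest sensor-silhouette pixel. The following sketch assumes this implementation, which the paper does not spell out:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def e_2d(rendered_mask, sensor_mask, w2=1.0):
    """2D silhouette energy: distances from rendered-silhouette
    pixels to the nearest sensor-silhouette pixel."""
    # distance_transform_edt measures distance to the nearest zero
    # entry, so invert the sensor mask to measure distance to it.
    dist_to_sensor = distance_transform_edt(~sensor_mask.astype(bool))
    return w2 * np.sum(dist_to_sensor[rendered_mask.astype(bool)])
```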
   
3. Wrist alignment
    
    Minimizing this function ensures that the hand does not twist and turn around the wrist joint during tracking:

    Ewrist = ω3 | k0(θ) − πl(k0(θ)) |2

    where k0(θ) is the 3D position of the wrist joint, l is the line extracted by computing the PCA of the wristband (see Figure 8), and πl(k0(θ)) is the projection of the wrist joint onto l.
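    A minimal sketch of the point-to-line distance underlying this term (the names are assumed):

```python
import numpy as np

def e_wrist(k0, line_point, line_dir, w3=1.0):
    """Distance from the wrist joint position k0 to the forearm
    line (a point and direction from the wristband PCA)."""
    d = line_dir / np.linalg.norm(line_dir)
    v = k0 - line_point
    closest = line_point + np.dot(v, d) * d   # projection onto the line
    return w3 * np.linalg.norm(k0 - closest)
```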
   
4. Pose Space prior (data-driven)
    
    Using only the data provided by the sensors would be inaccurate, due to noise created by the sensor itself or by other external factors. The output can then even contain hand poses that are not actually attainable by a real human, such as a finger bent in the wrong direction. These kinds of problems are tackled by the use of pose datasets. An example is the public dataset made by Schroder et al. [2014], which Tagliasacchi et al. [2015] use for their algorithm. The energy function is

    Epose = ω4 | θ − (μ + πP θ̃) |2

    where μ is the PCA mean. The matrix πP, i.e. the PCA basis, reconstructs the hand posture from the low-dimensional space. To avoid unlikely hand poses in the subspace, the PCA weights θ̃ are regularized by another function,

    ω5 | Σ θ̃ |2

    where Σ is a diagonal matrix containing the inverse of the standard deviations of the PCA basis.
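    A NumPy sketch of the two pose-space energies, assuming a PCA basis P (columns are basis vectors), mean mu and per-component standard deviations sigma learned from the dataset:

```python
import numpy as np

def e_pose(theta, theta_tilde, P, mu, w4=1.0):
    """Penalize deviation of the full pose from its reconstruction
    out of the low-dimensional PCA subspace."""
    reconstruction = mu + P @ theta_tilde
    return w4 * np.sum((theta - reconstruction) ** 2)

def e_pose_reg(theta_tilde, sigma, w5=1.0):
    """Regularize the PCA weights by the inverse standard
    deviations (the diagonal of the matrix called Sigma above)."""
    return w5 * np.sum((theta_tilde / sigma) ** 2)
```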
    
5. Kinematic prior
    
    The PCA model alone is not enough for finding proper poses of the human hand, since the output of PCA is symmetric around the mean, which can result in fingers bending in the opposite direction. In addition, the model cannot account for self-occluded fingers. Hence the kinematic prior is used:

    Ecollision = ω6 Σ(i,j) Χ(i,j) ( r − d(ci, cj) )2

    where the function d(ci, cj) is the Euclidean distance between the two cylinders ci and cj and r is the sum of the radii of the cylinders. Χ(i,j) is an indicator function which evaluates to one if the cylinders i and j are colliding, and to zero otherwise.
    
    To prevent the hand from reaching an impossible posture by overbending the joints, the joint angles of the hand model are limited:

    Ebound = ω7 Σi [ Χ(θi < θimin) (θimin − θi)2 + Χ(θi > θimax) (θi − θimax)2 ]

    where each hand joint i is associated with conservative bounds θimin and θimax, and the indicator functions Χ evaluate to one when the corresponding bound is violated.
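    A sketch of both kinematic penalties. The cylinder-to-cylinder distance is approximated here by sampling points along each axis segment; the authors' exact computation is not published as code:

```python
import numpy as np

def collision_energy(axes, radii, pairs, w6=1.0, samples=8):
    """Approximate cylinder-cylinder penetration. `axes[i]` is a
    (start, end) pair of 3D points, `radii[i]` the cylinder radius,
    `pairs` the list of non-adjacent cylinder index pairs."""
    t = np.linspace(0.0, 1.0, samples)[:, None]
    energy = 0.0
    for i, j in pairs:
        pi = axes[i][0] + t * (axes[i][1] - axes[i][0])
        pj = axes[j][0] + t * (axes[j][1] - axes[j][0])
        # Minimum distance between the two sampled axis segments.
        d = np.min(np.linalg.norm(pi[:, None, :] - pj[None, :, :], axis=2))
        r = radii[i] + radii[j]
        if d < r:                      # indicator X(i, j): collision
            energy += (r - d) ** 2
    return w6 * energy

def joint_limit_energy(theta, theta_min, theta_max, w7=1.0):
    """Quadratic penalty whenever a joint angle leaves its
    conservative bounds."""
    below = np.maximum(theta_min - theta, 0.0)
    above = np.maximum(theta - theta_max, 0.0)
    return w7 * np.sum(below ** 2 + above ** 2)
```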
    
6. Temporal prior
    
    This energy function addresses the jitter problem by enforcing smoothness on the transition of hand poses:

    Etemporal = ω8 Σi ( | k̇i |2 + | k̈i |2 )

    where k̇i and k̈i are the first and second temporal derivatives (velocity and acceleration) of the positions of the joints ki of the individual fingers and thumb (see k1, k2, k3 in Figure 7).
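    A sketch of this velocity-and-acceleration penalty as finite differences over the last three frames (an assumed discretization):

```python
import numpy as np

def e_temporal(joints_t, joints_t1, joints_t2, w8=1.0):
    """Penalize the first and second finite differences of the
    joint positions, where joints_t1 and joints_t2 come from the
    two previous frames."""
    velocity = joints_t - joints_t1
    acceleration = joints_t - 2.0 * joints_t1 + joints_t2
    return w8 * (np.sum(velocity ** 2) + np.sum(acceleration ** 2))
```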
Evaluation
The evaluation was done on the Dexter-1 dataset and shows that the algorithm performs better than the other tested algorithms, namely those of Tang et al. [2015] and Sridhar et al. [2013]. The results are shown in Figure 11.
 
Figure 11: Report of the evaluation of Tagliasacchi et al. [2015] against Tang et al. [2015] and Sridhar et al. [2013]. The measurements report the root mean square error of fingertip placement. The green, blue and purple bars show the algorithms of the other authors as published, while the red and orange bars show the algorithm of Tagliasacchi et al. [2015] with and without reinitialization, respectively.
 
Limitations
Although the algorithm runs well and produces positive results, it relies heavily on the features of the hand for tracking the 3D model, which means that poses with fewer features, such as a clenched fist, are not tracked well. Another flaw is that the algorithm is dependent on the sensor: when a lower-end sensor is used, the 3D tracking output degrades and contains errors.
Joint, Continuous Optimization of Pose and Correspondences
Taylor et al. [2016] worked on a tracking model of a similar flavour to Tagliasacchi et al. [2015] but made several changes to the model fitting, namely an objective function with more discriminative properties, a model with a smooth surface that provides the gradients required for non-linear optimization, and correspondences between the observed data and the model surface that are optimized jointly with the pose.
 
Overall
Given a sensor which provides a stream of depth images, each frame is first preprocessed by means of hand segmentation and fingertip detection. Starting points for the optimization are then generated by reinitialization and by temporal prediction, and the energies are minimized to report the best pose, i.e. the one with the least energy.
Energy Function
Taylor et al. [2016] defined a smooth energy function, optimized using the Gauss-Newton method, as a weighted sum of different terms that together ensure an efficient and plausible fit. These terms are:

    data: The data from the point cloud xn must match the rendered surface points Sθ.
    bg: Points from the model must not blend with the background.
    pose: The retrieved pose θ must be a plausible pose.
    limit: The retrieved pose θ must follow the bounds set by an actual human finger joint.
    temp: The movement of the hand must not be jittery and temporal data must be smooth.
    int: The hand model must not collide with itself.
    tips: The fingertip of the model hand must match the fingertips derived from the input data.

Hence the total energy is

E(θ) = Σt∈Terms λt Et(θ)

where θ is the pose of the hand, λt are the weights, and Terms corresponds to the terms discussed above (data, bg, pose, limit, temp, int and tips).
 
Figure 12: Cross section of the fingers (green) and the data points (orange) to be matched. Without the normals, the updates cause the wrong finger to be selected (leftmost image). This problem is solved by the normal term (centre image). The same applies to a planar model (rightmost image).
data: Data Term
    
    This term compares the data points xn of the input cloud, with normals nn, to the model surface S(θ):

    Edata = Σn ( | xn − S(un; θ) |2 / σx2 + | nn − n(un; θ) |2 / σn2 )

    where un is the correspondence of data point xn on the surface, n(un; θ) is the surface normal there, and σx2, σn2 are estimates of the noise variance on points and normals. The normal part allows the energy to use surface orientation to select better locations on the model even when they are far from the data. See Figure 12.
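    A sketch of a point-plus-normal residual of this form; the correspondence and the surface arguments stand in for the smooth model surface of Taylor et al.:

```python
import numpy as np

def data_residual(x_n, n_n, surf_point, surf_normal,
                  sigma_x=1.0, sigma_n=1.0):
    """Combined residual for one data point: distance to the
    corresponding surface point, plus disagreement between the
    data normal and the surface normal, each scaled by its
    noise standard deviation."""
    pos_term = np.sum((x_n - surf_point) ** 2) / sigma_x ** 2
    nrm_term = np.sum((n_n - surf_normal) ** 2) / sigma_n ** 2
    return pos_term + nrm_term
```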
    
bg: Background Penetration Penalty
    
    The data term does not prevent the model from entering the background. Hence the bg term is used:

    Ebg = Σm D( π(sm) )2

    where sm are points sampled on the model surface, π is the projection to the image plane and D is the image-space distance transform of the segmented hand region.
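    A sketch of such a background-penetration penalty built on an image-space distance transform (the squaring and the mask conventions are assumptions):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def e_bg(model_pixels, hand_mask):
    """Penalize model surface points whose image projections land
    outside the segmented hand region, proportionally to their
    distance from it. `model_pixels` holds integer (row, col)
    projections assumed to lie inside the image."""
    # Distance of every pixel to the nearest hand pixel.
    D = distance_transform_edt(~hand_mask.astype(bool))
    rows, cols = model_pixels[:, 0], model_pixels[:, 1]
    return np.sum(D[rows, cols] ** 2)
```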
    
pose: Poses prior
    
    The pose term encourages the joints to take plausible values when occlusion occurs in the data. It uses a multivariate Gaussian distribution with mean pose μ and covariance matrix Σ:

    Epose = ( θ − μ )T Σ−1 ( θ − μ )
    
limit: Joint Limit Constraints
    
    This term prevents the fingers from bending beyond the capabilities of an actual human hand. With vectors a and b of joint-angle minima and maxima,

    Elimit = Σi ε( ai, θi, bi )2

    where ε(a, x, b) = max(0, a-x) + max(x-b, 0).
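    The ε penalty translates directly into code; a minimal sketch:

```python
import numpy as np

def eps(a, x, b):
    """eps(a, x, b) = max(0, a - x) + max(x - b, 0): zero inside
    the interval [a, b], linear outside it."""
    return np.maximum(a - x, 0.0) + np.maximum(x - b, 0.0)

def e_limit(theta, theta_min, theta_max):
    """Sum of squared violations of the per-joint angle bounds."""
    return np.sum(eps(theta_min, theta, theta_max) ** 2)
```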
    
temp: Temporal Prior
    
    This term ensures temporal consistency, and hence smooth movement of the hand: the pose of one frame should be near the pose of the previous frame,

    Etemp = | θ − θt−1 |2

    where θt−1 is the pose estimated in the previous frame.
    
int: Self-Intersection Penalty
    
    This term discourages finger self-intersection:

    Eint = Σ(s,t)∈P hst(θ)2

    where hst(θ) measures the amount of penetration between spheres s and t, computed from the sphere centres ci and radii. P contains the pairs of spheres from {S1, S2, ...} that are not directly adjacent to each other, i.e. not in each other's neighbourhood.
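    A sketch of the sphere-penetration penalty, with h computed as the overlap of the two spheres (this concrete form of hst is an assumption):

```python
import numpy as np

def e_int(centers, radii, pairs):
    """Sum of squared penetrations over the non-adjacent sphere
    pairs in P. Penetration is how far two spheres overlap:
    h = max(0, r_s + r_t - |c_s - c_t|)."""
    energy = 0.0
    for s, t in pairs:
        gap = np.linalg.norm(centers[s] - centers[t])
        h = max(0.0, radii[s] + radii[t] - gap)
        energy += h ** 2
    return energy
```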
    
tips: Fingertip Term
    
    The fingertips are matched with

    Etips = Σf softmin( sf )

    where sf is the vector filled with the distance values for fingertip f, and softmin is a differentiable operator that smoothly approximates the minimum.
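    One standard differentiable softmin that could play the role described here; the exact operator used by Taylor et al. may differ, so this log-sum-exp form is an assumption:

```python
import numpy as np

def softmin(v, beta=10.0):
    """Smooth, differentiable approximation of min(v):
    -(1/beta) * log(sum(exp(-beta * v))). As beta grows it
    approaches the hard minimum."""
    v = np.asarray(v, dtype=float)
    m = v.min()               # subtract the min for numerical stability
    return m - np.log(np.sum(np.exp(-beta * (v - m)))) / beta
```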
 
Evaluation
The algorithm was compared directly with Tagliasacchi et al. [2015], Tang et al. [2015] and Sridhar et al. [2013], as well as with other state-of-the-art algorithms. The accuracy is calculated by reporting the proportion of frames for which either the average or the maximum marker position error was below a given threshold.
Figure 13: Report of the evaluation of Taylor et al. [2016] against Tagliasacchi et al. [2015], Tang et al. [2015], Oberweger et al. [2015], Tompson et al. [2014] and Sridhar et al. [2013]. The dataset used is the DEXTER dataset.
 