
Three-dimensional hand tracking has attracted increasing attention and research interest in fields such as virtual reality, HCI and computer vision. Depth sensors have enabled a plethora of new applications and methods for tracking the human hand that were once very difficult with 2D cameras alone. Another milestone was the arrival of predictive algorithms, as opposed to tracking based on images alone. This survey presents some of the recent types and methods used for 3D hand tracking, with a focus on vision-based tracking algorithms. We first review the conventional methods and then the newer methods built around predictive algorithms.

The human hand is delicate and has a very intricate structure. The various muscles and joints in the hand provide a great range of movement and precision, which allows many different poses and gestures. These gestures are important in communicating the thoughts of an individual and can be useful to interact with devices. Hence, 3D hand tracking is of great importance to HCI, and can also be used for artistic, medical, or scientific purposes, human sign language interpretation, and for many other tasks.

Hand gesture recognition is a subset of hand tracking in which users perform gestures that the system identifies and responds to accordingly [Song et al. 2015]. Such a system only recognizes the action performed by the user's hand, although it may also estimate the hand's position. Work of this type can be found in the hand gesture recognition literature, such as Cheng et al. [2016].

Parts of hand tracking

Hand tracking has three main parts: data acquisition, recognition and application. In the first part, data is collected from the user or environment for processing. This is where the sensors and prior data, such as public datasets, are used. The sensors provide the video feed (or the depth video) of the human hands, while datasets provide prior data for more accurate recognition and results during processing.

The second part is recognition, which involves analysing the data to arrive at the results. Here, the data received from the first part is processed to find the orientation and position of the hand (or hands). The algorithm used depends on the sensor used to acquire the data, and it must be able to deliver results in real time.

The third part is the application, which is the end goal of the whole process. One of the main goals of 3D tracking is its use in HCI, such as interacting with objects in virtual or augmented reality [Benko et al. 2012]. Another popular use is in gaming platforms, where the user performs actions and gestures with their hands to move through the different levels of a game.

Motivation

There are three main motivations for implementing 3D tracking of a complex system like the human hand. They are:
  • Tracking a deformable object with many degrees of freedom such as the human hand is still a very challenging and difficult problem to resolve.
  • Recent improvements in depth sensors open up new ways to develop efficient tracking methods and algorithms.
  • 3D hand tracking has a large influence on HCI. From a medical perspective, there are many applications where it can prove useful.

Problem Statement

The overall aim of 3D tracking is simple: to track the user's hand in 3D space precisely and accurately. But several subsidiary goals must be achieved to get the optimum effect. The hand must be tracked such that the rendered output is perceived as a true replication of the human hand. The requirements of hand tracking are the following:
  • On average, the human visual system perceives two images as separate if they are more than about 67 milliseconds apart (roughly 15 frames per second); frames arriving faster than this are perceived as continuous motion. The system must therefore provide an accurate position and orientation of the hand within this time for each frame, so the human eye perceives the result as continuous motion.
  • The rendered hand must not jitter in 3D space as this does not conform to the true movement of the hand. The movement must be smooth and continuous.
  • The rendered hand poses must conform to the true poses of the human hand, such as the realistic movement of the fingers (one single digit of a finger will not move 90 degrees forward).

Brief overview

The remainder of this article is organized in the following manner:

Data Acquisition: Explains the various methods of obtaining data from the user for processing, along with their advantages and disadvantages.

Recognition: Elucidates the different ways to get the position and orientation of the human hand in 3D space.

Recent Methods: Presents two recent algorithms along with evaluations against other state-of-the-art implementations.

Conclusion and Future Work: Concludes the article by summarizing the perspectives and methods derived from the study and suggesting how the human hand might best be tracked in 3D space with accurate, real-time results.


Data Acquisition

In the context of hand tracking, data is provided to the system in two ways:

  • Sensors
  • Datasets

Sensors

A sensor captures physical properties of the outside world and converts them into digital information that can be processed by electronic systems. For 3D hand tracking, sensors are mainly of three types.

  1. Mount based sensors
  2. Multi touch sensors
  3. Vision based sensors

Mount based sensors
Mount-based sensors are worn (or mounted) on the hand and provide data directly to the system. Examples are accelerometers and gyroscopes, which can track the relative orientation and position of the hand and each finger from a reference point [Prisacariu et al. 2012]. There are different schemes for placing the sensors on the hand, such as placing one on each digit of each finger, or placing them at strategic locations and calculating the remaining positions. This approach is highly accurate, capable of tracking the hands to sub-millimetre levels, but it can be uncomfortable to wear and prevents users from feeling the actual environment. It is also very expensive and cannot be afforded at a larger scale.

Figure 1: An example of a mounted sensor. Image courtesy [Guracak, 2016]

Multi-touch sensors

Figure 2: An example of a multi touch sensor. Image courtesy [Rosenberg, 2016]

Multi-touch screen sensors are commonly used in smartphones. They record the points of contact between the human hand and the device. Common examples include pinching to zoom a selection on screen, or dragging two fingers in parallel to scroll through a document. Although accurate at certain tasks, a disadvantage of this type is that it only tracks the fingertips (or whatever parts of the hand are in contact with the device) and does not track the position of the hand itself or its orientation in space. Hence this kind of sensor is useful only for gesture-based interactions and not for tracking the whole hand.


Vision based sensors

Vision-based sensors capture images in the form of frames and send them to the system. They are useful for tracking as they do not (for some types of sensors) require electronic devices to be worn on the hand. They also allow a larger distance between the user and the screen, unlike multi-touch sensors. They can be broadly classified into 2D sensors, such as common webcams, and 3D depth sensors such as the Kinect and Leap Motion. The former only record the image while the latter also record the depth spectrum of the image, which enables a 3D perspective that can be tracked more efficiently. However, the algorithms for this type of sensor are, in general, computationally expensive. To reduce the complexity, coloured gloves are sometimes used, which can be easily recognised by the system for tracking. However, this comes with the same disadvantage as mounted sensors: it interferes with the comfort of the user and their sense of touch.

 

Figure 3: Examples of vision based sensors from Sharp et al. [2015] and Sun et al. [2014]


Vision-based sensors also have variants that are mounted on the human hand or on other body parts such as the head. Sun et al. [2014] made a variant that uses a gaze-directed camera: a device worn on the head that follows the focus of the person wearing it. However, such variants suffer from the same disadvantages as mounted sensors regarding comfort and portability.

 

Datasets

As shown by the work of Tagliasacchi et al. [2015], using only the data provided by the sensors can be inaccurate, due to noise created by the sensor itself or by other external factors. The output can then even contain hand poses that are not plausible for a real human, such as a finger bent in the wrong direction. This kind of problem is tackled by the use of datasets. Such datasets contain a set of hand poses that are plausible for the human hand, along with a mapping that indicates which hand pose is likely to follow a given one. An example is the public dataset made by Schroder et al. [2014], which is used by Tagliasacchi et al. [2015] for their algorithm. This will be discussed in detail in a later section.


Recognition

Recognition is the process in which the data from the sensors is used to find the position and orientation of the hand in 3D space. The kind of algorithm is heavily dependent on the type of sensor used for the tracking. In this article, we focus on vision-based tracking and its variants, as they are robust and have advantages over the other sensors.

Vision-Based Tracking Techniques

Vision-based tracking finds an object in a frame (each successive image in a video) and follows the course of that object through the video. This kind of algorithm differs from vanilla object detection (or localization) in that information from the previous frame is used to find the object in the next frame. It is therefore more efficient and stable than locating the object by scanning the whole frame repeatedly. This used to be a difficult process, as many computations must be performed within just a few milliseconds for each frame.

The human eye can differentiate one image from another in a video if the time between frames exceeds about 67 ms. If the next frame arrives within this time, the eye treats it as a continuation of the previous image, i.e. a motion sequence. Hence an effective tracking algorithm must find and track the object in each frame within this time. With the advent of fast and capable graphics processing units (GPUs), this speed is now achievable.

Basic process of vision-based tracking

The overall process can be summarized as follows:

Step 1: The first step in tracking an object is initialization. In this step, the position of the object and the parameters of the algorithm are initialized. This may be done manually or automatically.
    
Step 2: On arrival of the next frame at time t, search for the object locally in a region R, given the position P of the object at time t-1.
    
Step 3: When the object is found, its new position becomes P.
    Steps 2 and 3 repeat until the end of the input video.
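A minimal sketch of this loop in Python is given below. The helpers initialize, region_around and local_search are hypothetical placeholders for the concrete detector, search window and matching step of a particular algorithm; they are passed in as arguments so the loop itself stays generic.

# Minimal sketch of the generic tracking loop described above. The three
# callables are hypothetical placeholders supplied by the chosen algorithm.
def track(frames, initialize, region_around, local_search):
    positions = []
    P = initialize(frames[0])            # Step 1: manual or automatic initialization
    positions.append(P)
    for frame in frames[1:]:             # Step 2: next frame arrives at time t
        R = region_around(P)             # search only locally, near the position at t-1
        P = local_search(frame, R)       # Step 3: the best local match becomes the new P
        positions.append(P)
    return positions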

Sometimes, when a small error appears in the output during tracking or at initialization itself, that error tends to propagate through each frame, accumulating and degrading the quality of the output. This phenomenon is called drifting. Resolving drifting is a challenge, since doing so automatically usually requires automatic reinitialization of the algorithm's parameters.

Classifications of vision-based tracking
According to Erol et al. [2007], hand tracking techniques can be divided into two broad categories:
  • Appearance based techniques
  • Model based techniques
1. Appearance based techniques
Also known as view-based techniques, these techniques process the video of the hand as a sequence of poses and classify them based on features derived from the images. Hence the quality and type of features extracted from the image greatly affect the accuracy of the output. A disadvantage of this approach is that, since it works only with 2D information, it cannot track the exact 3D location and pose of the hand. Hence it is only used for gesture-based interactions such as selecting items in a screen menu, sliding an image, playing music with arm actions, etc.

Skin segmentation is a well-known example of appearance-based tracking: the colour of the skin is used to segment the image and extract the hand for processing. This method is used by Poudel et al. [2013], Tang et al. [2015] and Sun et al. [2014].
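As an illustration, skin segmentation is commonly done by thresholding in a colour space such as YCrCb; the OpenCV sketch below is one simple variant, and the threshold values are only indicative assumptions that depend on lighting and skin tone.

import cv2
import numpy as np

# Rough skin segmentation in the YCrCb colour space. The threshold values
# are illustrative assumptions; real systems tune them or learn a skin model.
def skin_mask(bgr_frame):
    ycrcb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Remove speckles and fill small holes before extracting the hand region.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask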
2. Model based techniques
In this kind of technique, a 3D model of the human hand is created and then used to represent the position and orientation of the actual hand in 3D space. The model follows the kinematics and structural properties of the human hand. Most recent publications use a 3D model with 27 degrees of freedom (hereafter DOF). However, constraints can be enforced on these DOF based on the actual biomechanics of the hand, and the number of DOF can also be reduced depending on the application domain of the method.

 
Figure 4: A model with an X-ray of the actual human hand. The square box joint in the 3D model represents 6 DOF of the hand position and orientations. Black circles represent 2 DOF movements like abduction or spreading. White circles represent 1 DOF movement which is flexion. Image courtesy Poudel et al. [2013]
 
The human hand comprises 27 bones (see Figure 4). The size of each of these bones varies from individual to individual. The joints are named after the bones they connect. The types of joints and their degrees of freedom are described below:
 
  • Interphalangeal Joints (IP): These joints connect the phalanges of each finger. There are three types of phalanges, namely the distal, middle and proximal phalanges. Each IP joint has one DOF for flexion. In an average human hand, each of these joints can bend to only about 90 degrees at most, a limit that some tracking methods enforce to prevent infeasible hand poses.
  • Metacarpophalangeal Joints (MCP): These joints connect each finger with the palm. They have two DOF, one for abduction (spreading of the fingers) and one for flexion.
  • Trapeziometacarpal Joint (TM): This joint connects the thumb to the palm. Its movement differs from that of the other CMC joints and is hence difficult to model; it is usually modelled with 2 DOF.
  • Carpometacarpal Joints (CMC): This type of joint connects the metacarpals of the fingers with the wrist.

Model-based techniques create a 3D model of the hand, predict a pose for it, and then try to match the predicted model against the observed data. This is a computationally expensive step and hence a difficult task. Generally, the 3D hand parameters are estimated by edge or depth matching on each frame, as illustrated by the sketch below.
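To make the idea of depth matching concrete, the following sketch scores a candidate pose by comparing a rendered depth map of the hand model against the observed depth image. The renderer render_depth is a hypothetical placeholder for whatever model and rendering pipeline is used; this is only an illustrative error measure, not any particular paper's formulation.

import numpy as np

# Hypothetical sketch: score a candidate pose by per-pixel depth residuals.
# render_depth(model, pose) is assumed to return a depth image of the model
# in the same camera frame as the sensor, with invalid pixels set to zero.
def depth_fit_error(observed_depth, model, pose, render_depth):
    rendered = render_depth(model, pose)
    valid = (observed_depth > 0) & (rendered > 0)   # compare only overlapping pixels
    if not np.any(valid):
        return np.inf
    residual = observed_depth[valid] - rendered[valid]
    return float(np.mean(residual ** 2))            # lower means a better fit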

A clear example of this technique is the one from Sharp et al. [2015]. They used the output from a Kinect camera to render a 3D model of the hand and fit it to the data.

Some papers combine the 2D features from appearance-based tracking with the 3D features from model-based tracking into a unified framework that tracks the hand with relatively smaller errors. Sridhar et al. [2013], Sun et al. [2014] and El-Sawah et al. [2008] used features derived from the 2D data to initialize the position of the 3D model and track the hand.

Recent Methods
In this section we compare two of the latest algorithms, namely the works of Tagliasacchi et al. [2015] and Taylor et al. [2016], along with quantitative analysis.
Robust Articulated ICP
Tagliasacchi et al. [2015] presented an algorithm for tracking the hand in real time using a single depth camera. It is based on fitting a 3D model of the hand to a series of depth images. They formulate an optimization problem involving six different energy functions, which are described in detail below.
Overview
 
Figure 5: Overview of the algorithm used by Tagliasacchi et al. [2015]. For each frame, a 3D point cloud is extracted and a 2D silhouette is then made from it. From this data, a cylindrical model of the hand is then aligned to fit it.
 

An overall view of the algorithm is shown in Figure 5. It begins by retrieving tracking data from the user in the form of depth images from RGBD cameras such as the Intel RealSense or the Kinect. The 2D and 3D correspondences of the images are then sent to the solver, which fits a 3D articulated model of the hand to the data. This process is fast, and hence the system can run in real time to track the hands as they move.
Data acquisition
The input is a video from a single 60 fps (frames per second) RGBD depth camera. Tagliasacchi et al. [2015] used two different sensors to examine the different types of output that sensors can produce, namely the Creative Interactive Gesture Camera and the PrimeSense Carmine. Their outputs are shown in Figure 6; the top row is from the former sensor and the bottom row from the latter. The output from the Creative Interactive Gesture Camera shows a clear 2D projection, but the point cloud is very noisy. On the other hand, the output from the PrimeSense camera has a smooth point cloud, but the 2D projection has some gaps.
 
Figure 6: Outputs from the Creative Gesture Camera and the PrimeSense Carmine. The point cloud and the silhouette are denoted as χs and Ss respectively
Tracking model
The model used in this work is a 26 DOF model, as seen in Figure 7. From left to right, the first image is the cylindrical model used for tracking; the second is the skeletal structure showing the kinematic parameters; the third is the BVH skeleton exported to Maya to drive the rendering, in which the separate DOF of each joint can be seen. The total number of DOF is 26, lower than that of the actual human hand, with the thumb given 2 DOF instead of the conventional 3. Finally, the last image is the hand model rendered in Maya.
Figure 7: The template hand model used by Tagliasacchi et al. [2015].
 
Preprocessing
In this stage, the 2D projection from the RGBD sensor is calculated. First, the region of interest (ROI) of the hand is extracted; a wristband is worn on the wrist for this purpose. The wristband is separated from the frame using colour segmentation, and a principal axis is then computed with PCA. This axis is used to obtain the ROI of the hand. The pixels of the point cloud within this region are then used to create a silhouette image Ss.
 
Figure 8: Identification of ROI of hand by Tagliasacchi et al. [2015]. First the position of the wristband is found by colour segmentation, then the orientation of the forearm in 3D space is found by using PCA. From this orientation, the position of the hand can be found by using an offset. From this ROI, the 2D silhouette and the point cloud data can be computed and is shown as well
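A minimal sketch of this preprocessing step is given below, assuming a colour mask for the wristband is already available as a set of 3D points. The PCA of the band's points gives the forearm axis, and the hand region is taken at a fixed offset along that axis; the offset value and the function name are illustrative assumptions, not the paper's actual parameters.

import numpy as np

# Hypothetical sketch of the ROI step: PCA on the wristband's 3D points gives
# the forearm axis, and the hand centre is assumed to lie a fixed offset (in
# metres, chosen arbitrarily here) along that axis.
def hand_roi_centre(wristband_points_3d, offset=0.12):
    centre = wristband_points_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(wristband_points_3d - centre, full_matrices=False)
    axis = vt[0]                        # principal axis of the band (forearm direction)
    return centre + offset * axis       # step along the axis towards the hand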
 
Optimization
The optimization problem is to minimize six different energy functions, which fall into two major categories: fitting terms and prior terms.

Let F be the input data coming from the sensor, consisting of a 3D point cloud χs and a 2D silhouette Ss. Given a 3D hand model M with joint parameters θ = {θ1, θ2, ..., θ26}, we aim to recover the pose θ of the user's hand that matches the sensor input data F.


The total energy being minimized is the sum of the individual terms,

    E(θ) = E3D + E2D + Ewrist + Epose + Ekinematic + Etemporal

where E3D, E2D and Ewrist are the fitting terms and Epose, Ekinematic and Etemporal are the prior terms.

1. Point cloud Alignment
    
    The 3D energy term E3D is calculated in the spirit of ICP as
    
        E3D = ω1 Σx∈χs ‖x − πM(x, θ)‖²
    
    where x represents a 3D point of χs, ‖.‖ denotes the l2 norm, πM(x, θ) is the projection of x onto the hand model M with hand pose θ, and ω1 is the weight given to this energy (the same convention for the weights ωi applies to the further energy functions as well). The correspondences are chosen as the closest points on the front-facing part of M, which helps prevent local optima, as seen in Figure 9.
    
 
Figure 9: Example of the correspondence computations. The circles represent the fingers of the human hand and the black dots are the depth map output. Part 9a shows the conventional method, which works in this scenario. However, in part 9b the method fails since the closest point converges to a local optimum. This problem is resolved by the new method, part 9c, which takes the front-facing points into account.
   
2. Silhouette alignment
    
    The 2D silhouette energy term E2D is calculated from the sensor data as
    
        E2D = ω2 Σp∈Sr ‖p − πSs(p, θ)‖²
    
    Here p is a 2D point of the rendered silhouette Sr and πSs(p, θ) is the projection of p onto the sensor silhouette Ss. The effect of this energy function is shown in Figure 10; if it is not used, the positions of occluded fingers are prone to errors.
 
 
 
Figure 10: Illustration of the effect of the 2D energy function.
   
3. Wrist alignment
    
    Minimizing this term ensures that the hand does not twist and turn around the wrist joint while tracking. It penalizes the deviation of k0(θ), the 3D position of the wrist joint, from the line l extracted by computing the PCA of the wristband (see Figure 8).
   
4. Pose Space prior (data-driven)
    
    As discussed in the Datasets section, using only the data provided by the sensors can be inaccurate, due to noise created by the sensor itself or by other external factors, and the output can even contain hand poses that are not plausible for a real human, such as a finger bent in the wrong direction. Such problems are tackled by using a pose dataset; here the public dataset made by Schroder et al. [2014] is used. The energy functions are
    
    where μ is the PCA mean and the matrix πP, i.e. the PCA basis, reconstructs the hand posture from the low-dimensional space. To avoid unlikely hand poses in the subspace, the PCA weights θ~ are regularized by another function, in which
    
    Σ is a diagonal matrix containing the inverses of the standard deviations of the PCA basis.
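    As a numerical illustration of such a data-driven prior, the sketch below assumes that a PCA mean, basis and per-component standard deviations have already been learned from a dataset of plausible poses; the variable names and the way the two penalties are combined are assumptions for illustration only.

import numpy as np

# Hypothetical sketch of a PCA pose prior: project the joint angles into a
# learned low-dimensional subspace and penalize unlikely subspace weights.
# mu (mean pose), basis (columns = PCA directions) and stddev are assumed to
# come from a dataset of plausible hand poses such as Schroder et al. [2014].
def pose_prior_energy(theta, mu, basis, stddev, w_pose=1.0):
    theta_tilde = basis.T @ (theta - mu)              # low-dimensional PCA weights
    reconstruction = mu + basis @ theta_tilde         # pose reconstructed from the subspace
    e_subspace = np.sum((theta - reconstruction) ** 2)
    e_weights = np.sum((theta_tilde / stddev) ** 2)   # regularize the PCA weights
    return w_pose * (e_subspace + e_weights)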
    
5. Kinematic prior
    
    The PCA model alone is not enough to obtain proper poses of the human hand, since the PCA output is symmetric around the mean, which can result in fingers bending in the opposite direction. In addition, the model cannot account for self-occluded fingers. Hence a kinematic prior is used,
    
    where the function d(ci, cj) is the Euclidean distance between the two cylinders ci and cj, r is the sum of the radii of the cylinders, and Χ(i, j) is an indicator function that evaluates to one if the cylinders i and j are colliding and to zero otherwise.
    
    To prevent the hand from reaching an impossible posture by overbending the joints, we limit the joint angles of the hand model:
    
    where each hand joint is associated with conservative bounds (the θ values) and indicator functions (Χ).
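    The two penalties described above could be sketched as follows; the cylinder collision test is simplified here to a distance check between segment centres, and the joint bounds are assumed to be given, so this is only an illustrative approximation of the prior.

import numpy as np

# Hypothetical sketch of the kinematic prior: penalize colliding finger
# cylinders (approximated by their centres and radii) and joint angles that
# leave their anatomical bounds.
def kinematic_energy(centres, radii, pairs, theta, theta_min, theta_max):
    e_collision = 0.0
    for i, j in pairs:                        # pairs of cylinders to test
        d = np.linalg.norm(centres[i] - centres[j])
        r = radii[i] + radii[j]
        if d < r:                             # indicator: the cylinders collide
            e_collision += (r - d) ** 2
    # Penalize joint angles that fall outside [theta_min, theta_max].
    e_limits = np.sum(np.maximum(0.0, theta_min - theta) ** 2 +
                      np.maximum(0.0, theta - theta_max) ** 2)
    return e_collision + e_limits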
    
6. Temporal prior
    
    This energy function is used to address the jitter problem by enforcing a smoothness on the transition of hand poses.
   
    where the dotted terms denote the first and second temporal derivatives (velocity and acceleration) of the joint positions of the individual fingers and thumb (see k1, k2, k3 in Figure 7).
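    One simple way to realize such a smoothness term, sketched under the assumption that the joint positions of the last three frames are available, is to penalize their finite-difference velocity and acceleration; the weight is an arbitrary illustrative value.

import numpy as np

# Hypothetical sketch of a temporal smoothness term using finite differences
# over the joint positions of the current and the two previous frames.
def temporal_energy(k_t, k_prev, k_prev2, w_temporal=1.0):
    velocity = k_t - k_prev                      # first difference (velocity)
    acceleration = k_t - 2 * k_prev + k_prev2    # second difference (acceleration)
    return w_temporal * (np.sum(velocity ** 2) + np.sum(acceleration ** 2))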
Evaluation
The evaluation was done on the Dexter-1 dataset, which shows that the algorithm performs better than the other specified algorithms, namely Tang et al. [2015] and Sridhar et al. [2013]. The results are shown in Figure 11.
 
Figure 11: Report of the evaluation of Tagliasacchi et al. [2015] against Tang et al. [2015] and Sridhar et al. [2013]. The measurements report the root mean square errors of fingertip placements. The green, blue and purple bars show the algorithms of the other authors, while the red and orange bars show the algorithm of Tagliasacchi et al. [2015] with and without reinitialization respectively.
 
Limitations
Although the algorithm runs well with positive results, it has a flaw in that it relies heavily on features of the hand for tracking the 3D model. This means that poses with fewer features, such as a clenched fist, are not tracked well by the system. Another flaw is that the algorithm depends on the sensor: when a lower-end sensor is used, the output of the 3D tracking is less accurate and contains errors.
Joint, Continuous Optimization of Pose and Correspondences
Taylor et al. [2016] worked on a tracking model with a similar flavour to Tagliasacchi et al. [2015], but made several changes to the model fitting, namely an objective function with more discriminative properties, a model with a smooth surface that provides the gradients required for non-linear optimization, and correspondences between the observed data and the model surface that are optimized jointly with the pose.
 
Overall
Given a sensor that provides a stream of depth images, the frames are first preprocessed by means of hand segmentation and fingertip detection. Starting poses are then generated by reinitialization and by temporal prediction, and the energy is minimized from each starting pose, reporting the pose with the least energy.
Energy Function
Taylor et al. [2016] define a smooth energy function, optimized using the Gauss-Newton method, as a weighted sum of different terms that together ensure an efficient and plausible fit. These terms are:

    data: The data from the point cloud xn must match the rendered surface points Sθ.
    bg: Points from the model must not blend with the background.
    pose: The retrieved pose θ must be a plausible pose.
    limit: The retrieved pose θ must follow the bounds set by an actual human finger joint.
    temp: The movement of the hand must not be jittery and temporal data must be smooth.
    int: The hand model must not collide with itself.
    tips: The fingertip of the model hand must match the fingertips derived from the input data.

Hence the total function is a weighted sum over the terms,

    E(θ) = Σterm λterm Eterm(θ)

where θ is the pose of the hand and the sum runs over each of the terms discussed above (data, bg, pose, limit, temp, int and tips), each with its own weight λterm.
 
Figure 12: Cross-section of the fingers (green) and the data points (orange) to be matched. Without using the normals, the update causes the wrong finger to be selected (leftmost image). This problem is solved by using the normal term (centre image). The same applies to a planar model (rightmost image).
data: Data Term
    
    This compares the data points from the input cloud xn to the surface points Sθ
   
    where σx², σn² are estimates of the noise variance on the points and normals. The normal term allows the energy to use surface orientation to select better locations on the model even when they are far from the data. See Figure 12.
    
bg: Background Penetration Penalty
    
    The data term does not prevent the model from entering the background; hence the bg term is used.
   
    where π is the projection to the image plane and D is the image-space distance transform.
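    As an illustration of how such a term might be evaluated, the sketch below builds an image-space distance transform of the hand silhouette with OpenCV and penalizes model points whose projections fall outside it. The projection function and the silhouette format are assumptions made for this sketch, not the paper's implementation.

import cv2
import numpy as np

# Hypothetical sketch of a background penalty. silhouette_mask is assumed to
# be a uint8 image with 255 inside the hand; model points projecting outside
# the silhouette are penalized by their pixel distance to the nearest hand pixel.
def background_penalty(model_points_3d, silhouette_mask, project_to_image):
    dist = cv2.distanceTransform(255 - silhouette_mask, cv2.DIST_L2, 3)
    h, w = silhouette_mask.shape
    penalty = 0.0
    for point in model_points_3d:
        u, v = project_to_image(point)            # hypothetical pinhole projection
        u, v = int(round(u)), int(round(v))
        if 0 <= u < w and 0 <= v < h:
            penalty += float(dist[v, u]) ** 2     # zero whenever the point lands inside
    return penalty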
    
pose: Pose Prior
    
    The pose term is used so that the joints take plausible values when occlusion occurs in the data. It uses a multivariate Gaussian distribution with mean pose μ and covariance matrix Σ.
   
    
limit: Joint Limit Constraints
    
    This term prevents the fingers from bending beyond the capabilities of an actual human hand. It uses vectors of joint-angle minima and maxima as bounds.
   
    where ε(a, x, b) = max(0, a-x) + max(x-b, 0)
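    The ε penalty above is just a two-sided hinge: zero when the angle lies within its bounds and growing linearly outside them. A one-line sketch:

# Two-sided hinge used in the joint-limit term: zero for a <= x <= b,
# linear outside the bounds.
def hinge(a, x, b):
    return max(0.0, a - x) + max(0.0, x - b)

# e.g. hinge(0, -5, 90) == 5  (the joint is 5 degrees past its lower bound)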
    
temp: Temporal Prior
    
    This term ensures temporal consistency and hence smooth movement of the hand: the pose in one frame should be near the pose in the next frame.
   
    
int: Self-Intersection Penalty
    
    In order to discourage finger self-intersection, this term is used.
   
    where hst(θ) measures the amount of penetration between spheres s and t. P contains pairs of spheres from {S1, S2, ...} that are not directly adjacent to each other, i.e. not in each other's neighbourhood, and ci is the centre of sphere Si.
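    A sketch of one common way to measure the penetration between two spheres, each given by a centre and a radius, is shown below; the exact form used in the paper may differ.

import numpy as np

# Hypothetical sketch of sphere-sphere penetration: positive when the spheres
# overlap, zero otherwise. Squaring and weighting are left to the energy term.
def penetration(c_s, r_s, c_t, r_t):
    gap = np.linalg.norm(np.asarray(c_s) - np.asarray(c_t)) - (r_s + r_t)
    return max(0.0, -gap)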
    
tips: Fingertip Term
    
    The fingertips are matched with the following function
   
    where sf is the vector of the corresponding values for fingertip f, and softmin is a differentiable relaxation of the minimum operator.
 
Evaluation
The algorithm was compared directly with Tagliasacchi et al. [2015], Tang et al. [2015] and Sridhar et al. [2013], as well as with other state-of-the-art algorithms. Accuracy is calculated by reporting the number of frames for which either the average or the maximum marker position error was less than a certain threshold.
Figure 13: Report of the evaluation of Taylor et al. [2016] against Tagliasacchi et al. [2015], Tang et al. [2015], Oberweger et al. [2015], Tompson et al. [2014] and Sridhar et al. [2013]. The dataset used is the DEXTER dataset.

Conclusion and future work
In this article, we have given a detailed survey of conventional as well as recent progress on 3D hand tracking. We discussed a variety of methods used to track the human hand, along with their advantages and disadvantages. The survey covered the different types of sensors used, the model types, and the vision-based algorithms. The major problem remains tracking the hand in real time with efficient algorithms that do not require heavy computation and expensive resources. Two recent papers were examined in detail, and neither provides a clear answer to the occlusion problem or to changing backgrounds. These models also do not take human psychology into account in their design or implementation. In future work, the psychophysical parameters related to human acuity and hand movement could be taken into account to develop an algorithm that tracks the hand well enough for a person to perceive it as their own hand in a virtual environment.
 

About the Author

Author: Joseph Isaac

 

About me: I am a PhD scholar at the Indian Institute of Technology Madras. I like to write about Apple products, virtual reality, 3D hand tracking and other technology-related topics. I would love to resolve your queries regarding this article; please leave your questions and suggestions in the comments section below and I will reply to them as soon as possible.

 

 

 

