Recognition is the process in which the data from the sensors are used to find the position and orientation of the hand in 3D space. The choice of algorithm depends heavily on the type of sensor used for tracking. In this article, we focus on vision-based tracking and its variants, as they are robust and have advantages over the other sensor types.
Vision-Based Tracking Techniques
Vision-based tracking finds an object in a frame (each successive image in a video) and follows the course of that object through the video. This differs from plain object detection (or localization) in that information from the previous frame is used to find the object in the next frame; it is therefore more efficient and stable than locating the object by scanning the whole frame repeatedly. This used to be a difficult process, as many computations must be performed in just a few milliseconds for each frame.
The human eye can distinguish one image from the next in a video if the time between frames is about 67 ms (roughly 15 frames per second). If the next frame arrives within this time, the eye treats it as a continuation of the previous image, i.e. a motion sequence. Hence an effective tracking algorithm must find and track the object in each frame within this time budget. With the advent of fast and capable graphics processing units (GPUs), this speed is now achievable.
Basic process of vision-based tracking
The overall process can be summarized as the following:
Step 1: The first step for tracking an object is initialization. In this step, the position of the object and the parameters for the algorithm are initialized. The method may be manual or automatic.
Step 2: On arrival of the next frame at time t, search for the object locally in a region R, given the position P of the object at time t-1.
Step 3: When found, the new position of the object becomes P.
Steps 2 and 3 repeat until the end of the input video.
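The loop above can be sketched as a minimal local-search tracker. This is an illustrative implementation, not any particular paper's method: it assumes a fixed appearance template for the object and matches it by sum of squared differences within a small search radius around the previous position.

```python
import numpy as np

def track(frames, init_pos, template, search_radius=8):
    """Track a template through frames by searching locally around the
    previous position (hypothetical SSD matcher, for illustration only)."""
    h, w = template.shape
    pos = init_pos          # Step 1: (row, col) of the object, given at init
    path = [pos]
    for frame in frames:    # Step 2: search region R around previous P
        best, best_cost = pos, np.inf
        r0, c0 = pos
        for r in range(max(0, r0 - search_radius),
                       min(frame.shape[0] - h, r0 + search_radius) + 1):
            for c in range(max(0, c0 - search_radius),
                           min(frame.shape[1] - w, c0 + search_radius) + 1):
                cost = np.sum((frame[r:r+h, c:c+w].astype(float) - template) ** 2)
                if cost < best_cost:
                    best, best_cost = (r, c), cost
        pos = best          # Step 3: the best match becomes the new P
        path.append(pos)
    return path
```

Because the search is restricted to region R rather than the whole frame, the per-frame cost stays small, which is exactly the efficiency argument made above.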
Sometimes during tracking, or at initialization itself, a small error in the output tends to propagate through each frame, accumulating and degrading the quality of the result. This phenomenon is called drifting. Resolving drift is a challenge, since doing so automatically usually requires automatic re-initialization of the algorithm's parameters.
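One common way to mitigate drift, sketched below under assumptions not stated in the text, is to monitor a confidence score from the tracker and fall back to a full-frame detector to re-initialize when confidence drops. Both `detector` and `tracker` are hypothetical callables standing in for whatever detection and tracking modules a system uses.

```python
def track_with_reinit(frames, detector, tracker, min_confidence=0.5):
    """Drift-mitigation sketch: re-initialize from a global detector
    whenever the local tracker's confidence falls below a threshold."""
    pos = detector(frames[0])            # Step 1: automatic initialization
    path = [pos]
    for frame in frames[1:]:
        pos, confidence = tracker(frame, pos)
        if confidence < min_confidence:  # likely drift: accumulated error
            pos = detector(frame)        # expensive global re-detection
        path.append(pos)
    return path
```

The detector is only invoked when needed, so the average per-frame cost remains close to that of pure local tracking.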
Vision-based tracking can be divided into two broad techniques:
- Appearance based techniques
- Model based techniques
Skin segmentation is a well-known example of appearance-based tracking: the colour of the skin is used to segment the image and extract the hand for processing. This method is used by Poudel et al., Tang et al. and Sun et al.
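A minimal skin-segmentation sketch, assuming the image is already in YCrCb colour space: pixels are classified as skin by fixed thresholds on the Cr and Cb channels. The threshold values below are approximate ranges often cited in the literature, not those of the papers mentioned above; practical systems usually learn them from data.

```python
import numpy as np

def skin_mask(ycrcb_image):
    """Binary skin mask from fixed YCrCb thresholds (approximate,
    commonly cited ranges; illustration only)."""
    cr = ycrcb_image[..., 1]
    cb = ycrcb_image[..., 2]
    return (cr >= 133) & (cr <= 173) & (cb >= 77) & (cb <= 127)
```

The resulting binary mask can then be cleaned up (e.g. with morphological operations) and its largest connected component taken as the hand region.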
- Interphalangeal Joints (IP): These joints connect the phalanges of each finger. There are three types of phalanges, namely the distal, middle and proximal phalanges. Each IP joint has one DOF, for flexion. In an average human hand, each of these joints can bend to only about 90 degrees at most, a limit some tracking methods enforce to rule out infeasible hand poses.
- Metacarpophalangeal Joints (MCP): This joint connects each finger with the palm. It has two DOF: one for abduction (spreading of the fingers) and one for flexion.
- Trapeziometacarpal Joints (TM): This joint connects the thumb to the palm. Its movement differs from that of the CMC joints and is hence difficult to model; it is usually modelled with 2 DOF.
- Carpometacarpal Joints (CMC): These joints connect the metacarpals of the fingers with the wrist.
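The anatomical limits above are typically applied by clamping a hand-pose parameter vector to per-DOF bounds, for example restricting IP flexion to [0, 90] degrees. A minimal sketch (the representation as a flat list of angles with `(lo, hi)` limits is an assumption for illustration):

```python
import math

def clamp_joint_angles(angles, limits):
    """Clamp each DOF of a hand-pose parameter vector to its anatomical
    range, e.g. IP flexion to [0, pi/2] (~90 degrees) as noted above.
    `limits` is a list of (lo, hi) pairs, one per DOF."""
    return [min(max(a, lo), hi) for a, (lo, hi) in zip(angles, limits)]
```

Applying such constraints after each estimation step keeps the tracker from drifting into hand configurations that are physically impossible.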
The model-based techniques create a 3D model of the hand and predict an orientation for it, then try to match the predicted model to the observed data. This is a computationally expensive step and hence a difficult task. Generally, the 3D hand parameters are estimated by edge or depth matching on each frame.
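The depth-matching step can be expressed as a cost function that a model-based tracker minimizes over the hand parameters. The sketch below assumes both the rendered model and the sensor supply depth maps, with 0 marking pixels that have no reading; the specific objective (mean squared depth difference) is an illustrative choice, not a particular paper's formulation.

```python
import numpy as np

def depth_matching_cost(rendered_depth, observed_depth):
    """Model-fitting objective sketch: mean squared difference between the
    depth map rendered from a hypothesized 3D hand pose and the observed
    depth map, over pixels valid in both (0 = no reading)."""
    valid = (rendered_depth > 0) & (observed_depth > 0)
    if not valid.any():
        return np.inf
    diff = rendered_depth[valid] - observed_depth[valid]
    return float(np.mean(diff ** 2))
```

A tracker would evaluate this cost for many candidate poses per frame, which is why model-based methods are considerably more expensive than appearance-based ones.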
A clear example of this technique is that of Sharp et al. They used the output of a Kinect camera to render a 3D model of the hand and fit it to the data.
Some papers combine the 2D features from appearance-based tracking with the 3D features from model-based tracking, fusing them into a unified framework that tracks the hand with relatively small error. Sridhar et al., Sun et al. and El-Sawah et al. used features derived from the 2D model to initialize the position of the 3D model and track the hand.
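One way such a 2D-to-3D hand-off can work, sketched here under assumptions not given in the text (a depth reading at the hand centroid and known pinhole camera intrinsics), is to back-project the 2D centroid found by segmentation into a 3D point that initializes the position of the 3D hand model:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Hybrid-tracking sketch: lift a 2D hand centroid (u, v), e.g. from a
    skin-segmentation mask, to a 3D point via the pinhole camera model.
    fx, fy are focal lengths and (cx, cy) the principal point, all assumed
    known from camera calibration."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

The model-based stage then refines orientation and finger articulation from this coarse 3D position, so the expensive fitting never has to search the full pose space from scratch.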