

Recognition is the process in which data from the sensors are used to find the position and orientation of the hand in 3D space. The choice of algorithm depends heavily on the type of sensor used for tracking. In this article, we will focus on vision-based tracking and its types, as these methods are robust and have advantages over other sensors.

Vision-Based Tracking Techniques

Vision-based tracking finds an object in a frame (each successive image in a video) and follows the course of that object through the video. This kind of algorithm differs from vanilla object detection (or localization) in that information from the previous frame is used to find the object in the next frame. It is therefore more efficient and stable than locating the object by repeatedly scanning the whole frame. This used to be a difficult process, as many computations must be performed in just a few milliseconds for each frame.

The human eye can differentiate one image from another in a video if the time between frames is about 67 ms. If the next frame arrives within this time, the eye treats it as a continuation of the previous image, i.e. a motion sequence. An effective tracking algorithm must therefore find and track the object in each frame within this time budget. With the advent of fast and capable graphics processing units (GPUs), this speed is now achievable.

Basic process of vision-based tracking

The overall process can be summarized as the following:

Step 1: The first step for tracking an object is initialization. In this step, the position of the object and the parameters of the algorithm are initialized. Initialization may be manual or automatic.
Step 2: On arrival of the next frame at time t, search for the object locally in a region R, given the position P of the object at time t-1.
Step 3: When found, the new position of the object becomes P.
    Steps 2 and 3 repeat until the end of the input video.
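The loop above can be sketched as a local search around the previous position. The sketch below is a minimal, hypothetical implementation using a sum-of-squared-differences (SSD) match; the patch size, search radius and matching cost are illustrative choices, not a specific published method:

```python
import numpy as np

def track(frames, template, init_pos, radius=5):
    """Track a small template through grayscale frames by local SSD search.

    frames:   list of 2D float arrays (the video)
    template: small 2D patch of the object (initialization, Step 1)
    init_pos: (row, col) of the patch's top-left corner in the first frame
    """
    h, w = template.shape
    pos = init_pos
    path = [pos]
    for frame in frames[1:]:                       # Step 2: next frame at time t
        r0, c0 = pos
        best, best_err = pos, np.inf
        # search only in a region R around the position at time t-1
        for r in range(max(0, r0 - radius), min(frame.shape[0] - h, r0 + radius) + 1):
            for c in range(max(0, c0 - radius), min(frame.shape[1] - w, c0 + radius) + 1):
                err = np.sum((frame[r:r + h, c:c + w] - template) ** 2)
                if err < best_err:
                    best, best_err = (r, c), err
        pos = best                                  # Step 3: the match becomes the new P
        path.append(pos)
    return path
```

Because the search is restricted to the region R, each frame costs far less than scanning the whole image, which is exactly the efficiency argument made above.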

Sometimes, during tracking or at initialization itself, a small error in the output tends to propagate through each frame, accumulating and degrading the quality of the result. This phenomenon is called drifting. Resolving drift is a challenge, since doing it automatically usually requires re-initializing the parameters of the algorithm.
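The effect of drift and of re-initialization can be illustrated with a toy simulation (this is not a real tracker, just a numerical picture of error accumulation; the threshold-based reset is one hypothetical re-initialization policy):

```python
def simulate_drift(n_frames, per_frame_error, reinit_threshold=None):
    """Accumulate a small per-frame localisation error over time.

    Without re-initialization the error grows without bound; with a
    threshold, the accumulated error is periodically reset to zero,
    mimicking automatic re-initialization of the tracker parameters.
    """
    errors, total = [], 0.0
    for _ in range(n_frames):
        total += per_frame_error          # the error propagates frame to frame
        if reinit_threshold is not None and total >= reinit_threshold:
            total = 0.0                   # re-initialization resets the drift
        errors.append(total)
    return errors
```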

Classifications of vision-based tracking
According to Erol et al. [2007], hand tracking techniques can be divided into two broad categories:
  • Appearance based techniques
  • Model based techniques
1. Appearance based techniques
Also known as view-based techniques, these process the video of the hand as a sequence of poses and classify them based on features derived from the images. Hence the quality and type of features extracted from the image greatly affect the accuracy of the output. A disadvantage of this approach is that it works only with 2D models and so cannot recover the exact location and pose of the hand in 3D. It is therefore used mainly for gesture-based interactions such as selecting items on a screen menu, sliding an image, playing music with actions of the arm, etc.

Skin segmentation is a well-known example of appearance-based tracking and it is the process in which the colour of the skin is used to perform segmentation of the image and extract the hand for processing. This method is used by Poudel et al. [2013], Tang et al. [2015] and by Sun et al. [2014]. 
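A simple flavour of skin segmentation can be shown with an explicit colour rule. The sketch below uses a classic textbook RGB heuristic; it is not the specific classifier used by the cited papers, and the numeric thresholds are the commonly quoted illustrative values:

```python
import numpy as np

def skin_mask(rgb):
    """Rule-based skin segmentation on an RGB image (H x W x 3, uint8).

    Returns a boolean mask that is True where a pixel's colour falls
    inside a hand-tuned 'skin' region of RGB space. The hand region
    can then be extracted as the connected skin-coloured blob.
    """
    rgb = rgb.astype(np.int32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    return ((r > 95) & (g > 40) & (b > 20) &   # bright enough in each channel
            (mx - mn > 15) &                   # not grayscale
            (np.abs(r - g) > 15) &             # red clearly dominates green
            (r > g) & (r > b))                 # red is the dominant channel
```

Such fixed rules are fast but fragile under lighting changes, which is one reason the cited works use learned skin models rather than a hard-coded threshold.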
2. Model based techniques
In this kind of technique, a 3D model of the human hand is created and then used to represent the position and orientation of the actual hand in 3D space. The model follows the kinematics and structural properties of the human hand. Most recent publications use a 3D model with 27 degrees of freedom (hereafter called DOF). However, constraints can be enforced on these DOF based on the actual biomechanics of the hand, and the number of DOF can be reduced depending on the application domain of the method.

Figure 4: A model with an X-ray of the actual human hand. The square box joint in the 3D model represents 6 DOF of the hand position and orientations. Black circles represent 2 DOF movements like abduction or spreading. White circles represent 1 DOF movement which is flexion. Image courtesy Poudel et al. [2013]
The human hand comprises 27 bones (see Figure 4). The size of each bone varies from individual to individual. The joints are named after the bones they connect. The types of joints and their degrees of freedom are described below:
  • Interphalangeal Joints (IP): These joints connect the finger phalanges. There are three types of phalanges, namely the distal, middle and proximal phalanges. Each IP joint has one DOF for flexion. In an average human hand, each of these joints can bend to only about 90 degrees at maximum, a constraint some tracking methods enforce to prevent infeasible hand poses.
  • Metacarpophalangeal Joints (MCP): These joints connect each finger to the palm. Each has two DOF: one for abduction, or spreading of the fingers, and one for flexion.
  • Carpometacarpal Joints (CMC): These joints connect the metacarpals of the fingers with the wrist.
  • Trapeziometacarpal Joint (TM): This joint connects the thumb to the palm. It has a different form of movement from the CMC joints and is hence difficult to model; it is usually modelled with 2 DOF.
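The 27-DOF breakdown implied by these joints can be made concrete in code. The structure below is one common accounting (6 global DOF plus 21 articulation DOF); the names and the 90-degree flexion limit follow the joint descriptions above, but the exact limits are illustrative, not taken from a specific paper:

```python
# One common 27-DOF hand model: 6 global DOF (wrist position and
# orientation) plus per-joint articulation DOF as described above.
HAND_DOF = {
    "global": 6,                                # 3 position + 3 orientation
    "thumb":  {"TM": 2, "MCP": 2, "IP": 1},     # 5 DOF for the thumb
    "finger": {"MCP": 2, "PIP": 1, "DIP": 1},   # 4 DOF per finger, x 4 fingers
}

def total_dof(model=HAND_DOF):
    """Sum the degrees of freedom of the model."""
    return (model["global"]
            + sum(model["thumb"].values())
            + 4 * sum(model["finger"].values()))

def clamp_flexion(angle_deg, lo=0.0, hi=90.0):
    """Enforce an anatomical constraint: IP flexion limited to ~90 degrees.

    Applying such clamps to each DOF is how a tracker rules out
    infeasible hand poses before matching the model to the data.
    """
    return max(lo, min(hi, angle_deg))
```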

The model-based techniques create a 3D model of the hand, predict a pose for it, and then try to match the predicted model against the observed data. This matching is computationally expensive and hence a difficult task. Generally, the 3D hand parameters are estimated by matching edges or depth data in each frame.
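The predict-and-match idea can be reduced to a one-joint toy example: "render" a fingertip position from a candidate joint angle, then choose the angle that minimises the distance to the observed fingertip. This brute-force search is a stand-in for the per-frame edge/depth matching step, not an actual published fitting algorithm:

```python
import numpy as np

def fingertip(angle, length=1.0):
    """Toy forward kinematics: 'render' the fingertip of a one-joint
    finger of the given length at the given flexion angle (radians)."""
    return np.array([length * np.cos(angle), length * np.sin(angle)])

def fit_angle(observed, n_steps=1000):
    """Fit the model to the observed fingertip position.

    Evaluates the squared matching error over a grid of candidate
    angles in [0, pi/2] and returns the best one. Real systems search
    a 27-dimensional pose space with far smarter optimisers.
    """
    angles = np.linspace(0.0, np.pi / 2, n_steps)
    errors = [np.sum((fingertip(a) - observed) ** 2) for a in angles]
    return angles[int(np.argmin(errors))]
```

The difficulty mentioned above comes from scaling this idea up: with 27 DOF the search space is enormous, so each frame's fit must start from the previous frame's pose and refine locally.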

A clear example of this technique is the one from Sharp et al. [2015]. They used the output from a Kinect camera to render a 3D model of the hand and fit it to the data.

Some papers combine the 2D features of appearance-based tracking with the 3D features of model-based tracking into a unified framework that tracks the hand with relatively small error. Sridhar et al. [2013], Sun et al. [2014] and El-Sawah et al. [2008] used features derived from the 2D model to initialize the position of the 3D model and track the hand.