In the context of hand tracking, data is provided to the system in two ways: through sensors that capture the hand, and through datasets of plausible hand poses that help correct the sensor output.
A sensor measures physical properties of the outside world and converts them into digital information that electronic systems can process. For 3D hand tracking, sensors are mainly of three types:
- Mount-based sensors
- Multi-touch sensors
- Vision-based sensors
Mount-based sensors
Mount-based sensors are worn (or mounted) on the hand and provide data directly to the system. Examples include accelerometers and gyroscopes, which can track the relative orientation and position of each finger and the hand from a reference point [Prisacariu et al. 2012]. The sensors can be placed on the hand in different configurations, such as one on each segment of a finger, or at strategic locations from which the remaining positions are calculated. This approach is highly accurate, capable of tracking the hand to sub-millimetre levels, but the equipment can be uncomfortable to wear and prevents users from feeling the actual environment. It is also very expensive, which makes it impractical to deploy at a larger scale for many people.
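As a minimal sketch of how a mounted gyroscope can contribute orientation data, angular-velocity samples can be integrated over time. The function name and sample readings below are hypothetical; real IMU tracking fuses three gyroscope axes with accelerometer data to limit drift.

```python
# Sketch: dead-reckoning orientation from gyroscope samples (assumed setup).
# Assumes angular velocity in radians/second about a single joint axis.

def integrate_gyro(samples, dt):
    """Integrate angular-velocity samples (rad/s) taken dt seconds apart."""
    angle = 0.0
    for omega in samples:
        angle += omega * dt  # simple Euler integration
    return angle

# Hypothetical readings: a finger joint rotating at 1 rad/s for 0.5 s.
readings = [1.0] * 50
print(integrate_gyro(readings, dt=0.01))  # ~0.5 rad
```

In practice the integration drifts over time, which is one reason such systems are calibrated against a reference point, as noted above.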
Multi-touch sensors
Multi-touch screen sensors are commonly used in smartphones. They record the points of contact between the human hand and the device. Common examples include pinching, usually to zoom a selection on screen, or dragging two fingers in parallel to scroll through a document. Although accurate at such tasks, a disadvantage of this type is that it only tracks the fingertips, or whichever parts of the hand are in contact with the device, and does not track the position of the hand itself or its orientation in space. Hence this kind of sensor is useful only for gesture-based interactions and not for tracking the whole hand.
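The pinch-to-zoom gesture mentioned above can be sketched from two touch points: the zoom factor is the ratio of the finger separation after and before the gesture. The function names and coordinates here are illustrative, not any particular platform's API.

```python
import math

# Sketch: deriving a zoom factor from a two-finger pinch gesture.
# Touch points are (x, y) screen coordinates in pixels (hypothetical values).

def distance(p, q):
    """Euclidean distance between two touch points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def pinch_zoom_factor(start_pts, end_pts):
    """Ratio of finger separation after vs before the gesture.
    > 1 means pinch-out (zoom in); < 1 means pinch-in (zoom out)."""
    return distance(*end_pts) / distance(*start_pts)

# Fingers move from 100 px apart to 200 px apart: zoom factor 2.0.
print(pinch_zoom_factor(((0, 0), (100, 0)), ((0, 0), (200, 0))))  # 2.0
```

Note that only the contact coordinates enter the computation, which illustrates the limitation discussed above: nothing about the hand's pose or orientation in space is recoverable from these points alone.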
Vision-based sensors
Vision-based sensors capture images in the form of frames and send them to the system. They are useful for tracking because, for some types of sensors, no electronic device needs to be worn on the hand. They also allow a larger distance between the user and the screen, unlike multi-touch sensors. They can be broadly classified into 2D sensors, such as common webcams, and 3D depth sensors, such as the Kinect and the Leap Motion. The former only records the image, while the latter also records the depth of each pixel. This enables a 3D perspective of the scene that can be tracked more efficiently. However, the algorithms for these types of sensors are, in general, computationally expensive. To reduce the complexity, coloured gloves are sometimes used so the system can recognise the hand easily. This method, though, shares the disadvantage of mounted sensors: the gloves interfere with the user's comfort and sense of touch.
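To illustrate how a depth frame enables a 3D perspective, a depth pixel can be back-projected to a 3D point with the standard pinhole camera model. The intrinsic parameters (fx, fy, cx, cy) below are made-up example values; real depth sensors such as the Kinect expose calibrated intrinsics.

```python
# Sketch: back-projecting a depth pixel to camera-space 3D coordinates
# using the pinhole camera model (hypothetical intrinsics).

def depth_pixel_to_3d(u, v, depth, fx, fy, cx, cy):
    """Convert pixel (u, v) with depth in metres to a camera-space (x, y, z).

    fx, fy are focal lengths in pixels; (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# A pixel at the principal point maps straight along the optical axis.
print(depth_pixel_to_3d(320, 240, 1.5, fx=525.0, fy=525.0, cx=320.0, cy=240.0))
# (0.0, 0.0, 1.5)
```

A plain 2D sensor provides only (u, v), so the depth term in this mapping is unknown; this is the extra information the 3D sensors contribute.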
As shown by the work of Tagliasacchi et al., using only the data provided by the sensors is inaccurate, owing to noise created by the sensor itself or by other external factors. The output can even contain hand poses that are not plausible for a real human, such as a finger bent in the wrong direction. This kind of problem is tackled by the use of datasets. These datasets contain a set of hand poses that are plausible for the human hand, along with a mapping that predicts the next likely hand pose given the current one. An example is the public dataset made by Schroder et al., which is used by Tagliasacchi et al. in their algorithm. This will be discussed in detail in a later section.
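One simple way such a dataset can constrain noisy output is to snap the estimated pose to the nearest pose known to be plausible. The sketch below uses tiny two-component joint-angle vectors and a hypothetical pose set; real datasets, such as the one by Schroder et al., hold full hand configurations, and the actual algorithms are considerably more sophisticated than a nearest-neighbour lookup.

```python
import math

# Sketch: snapping a noisy pose estimate to the nearest plausible pose.
# Poses here are hypothetical 2-element joint-angle vectors (radians).

def nearest_plausible(pose, dataset):
    """Return the dataset pose closest to the estimate (Euclidean distance)."""
    return min(dataset, key=lambda p: math.dist(p, pose))

plausible = [(0.0, 0.0), (0.5, 0.3), (1.2, 0.9)]   # hypothetical pose set
noisy_estimate = (0.45, 0.35)                       # sensor output + noise
print(nearest_plausible(noisy_estimate, plausible))  # (0.5, 0.3)
```

Because every returned pose comes from the dataset, implausible configurations such as a finger bent backwards can never be produced, at the cost of depending on how densely the dataset covers the pose space.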