How is machine learning leveraged in gaming consoles?

Mukul Keerthi
4 min read · Nov 18, 2020

With the holiday season just around the corner, the debate over which console to pick is back: the Sony PlayStation or the Microsoft Xbox? While there is a lot of buzz surrounding that question, I recently came across some interesting research explaining how machine learning is used in these devices. In this article I will explain how ML models make use of the impressive hardware that these consoles possess, in particular the Microsoft Xbox with its Kinect motion detection device.

Microsoft Kinect which uses motion detection in gameplay

Microsoft Kinect is a gaming system which uses the motion of the players instead of traditional controller devices like a gamepad. Its three camera sensors and four mics help Kinect recognize who is standing in front of it and detect their body motion.

Random Forest model

Before getting to the random forest model, let's take a look at the decision tree, which is its building block. As the name suggests, a decision tree separates a dataset step by step based on the features of its observations. In the example shown below, a set of numbers is classified based on features such as their colour and whether they are underlined. Of course, the datasets we use in real life are rarely as simple as this example, but the logic remains the same. At each node, the observations are split so that the resulting subgroups are as different from each other as possible.

Simple decision tree
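The splitting idea can be sketched in a few lines of Python. This is only a minimal illustration: it assumes Gini impurity as the measure of how mixed a group is (the article does not name a specific criterion), and the toy dataset, the function names `gini` and `best_split`, and the feature keys are all made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, higher for a mixed one."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, score) for the yes/no split with the lowest weighted impurity."""
    best = None
    for f in features:
        left = [y for r, y in zip(rows, labels) if r[f]]
        right = [y for r, y in zip(rows, labels) if not r[f]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (f, score)
    return best

# Toy data, invented for illustration: coloured, possibly underlined numbers.
rows = [
    {"is_red": True, "underlined": True},
    {"is_red": True, "underlined": False},
    {"is_red": False, "underlined": True},
    {"is_red": False, "underlined": False},
]
labels = [1, 1, 0, 0]
feature, impurity = best_split(rows, labels, ["is_red", "underlined"])
```

Here splitting on `is_red` separates the two classes perfectly, while splitting on `underlined` leaves both subgroups evenly mixed, so the first feature is chosen. A full tree would repeat the same search inside each subgroup.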

A random forest, as the name suggests, consists of a large number of individual decision trees that function as an ensemble. Each individual tree in the forest gives out a prediction, and the class with the most votes becomes the prediction of the model.
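The voting step can be sketched as follows. The three "trees" here are hypothetical one-line rules standing in for trained decision trees, and the feature name `depth_diff` and the labels are invented for the example.

```python
from collections import Counter

def forest_predict(trees, sample):
    """Ask every tree for a label and return the most common one."""
    votes = Counter(tree(sample) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees: each is a simple threshold rule.
trees = [
    lambda s: "hand" if s["depth_diff"] > 0.5 else "head",
    lambda s: "hand" if s["depth_diff"] > 0.3 else "head",
    lambda s: "hand" if s["depth_diff"] > 0.9 else "head",
]
prediction = forest_predict(trees, {"depth_diff": 0.6})
```

Two of the three rules vote "hand" for this sample, so the forest predicts "hand" even though one tree disagrees. With an even number of trees a tie is possible; this sketch simply returns whichever tied label was seen first.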

The random forest model is based on the principle of the wisdom of crowds: a large number of relatively uncorrelated models acting as a committee will outperform any of the individual constituent models [1]. The low correlation between the models is the key. Uncorrelated models produce ensemble predictions that are more accurate than any of the individual predictions. As a result, the trees protect each other from their individual errors, as long as they do not all consistently err in the same direction.
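A quick calculation shows why the committee helps. The sketch below assumes an idealised case that real forests only approximate: a two-class problem, trees that are fully independent, and every tree being right with the same probability. The function name `majority_accuracy` and the 60% figure are chosen for illustration.

```python
from math import comb

def majority_accuracy(n_trees, p_correct):
    """Probability that more than half of n independent voters are right."""
    k_min = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(k_min, n_trees + 1)
    )

single = majority_accuracy(1, 0.6)    # one tree on its own
small = majority_accuracy(11, 0.6)    # committee of 11
large = majority_accuracy(101, 0.6)   # committee of 101
```

Under these assumptions the majority vote becomes steadily more reliable as trees are added, even though each tree is only slightly better than a coin flip. Trees trained on the same data are never fully independent, so the gain in practice is smaller, which is why keeping the correlation between trees low matters.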

Use in Microsoft Kinect

The Microsoft Kinect predicts the 3D positions of the body joints from a single depth image, using no temporal information. A large and highly varied training dataset allows the classifier to estimate body parts invariant to body pose, shape or clothing. Finally, confidence-scored 3D proposals for several body joints are generated by reprojecting the classification results [2]. This approach is driven by two key design goals: computational efficiency and robustness. A single input depth image is segmented into a dense probabilistic body part labeling, with the parts defined to be spatially localized near the joints of interest. By reprojecting the inferred parts into space, several confidence-weighted proposals are generated for the 3D location of each skeletal joint.

Single input depth image containing per pixel body part distribution
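To make the last step concrete, here is a simplified sketch of turning per-pixel probabilities for one body part into a single 3D proposal. It is not the method from [2], which finds modes of the reprojected distribution: this version just takes a probability-weighted average of the reprojected pixels. The pinhole camera parameters `fx`, `fy`, `cx`, `cy` and the function names are assumptions made for the example.

```python
def reproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to camera space."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def joint_proposal(pixels, fx=1.0, fy=1.0, cx=0.0, cy=0.0):
    """pixels: list of (u, v, depth, prob) for ONE body part.

    Returns (position, confidence), or None when no pixel supports the part.
    Simplification: weighted mean instead of the mode finding used in [2].
    """
    total = sum(p for _, _, _, p in pixels)
    if total == 0:
        return None
    pos = [0.0, 0.0, 0.0]
    for u, v, d, p in pixels:
        point = reproject(u, v, d, fx, fy, cx, cy)
        for i in range(3):
            pos[i] += p * point[i]
    return tuple(c / total for c in pos), total

# Two pixels, equally likely to belong to the same (hypothetical) body part.
position, confidence = joint_proposal([(1, 0, 2.0, 0.5), (3, 0, 2.0, 0.5)])
```

The summed probability acts as a rough confidence score: a part supported by many confident pixels gets a stronger proposal than one supported by a few uncertain ones. A plain average cannot return several proposals per joint, which is one reason a mode-finding step is the better fit for the real system.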

A deep randomized decision forest classifier is trained on hundreds of thousands of training images so that overfitting can be avoided. The depth comparison features it uses also yield 3D translation invariance while maintaining high computational efficiency. Depth cameras are used because depth images are invariant to texture and colour and help resolve silhouette ambiguities in pose. Combining the decision trees in the forest helps to determine which part of the body each pixel actually belongs to. The approach is also computationally efficient: no preprocessing is needed, and each feature needs to read at most 3 image pixels and perform at most 5 arithmetic operations [2], which can be implemented on the GPU in a very straightforward manner.
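The feature described in [2] compares the depth at two probe positions around a pixel, with the probe offsets scaled by the depth at that pixel so that the response does not change as a person moves closer or further away. The sketch below follows that idea on a tiny depth image stored as a list of lists. The value of `BACKGROUND`, the rounding of probe positions to the nearest pixel, and the function names are assumptions made for the example, and the pixel at (x, y) is assumed to have a positive depth.

```python
BACKGROUND = 1e6  # large constant for probes that leave the image (assumed value)

def probe(depth, x, y):
    """Depth at (x, y), or BACKGROUND if the position is outside the image."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        return depth[y][x]
    return BACKGROUND

def depth_feature(depth, x, y, u, v):
    """f = d(x + u / d(x)) - d(x + v / d(x)), with offsets u and v given as (dx, dy)."""
    d = depth[y][x]                                      # pixel read 1
    ux, uy = x + round(u[0] / d), y + round(u[1] / d)    # depth-normalised probe 1
    vx, vy = x + round(v[0] / d), y + round(v[1] / d)    # depth-normalised probe 2
    return probe(depth, ux, uy) - probe(depth, vx, vy)   # pixel reads 2 and 3

# Tiny 3x3 depth image (metres), invented for illustration.
depth = [
    [2.0, 2.0, 2.0],
    [2.0, 2.0, 4.0],
    [2.0, 2.0, 2.0],
]
response = depth_feature(depth, 1, 1, u=(2, 0), v=(-2, 0))
```

Each node of a tree compares a response like this against a learned threshold and sends the pixel left or right. Because a feature touches only three pixels and does a handful of arithmetic operations, every pixel can be evaluated independently, which is what makes the approach map so well onto a GPU.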

The accuracy of classification depends on multiple factors. Accuracy increases roughly logarithmically with the number of training images, although it starts to tail off at around 100K images. Silhouette images also negatively affect the accuracy of the prediction. As expected, the depth of the classification trees has the most impact: as seen in the figure below, the model trained on 900K images performs much better than the model trained on 15K images. The range of depth probe offsets allowed during training also has a large effect on the accuracy.

Training parameters vs accuracy in classification

Conclusion

We have seen how a single depth image can be used to obtain accurate proposals for the 3D locations of body joints, and how body part recognition is used for human pose estimation. The use of a highly varied synthetic training set has allowed deep decision trees to be trained without overfitting, learning invariance to both shape and pose.

References

[1] https://towardsdatascience.com/understanding-random-forest-58381e0602d2

[2] https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/BodyPartRecognition.pdf

Mukul Keerthi

The author works as an embedded software engineer in the lovely mid-west of Ireland!