Image classification - assigning an image to a category based on its visual content - is a general problem with many applications, including object recognition, remote sensing, content-based indexing and quality control. Argon Design has been involved in developing a system that enables users to create realistic 3D virtual avatars of themselves from video captured with their mobile phone camera.
To provide a truly personal experience, the process involves creating an accurate 3D model of the user's face. To create the face model, the user takes a short video of their head turning from one side to the other, typically using the front-facing camera of a mobile device. This video is then fed into various computer vision algorithms (Structure from Motion, Texture Extraction), which extract a 3D mesh model of the face and a corresponding texture map.
The accuracy of the extracted 3D model depends on the quality of the input video, owing to the internals of the vision algorithms involved. The system must ensure that the end result is lifelike and pleasing; otherwise, there is a risk of the user abandoning the service. It is also important that the user can achieve the desired results without a large amount of trial and error, which again risks frustration and ultimately a poor user experience.
We created a real-time feedback mechanism to assist the user in acquiring good quality video. Prior to starting the video capture, the user is presented with a live view of the camera feed. The feedback mechanism analyses the live view and, if necessary, gives the user recommendations on how to improve the conditions in order to achieve a high quality end result.
After analysing the sensitivities of the model extraction algorithms, we can identify two broad groups of input quality issues: illumination-related problems and contextual problems.
Incorrect illumination can cause problems for both structure and texture extraction. For example, the location of strong highlights on the face depends on the direction of the incident light, and tends not to move with the rest of the facial landmarks as the user rotates their head in the video. This effect is problematic for structure extraction algorithms, as the highlights can be misinterpreted as static landmarks or can obstruct real facial features. Similarly, strong directional lighting, or strongly coloured light sources, can result in uneven and unnatural skin tones after texture extraction.
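A simple heuristic along these lines is to measure what fraction of a face region is near saturation. The sketch below is a minimal illustration of the idea; the threshold constants are illustrative placeholders, not the tuned values of the actual system.

```python
import numpy as np

def highlight_score(face_gray, bright_thresh=240, max_fraction=0.02):
    """Score the presence of strong specular highlights in a face region.

    face_gray: 2-D uint8 array cropped to the face.
    Returns a value in [0, 1]: 0 means no near-saturated pixels,
    1 means the saturated fraction is at or above max_fraction.
    """
    fraction = np.mean(face_gray >= bright_thresh)
    # Map the saturated-pixel fraction to a bounded problem score.
    return float(min(fraction / max_fraction, 1.0))
```

A dark, evenly lit face yields a score near 0, while a face with a large specular patch saturates the score at 1.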
Contextual problems cause difficulty mostly during structure extraction, and arise from the assumptions and limitations of the algorithms involved. For example, if the user's fringe covers a portion of their forehead, or if the user is wearing glasses, these structures are incorporated into the extracted 3D mesh, which then bears little resemblance to the shape of a human face.
Quality analysis of the input image is therefore an image classification problem: we must decide whether any of the problematic conditions is present in the input image. Given enough reference data, we can use machine learning techniques to train classifiers that identify these quality issues. A supply of reference images already classified by other means (typically by humans) is a key prerequisite for applying machine learning. While it is trivial for a human to judge whether someone is wearing glasses, it is harder to assess illumination problems objectively when manually classifying reference input. In addition, the quality analysis must run fast enough on a mobile device to provide real-time feedback during the live video preview. This performance requirement rules out some computationally expensive machine learning techniques.
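To make the performance constraint concrete: a classifier cheap enough for per-frame use can be as small as a linear model, a dot product and a sigmoid per issue. The sketch below illustrates this; the feature vector and weights are placeholders, not those of the production system.

```python
import numpy as np

def issue_score(features, weights, bias):
    """Score one quality issue with a lightweight linear classifier.

    features: 1-D feature vector extracted from the frame.
    weights, bias: parameters learned offline from labelled reference
    images (illustrative here). Returns the estimated probability,
    in [0, 1], that the issue is present.
    """
    z = np.dot(weights, features) + bias
    return 1.0 / (1.0 + np.exp(-z))  # logistic sigmoid
```

With zero weights the classifier is maximally uncertain (score 0.5); training pushes the score towards 0 or 1 as evidence accumulates in the features.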
The trade-off we have made is to use machine learning techniques to identify contextual issues, and heuristics for illumination-related problems. Contextual issues tend to vary relatively slowly (e.g. it is unlikely that the user will keep taking their glasses on and off at high frequency), so separate issues can be analysed on alternate frames. Observed illumination can change faster, for example as the user moves through a room, or as the automatic exposure and white balance control of the camera adapts to the lighting conditions. This necessitates analysing illumination at a higher rate in order to keep the system responsive. We can also exploit substantial prior information from knowing that the input image contains a frontal face: for example, average facial proportions and shape, the average chromaticity of skin colour, and typical skin texture in specific areas.
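As one example of exploiting this prior knowledge, a colour-cast heuristic can sample a patch where skin is expected (e.g. a cheek, located from the known frontal-face geometry) and compare its normalised chromaticity against a reference skin chromaticity. The sketch below illustrates the idea; the reference value and tolerance are illustrative, and a real system would derive them from labelled data across skin tones.

```python
import numpy as np

# Illustrative reference skin chromaticity (r, g), not a production value.
REF_CHROMA = np.array([0.45, 0.33])

def chroma_score(cheek_rgb, tol=0.08):
    """Heuristic colour-cast check on a patch of known skin.

    cheek_rgb: H x W x 3 float array sampled from a cheek region.
    Normalising by the per-pixel channel sum discounts overall
    brightness, so the score reacts to colour casts rather than
    exposure. Returns a value in [0, 1].
    """
    rgb = cheek_rgb.reshape(-1, 3).astype(float)
    s = rgb.sum(axis=1, keepdims=True) + 1e-9   # avoid divide-by-zero
    chroma = (rgb / s)[:, :2].mean(axis=0)      # mean (r, g) chromaticity
    dist = np.linalg.norm(chroma - REF_CHROMA)
    return float(min(dist / tol, 1.0))          # 0 = neutral, 1 = strong cast
```

A neutrally lit skin patch scores near 0; a heavily tinted one (say, under a strongly blue light) saturates the score at 1.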
Putting it all together, we have implemented the quality analysis as a multistage algorithm which analyses the live video preview one frame at a time, but exploits inter-frame correlation for efficiency. The output is a set of scores in a predefined range, each indicating the presence of a particular quality issue. Roughly, the illumination heuristics are evaluated on every frame, while the contextual classifiers take turns across successive frames.
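This per-frame scheduling can be sketched as follows. The function names and score dictionary are illustrative, not the library's actual API; the round-robin over contextual checks reflects the alternate-frame analysis described above.

```python
def analyse_frame(frame_index, scores, illumination_checks, contextual_checks):
    """Update the quality-score dictionary for one preview frame.

    illumination_checks / contextual_checks: dicts mapping issue names
    to scoring callables that return a value in [0, 1].
    Illumination heuristics run on every frame; the more expensive
    contextual classifiers take turns, one per frame, since contextual
    issues change slowly between frames.
    """
    for name, check in illumination_checks.items():
        scores[name] = check()
    names = list(contextual_checks)
    current = names[frame_index % len(names)]   # round-robin selection
    scores[current] = contextual_checks[current]()
    return scores
```

Each contextual score is therefore a few frames stale at worst, which is acceptable given how slowly those conditions change.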
Computing an improvement recommendation can be considered another classification problem. The inputs in this case are the scores from the image quality analysis algorithm, and the output classes represent the most pressing problem with the image. This final classification is then used to give the user feedback on how to alter the input to achieve good results.
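In its simplest form this final stage just selects the worst-scoring issue, provided it is bad enough to be worth mentioning. A minimal sketch, with an illustrative threshold and issue names:

```python
def recommend(scores, threshold=0.5):
    """Pick the most pressing quality issue to report to the user.

    scores: dict mapping issue names to values in [0, 1].
    Returns the issue name with the highest score above `threshold`,
    or None when every score is acceptable (no feedback needed).
    """
    worst = max(scores, key=scores.get)
    return worst if scores[worst] > threshold else None
```

The returned issue name is then mapped to a user-facing message, e.g. suggesting the user remove their glasses or face away from a strong light.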
We have implemented our quality analysis algorithm as a native library that application developers can integrate into any application that would benefit from its functionality. We also developed a demo application to test the responsiveness of the algorithm and its integration with a live video preview. We make use of the BSD-licensed OpenCV library, which lets us take advantage of optimised implementations of image processing and computer vision algorithms developed by the community. While not a mandatory dependency, our own algorithms can also use a high-performance BLAS library for additional speed. As a result, our library can analyse frames at typical video frame rates on a modern mobile platform.
The use of machine learning for image classification is a well-recognised technique, with applications in many fields including security, machine vision, automotive and healthcare.
As demonstrated in this case study, Argon Design has the skills to identify appropriate algorithms and to develop implementations for operation within particular platform constraints in the real world.
Do you have a project that you would like to discuss with us, or a general enquiry? Please feel free to contact us.