Stereo Depth Perception on the Raspberry Pi – Electronic design services with Argon Design Ltd.


Skill sets used

Multimedia, Image and Video Processing, Algorithms and mathematics, Processor architectures, Software engineering

The Goal

With the announcement by Raspberry Pi of the ultra-low-cost Pi Compute Module, it is now possible to create volume products using the Raspberry Pi, building on its easy programming environment, the simplicity of attaching hardware and its wealth of developer resources.

The Pi has powerful multimedia capabilities, but while general programming on the Pi is straightforward, getting the best out of the multimedia system and programming the VideoCore blocks that make it up is more complex.

Argon Design is expert in programming these systems and can turbocharge multimedia algorithms running on the Pi. To mark the availability of the Compute Module and illustrate our skills, we have produced a demonstration.

The demonstration is of stereo depth perception, i.e. getting depth information from the different views seen by two cameras spaced a short distance apart. This also makes use of another exciting feature of the Pi Compute Module, which is its support for two cameras (the standard Pi only supports one).

Stereo Depth Perception

There are several good algorithms documented for depth perception in the literature. Many of these share similarities with video compression algorithms, a field with which Argon Design has a great deal of experience. Both are based on dividing images into blocks and, for each block in one image, searching for "matching" blocks in one or more other images.

However, we need much more reliable results for depth perception than for video compression. The literature also documents many ways to improve the output of the basic algorithm. As this was a proof-of-concept, our criteria were to choose improvements which would not require too much time to implement, which were not too computationally expensive and which gave decent quality improvements. The basic differences between block-based video compression and our final algorithm are:

  • Compare two images taken at the same time, from two cameras with a known distance between them. This means that we are measuring parallax, from which the distance to the camera can be determined. For video compression we would instead compare images from the same camera at different times, so the algorithm would instead measure motion
  • Assume that the cameras are aligned horizontally. This reduces a 2D search to 1D, saving a huge amount of processing time. This requires the cameras to be calibrated, but there are known methods to do this either manually or automatically
  • Use a more accurate (but more expensive) measure of correlation between two blocks, based on the following two papers:

    Specifically, we decided on a combination of the "C5" correlation function from [2] with the multi-window scheme of [1]. Our program can use either the 5x5 windowing scheme or the 7x7 scheme; overall, the 7x7 version takes longer to calculate but is slightly more accurate
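The core search described above can be sketched in a few lines. This is a simplified illustration rather than the demo's code: it uses a plain sum-of-absolute-differences cost instead of the C5 multi-window correlation, and the block size, search range, focal length and baseline below are arbitrary placeholders.

```python
# Minimal 1D block-matching sketch (NumPy). Because the cameras are
# horizontally aligned, each block in the left image is matched only
# against horizontally shifted blocks on the same scanline of the right
# image; the winning shift is the disparity (parallax) for that block.
import numpy as np

def disparity_for_block(left, right, y, x, block=8, max_disp=64):
    """Find the disparity minimising the SAD cost along one scanline."""
    ref = left[y:y+block, x:x+block].astype(np.int32)
    best_d, best_cost = 0, np.inf
    for d in range(0, min(max_disp, x) + 1):
        cand = right[y:y+block, x-d:x-d+block].astype(np.int32)
        cost = np.abs(ref - cand).sum()       # sum of absolute differences
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d

def depth_from_disparity(d, focal_px, baseline_m):
    """Pinhole-model conversion from parallax to distance: Z = f * B / d."""
    return np.inf if d == 0 else focal_px * baseline_m / d
```

Larger disparities correspond to nearer objects, which is why the depth formula is a reciprocal: halving the distance doubles the parallax.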

There are more sophisticated "global" schemes for stereo depth, such as Semi-Global Matching, Belief Propagation and Graph-Cut. However, these schemes are too complex to run in real time without specialised hardware.


The implementation went through three different versions:

  • An original version using Python and NumPy, which was used to evaluate the accuracy of the various alternatives
  • A version in C, with as many algorithmic improvements made as possible during translation. The main improvements came from avoiding redundant calculations through careful code arrangement, as well as replacing a generic sorting algorithm (used by the multi-window scheme) with a custom sorting network which was around 4 times as fast
  • A version in VideoCore VPU assembler, which follows the structure of the C version but which exploits the VPU's architectural features (for example its 16-way vector unit and 16KB of processor-local memory)
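The sorting-network idea mentioned above can be illustrated briefly. The exact network used in the C version is not published; the sketch below uses a standard optimal 9-comparator network for five inputs. Because the sequence of compare-exchange steps is fixed regardless of the data, there is no data-dependent control flow to mispredict, which is what makes this approach so much faster than a generic sort on small fixed-size inputs.

```python
# A 5-input sorting network: a fixed sequence of 9 compare-exchange
# steps that sorts any 5 values. The comparator positions are the
# standard optimal network for n = 5 (this is an illustration of the
# technique, not the specific network used in the C implementation).
FIVE_INPUT_NETWORK = [(0, 1), (3, 4), (2, 4), (2, 3), (0, 3),
                      (0, 2), (1, 4), (1, 3), (1, 2)]

def sort5(values):
    """Sort exactly five values using a fixed compare-exchange sequence."""
    v = list(values)
    for i, j in FIVE_INPUT_NETWORK:
        if v[i] > v[j]:            # compare-exchange step
            v[i], v[j] = v[j], v[i]
    return v
```

On vector hardware such as the VPU, each compare-exchange can be expressed branch-free with min/max operations across a whole vector of blocks at once.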

The approximate processing times for a moderate-sized image (768x576 pixels) in each version were as follows:

  • Python, on x86 (~3GHz): 63 seconds
  • C, on x86 (~3GHz): 56 milliseconds
  • C, on Pi ARM core (700MHz): 1 second
  • Assembler on VideoCore VPU (250MHz): 90 milliseconds

Note that the VideoCore version took only around 50% longer than the x86 C version on a processor with around 12x the clock speed. This demonstrates the improvement which can be had by using a specialised digital signal processor.

We also created a demo application which would read images from the two cameras on the Compute Module, process them and display the calculated depth as a colour map. The camera processing and display added a small amount of overhead, so to compensate we reduced the image size to VGA (640x480). This resulted in a final framerate of 12fps. This is sufficient for a proof-of-concept and shows the sort of image processing tasks that can be implemented with reasonable speed on the Raspberry Pi.

Please don’t ask us if we can give you the code for your own project. The demo is not in a form that is suitable to be released and we can only provide advice on stereo algorithms on a commercial basis.


The demonstration application we produced can display the results in three different ways:

  • Display the original camera images only
  • Convert parallax data to a colour rainbow and overlay onto the image
  • Remove all points at least a certain distance from the cameras
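The demo's exact colour mapping is not published, but the second and third display modes can be sketched along the following lines: disparity is mapped onto a simple colour ramp, and pixels whose disparity falls below a cut-off (i.e. points beyond a certain distance, since nearer objects have larger disparity) are masked out.

```python
# Sketch of the two depth visualisations. The colour ramp and the
# cut-off threshold are illustrative choices, not the demo's values.
import numpy as np

def disparity_to_rgb(disp, max_disp=64):
    """Map disparity 0..max_disp onto a blue (far) to red (near) ramp."""
    t = np.clip(disp / max_disp, 0.0, 1.0)
    r = t
    g = 1.0 - np.abs(2.0 * t - 1.0)    # green peaks at mid-range
    b = 1.0 - t
    return (np.stack([r, g, b], axis=-1) * 255).astype(np.uint8)

def cut_out_background(image, disp, min_disp):
    """Black out pixels whose disparity is below the cut-off (too far away)."""
    out = image.copy()
    out[disp < min_disp] = 0
    return out
```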

Screenshots of the results are below, along with some photos of the equipment used.

[Screenshots: the 2D camera image, the distance 'heat map' overlay and the distance cut-out, plus two photos of the Raspberry Pi platform]
However, these are far from the only possible applications. Others include:

  • Stitching together images from two or more cameras to form a panorama
  • A previous project of Argon's was to use an Extended Kalman Filter to separate the motion of the camera(s) from that of the objects in a scene. These could be combined to form dynamic, 3D maps of an environment
  • The background removal demonstration currently cuts sections out of some objects. This could be remedied by combining the results with those of an edge-detection filter. This would then allow us to do background substitution without the need for a green screen
  • This algorithm is very well-suited for execution on an FPGA, as most of the calculations can be done in parallel. Thus any of the above could easily be scaled to work with 1080p30 or 4K video in real time

Contact us

Do you have a project that you would like to discuss with us, or a general enquiry? Please feel free to contact us.
