Google introduced NeRF-Supervised, a new framework for training deep stereo networks without any manually labeled ground-truth data. The goal of the NeRF-Supervised Deep Stereo pipeline is to produce accurate 3D reconstructions of a scene from a set of 2D images captured with a single handheld camera.
Deep stereo networks take a pair of images as input and estimate the disparity between corresponding pixels in the two views, which in turn yields per-pixel depth. The resulting depth or disparity map can then be used to generate 3D reconstructions of the scene.
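Since disparity and depth are closely related, the standard pinhole-camera conversion between them is worth making concrete. The sketch below assumes a rectified stereo pair with known focal length and baseline (all numeric values are illustrative):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Convert a disparity map (in pixels) to metric depth via depth = f * B / d.

    focal_px: camera focal length in pixels; baseline_m: distance between
    the two viewpoints in meters. The eps guard avoids division by zero.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(disparity, eps)

# Example: a pixel with disparity 20 px, focal length 1000 px, baseline 0.1 m
depth = disparity_to_depth([[20.0]], focal_px=1000.0, baseline_m=0.1)
# 1000 * 0.1 / 20 → 5.0 m
```

Note the inverse relationship: nearby objects produce large disparities, distant ones small disparities, which is why disparity errors matter most for far-away geometry.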
The primary drawback of traditional stereo vision techniques is that they require dedicated multi-camera rigs to capture depth information, which limits their practicality.
With the NeRF-Supervised framework, deep stereo networks can be trained easily, without any ground-truth data.

Method
The team collected sparse sets of images using smartphones equipped with standard cameras. After collecting multi-view images of a single scene from multiple perspectives, they utilized Neural Radiance Fields (NeRF) to fit each scene and generate stereo triplets along with depth information. Finally, the team employed the generated data to train a Deep Stereo network to produce accurate and reliable depth maps for a variety of real-world scenes and objects.
NeRF is a deep learning-based approach that can model the scene’s 3D geometry and appearance by learning a volumetric representation of radiance fields from 2D images.
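At the core of NeRF is a differentiable volume-rendering step: the color of a pixel is an alpha-composited sum of densities and colors sampled along its camera ray. A minimal NumPy sketch of that discrete quadrature (the sample values are made up):

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete NeRF-style volume rendering along one ray.

    sigmas: (N,) densities at N samples; colors: (N, 3) RGB at samples;
    deltas: (N,) distances between consecutive samples.
    Returns the rendered RGB and the per-sample contribution weights.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                           # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # transmittance so far
    weights = trans * alphas                                          # contribution per sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy ray: empty space, then a dense red surface (illustrative values)
sigmas = np.array([0.0, 0.0, 50.0])
colors = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
deltas = np.array([0.1, 0.1, 0.1])
rgb, w = composite_ray(sigmas, colors, deltas)
```

The same weights can be reused to compute an expected depth along the ray, which is how a fitted NeRF can emit the depth maps this pipeline relies on.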
Deep Stereo, on the other hand, is a technique that uses a convolutional neural network to estimate the disparity map between two views of a scene.
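The key structure inside most deep stereo networks is a cost volume that compares each left-image pixel against horizontally shifted right-image pixels. The sketch below builds one from raw intensities with a hard argmin; a real network would use learned features and a soft-argmin, but the geometry is the same (the toy images are illustrative):

```python
import numpy as np

def cost_volume_disparity(left, right, max_disp):
    """Winner-takes-all disparity from an absolute-difference cost volume.

    left, right: (H, W) grayscale images; returns an (H, W) disparity map.
    Left pixel (y, x) is matched against right pixel (y, x - d).
    """
    H, W = left.shape
    cost = np.full((max_disp, H, W), np.inf)
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, :W - d])
    return cost.argmin(axis=0)

# Toy example: the right view is the left view shifted by 2 px
left = np.tile(np.arange(8.0), (4, 1))
right = np.roll(left, -2, axis=1)
disp = cost_volume_disparity(left, right, max_disp=4)
# disp is 2 everywhere the 2-px shift can be matched
```

Deep networks replace the raw intensity differences with learned feature correlations, which is what makes them robust to textureless and ambiguous regions where this naive matcher fails.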
There are three primary steps in the pipeline:
- Synthetic data generation: collecting 2D images of a single static scene from different viewpoints. The team captured 270 high-resolution indoor and outdoor scenes with smartphones, focusing on a specific object in each scene and taking about 100 images per scene from different viewpoints.
- Stereo triplet rendering using NeRF. Instant-NGP, the NeRF engine, was trained for 50K steps per scene to render stereo triplets and depth maps, resulting in a total of 65K triplets.
- Training the stereo network using the rendered stereo triplets and depth maps as training data. The stereo network learns to predict accurate disparity maps from the rendered data, which can then be used for various 3D reconstruction tasks.
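Conceptually, the final step combines a photometric term computed on the rendered triplet with direct supervision from the NeRF-rendered disparity. The sketch below is a simplified, illustrative version of such a combined objective; the weights, the confidence handling, and the precomputed `photo_err` input are assumptions, not the paper's exact formulation:

```python
import numpy as np

def ns_style_loss(pred_disp, rendered_disp, photo_err, conf, alpha=1.0, beta=0.5):
    """Sketch of a NeRF-Supervised-style training objective.

    Combines (i) a per-pixel photometric error `photo_err`, obtained by
    warping the triplet's side views with the predicted disparity, and
    (ii) an L1 term against the NeRF-rendered disparity, down-weighted by
    a per-pixel rendering confidence `conf` in [0, 1]. The alpha/beta
    weights are illustrative, not the paper's values.
    """
    disp_term = conf * np.abs(pred_disp - rendered_disp)
    return alpha * photo_err.mean() + beta * disp_term.mean()

# Toy tensors (illustrative values only)
H, W = 4, 5
pred = np.full((H, W), 10.0)
rendered = np.full((H, W), 12.0)
photo = np.full((H, W), 0.1)
conf = np.ones((H, W))
loss = ns_style_loss(pred, rendered, photo, conf)
# 1.0 * 0.1 + 0.5 * 2.0 → 1.1
```

Weighting the disparity term by a rendering confidence lets the network discount regions where the NeRF reconstruction itself is unreliable, leaning on the photometric term there instead.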

Test results
The results show that the NeRF-supervised stereo networks outperform both self-supervised and supervised methods, achieving state-of-the-art performance on the Middlebury dataset, a benchmark for stereo vision research.
They also compared their results against methods trained on synthetic data with ground-truth labels and found that the NeRF-Supervised pipeline performed on par, despite using no ground-truth data at all.


Conclusion, future research
NeRF-Supervised Deep Stereo is a new pipeline that generates stereo training data from a sequence of images captured by a single regular camera. The approach can also be used to generate labels for other tasks, such as optical flow or multi-view stereo.
The authors state that their approach has limitations in challenging conditions, such as transparent surfaces and nighttime scenes. The scenes used in the research are also constrained to being static and small-scale.
A larger-scale collection of stereo training datasets, coupled with other NeRF variants, may deal with these challenges in the future.
Learn more:
- Research paper: “NeRF-Supervised Deep Stereo” (on arXiv)
- NeRF-Supervised Deep Stereo on GitHub (code, examples)
Glossary
- Stereo data, often referred to as “stereo pairs” or “stereo images”, refers to image data that captures the same scene from two or more different viewpoints or angles. It is commonly used in computer vision and 3D reconstruction applications.
- Deep stereo networks are a type of artificial neural network used in computer vision to estimate the depth of a scene from two or more images. They are designed to mimic the way humans perceive depth through binocular vision, which involves comparing the differences between the images captured by the left and right eyes.
