Depth Anything V2, a highly capable depth estimation model

Depth Anything V2 is a powerful new monocular depth estimation model that delivers significantly more detailed and accurate depth predictions than the first version. Compared with the latest Stable Diffusion-based models, such as Marigold and GeoWizard, it is more accurate and over 10x faster.

The model is available in 4 sizes ranging from 25M to 1.3B parameters. This scalability makes it adaptable to various tasks with different performance and accuracy requirements.

The Small model is released under the Apache License 2.0 and can be used for both non-commercial and commercial purposes, while the Base, Large, and Giant models are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC-BY-NC-4.0), restricting their use to non-commercial purposes only.
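The licensing split above can be captured in a small lookup, for example to guard a deployment script. This is only an illustrative sketch: the dictionary and the `usable_variants` helper are hypothetical names, not part of the official repository, and the license terms themselves remain the authoritative source.

```python
from typing import List

# License of each model size, as described in the paragraph above.
VARIANT_LICENSES = {
    "Small": "Apache-2.0",
    "Base": "CC-BY-NC-4.0",
    "Large": "CC-BY-NC-4.0",
    "Giant": "CC-BY-NC-4.0",
}

# Licenses that permit commercial use (only Apache-2.0 in this list).
COMMERCIAL_LICENSES = {"Apache-2.0"}


def usable_variants(commercial: bool) -> List[str]:
    """Return the model sizes whose license allows the intended use."""
    return [
        name
        for name, license_id in VARIANT_LICENSES.items()
        if not commercial or license_id in COMMERCIAL_LICENSES
    ]
```

For a commercial project, `usable_variants(True)` leaves only the Small model, while `usable_variants(False)` returns all four sizes.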

You can download the model from its GitHub repository, which also includes detailed instructions for training your own models on your own image datasets. If you need a custom dataset, Roboflow 100 is a good place to look. An online demo is also available.
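Besides the GitHub repository, one convenient way to try the model is through the Hugging Face `transformers` depth-estimation pipeline. The sketch below assumes that `transformers`, `torch`, and `Pillow` are installed and that the Small checkpoint is published on the Hugging Face Hub as `depth-anything/Depth-Anything-V2-Small-hf`. The helper name `estimate_depth` and the file names are hypothetical, and this is not necessarily the workflow the authors recommend.

```python
# Assumption: Hub id of the Apache-2.0 Small checkpoint.
DEFAULT_MODEL_ID = "depth-anything/Depth-Anything-V2-Small-hf"


def estimate_depth(image_path: str, model_id: str = DEFAULT_MODEL_ID):
    """Return a PIL image holding the predicted depth map for one input image."""
    # Imported lazily so that loading this file does not require the heavy dependencies.
    from PIL import Image
    from transformers import pipeline

    depth_estimator = pipeline(task="depth-estimation", model=model_id)
    result = depth_estimator(Image.open(image_path))
    # The pipeline returns a dict; "depth" is the depth map rendered as a PIL image.
    return result["depth"]
```

Calling `estimate_depth("example.jpg").save("example_depth.png")` would download the weights on first use and write the depth map next to the input image.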

The figure below highlights the benefits of Depth Anything V2 over Depth Anything V1 and Marigold: faster inference speed, better depth accuracy, and fewer parameters.

Depth Anything V2 significantly outperforms V1 and Marigold models (source: paper)

Monocular depth estimation and its critical role in AI vision

Depth Anything V2 is a new approach to monocular depth estimation (MDE) that offers more detailed and accurate depth predictions, which are crucial for many computer vision applications.

MDE is a technique that allows computers to predict the distance to objects in a scene from a single image, similar to humans’ perception of depth. Accurate depth information is essential not only for traditional tasks like 3D reconstruction, navigation, and autonomous driving, but also for emerging areas like AI-generated content (images, videos, and 3D scenes).
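In practice, the output of an MDE model is a depth map: one value per pixel. A common way to inspect such a map is to rescale it to 8-bit grayscale. The snippet below is a minimal sketch of that min-max normalization using plain Python lists. The function name and the toy values are made up for illustration, and whether bright means near or far depends on the model's output convention.

```python
from typing import List, Sequence


def depth_to_grayscale(depth: Sequence[Sequence[float]]) -> List[List[int]]:
    """Rescale a depth map linearly so its values span 0..255."""
    flat = [value for row in depth for value in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        # A constant map carries no contrast; render it as all zeros.
        return [[0 for _ in row] for row in depth]
    scale = 255.0 / (highest - lowest)
    return [[round((value - lowest) * scale) for value in row] for row in depth]


# Toy 2x2 "depth map" with arbitrary values.
toy_depth = [[0.0, 1.0], [2.0, 5.0]]
gray = depth_to_grayscale(toy_depth)
```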

The MDE models can be categorized into two approaches:

  1. Discriminative models, such as Depth Anything and other models built on image encoders like BEiT and DINOv2, predict depth directly from image features. They are faster and require less training time, which makes them suitable for real-time applications.
  2. Generative models, such as Marigold (built on Stable Diffusion), learn to reconstruct the entire scene with depth information. They can capture fine details with high accuracy, but they usually require more computational resources and larger datasets for training.

The figure below compares one model from each category on open-world images: Marigold (generative) and Depth Anything V2 (discriminative). Marigold excels at capturing intricate details, while Depth Anything V2 provides more robust predictions in complex scenes.

Comparison between Marigold and Depth Anything V2 in open-world images (source: paper)

The new MDE model, Depth Anything V2, aims to achieve all the properties listed in the table below.

Preferable properties for MDE (source: paper)

Depth Anything V2 model architecture

Depth Anything V2 uses a teacher-student architecture, in which a powerful pre-trained teacher model guides the training of a smaller, more efficient student model. The teacher, trained on a rich dataset, transfers its knowledge to the student, enabling the student to achieve good performance with fewer computational resources.

Based on this architecture, the training process of Depth Anything V2 follows 3 steps:

  1. Train a powerful teacher model (DINOv2-G) on high-quality synthetic images that include precise depth details.
  2. Use the teacher model to annotate unlabeled real images. These pseudo-labeled real images bridge the gap between the synthetic training environment and real-world application.
  3. Train the student model (DINOv2-S) on high-quality pseudo-labeled real images.

Depth Anything V2 pipeline (source: paper)

This training strategy bridges the gap between 2 data types: synthetic images with perfect depth information and large-scale, unlabeled real-world images.
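The data flow of these three steps can be illustrated with a deliberately tiny stand-in problem. In the sketch below, a one-dimensional least-squares line plays the role of both teacher and student, and random numbers play the role of images. None of this is the authors' training code, and in the real pipeline the teacher is a much larger network than the student. The point is only to show which data each stage sees.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def predict(model, xs):
    """Apply a fitted (a, b) line to a list of inputs."""
    a, b = model
    return [a * x + b for x in xs]


rng = random.Random(0)

# Step 1: train the teacher on "synthetic" inputs that come with exact labels.
synthetic_x = [rng.uniform(0.0, 1.0) for _ in range(200)]
synthetic_y = [2.0 * x + 1.0 for x in synthetic_x]
teacher = fit_line(synthetic_x, synthetic_y)

# Step 2: the teacher annotates unlabeled "real" inputs (pseudo-labels).
real_x = [rng.uniform(0.0, 3.0) for _ in range(1000)]
pseudo_y = predict(teacher, real_x)

# Step 3: the student is trained only on the pseudo-labeled real data.
student = fit_line(real_x, pseudo_y)
```

Note that the student never sees the synthetic labels directly: everything it learns arrives through the teacher's annotations of real data.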


The training data comprises 595K synthetic images, used to train the initial (largest) teacher model, and 62M pseudo-labeled real images, used to train the final student models. To ensure consistency during training, all images are resized to 518×518 pixels. For more details, see the table below.

Depth Anything V2 datasets (source: paper)
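The 518×518 input size mentioned above fits neatly with a Vision Transformer that splits images into 14×14-pixel patches, which is the patch size used by DINOv2 encoders. The article does not state this connection, so treat it as an assumption. The helpers below are hypothetical and only illustrate the arithmetic.

```python
PATCH_SIZE = 14  # Assumption: patch size of the DINOv2 ViT encoder.


def patch_grid(side: int, patch: int = PATCH_SIZE) -> int:
    """Number of patches along one image side; the side must divide evenly."""
    if side % patch != 0:
        raise ValueError(f"{side} is not a multiple of {patch}")
    return side // patch


def tokens_per_image(side: int, patch: int = PATCH_SIZE) -> int:
    """Number of patch tokens for a square image of the given side length."""
    return patch_grid(side, patch) ** 2


def nearest_multiple(side: int, multiple: int = PATCH_SIZE) -> int:
    """Round a side length to the nearest multiple, never going below one patch."""
    return max(multiple, round(side / multiple) * multiple)
```

Under this assumption, a 518×518 image yields a 37×37 grid, or 1,369 patch tokens.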

A new evaluation benchmark: DA-2K

The team proposed a new evaluation benchmark (DA-2K) to address two main issues with existing test sets: limited diversity and frequent noise in the depth labels.

The figure below shows the annotation pipeline (a) and the 8 scenarios considered (b): indoor, outdoor, non_real, transparent_reflective, adverse_style, aerial, underwater, and object.

DA-2K evaluation benchmark (source: paper)

The annotation pipeline (a) starts by sampling point pairs in the image with the help of a segmentation tool, the Segment Anything Model (SAM). It then compares the depth predictions of four models (Depth Anything V1, Depth Anything V2, Marigold, and GeoWizard) for each pair of points. If the four models disagree about which point is closer, the pair is flagged for human experts to label with the correct relative depth.
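The disagreement check at the heart of this pipeline is simple to sketch. The code below assumes each model's prediction is a 2D grid in which a smaller value means a closer point (real models may use the opposite convention), and all function names are hypothetical. The `pair_accuracy` helper shows one plausible way to score a model against the resulting human labels. It is an assumption, not a description of the official evaluation script.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]  # (row, column)
DepthMap = Sequence[Sequence[float]]


def closer_point(depth: DepthMap, p1: Point, p2: Point) -> int:
    """Return 0 if p1 is predicted closer, else 1 (assumes smaller value = closer)."""
    d1 = depth[p1[0]][p1[1]]
    d2 = depth[p2[0]][p2[1]]
    return 0 if d1 < d2 else 1


def needs_human_label(predictions: Dict[str, DepthMap], p1: Point, p2: Point) -> bool:
    """Flag a point pair when the models do not all agree on which point is closer."""
    votes = {closer_point(depth, p1, p2) for depth in predictions.values()}
    return len(votes) > 1


def pair_accuracy(depth: DepthMap, labeled_pairs: List[Tuple[Point, Point, int]]) -> float:
    """Fraction of labeled pairs for which the prediction matches the human label."""
    correct = sum(
        closer_point(depth, p1, p2) == label for p1, p2, label in labeled_pairs
    )
    return correct / len(labeled_pairs)


# Toy 1x2 predictions from four hypothetical models; the last one disagrees.
toy_predictions = {
    "model_a": [[1.0, 2.0]],
    "model_b": [[1.5, 3.0]],
    "model_c": [[0.5, 0.9]],
    "model_d": [[2.0, 1.0]],
}
flagged = needs_human_label(toy_predictions, (0, 0), (0, 1))
```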

Experimental results

The qualitative comparison between Depth Anything V1 and V2 shows that V2 produces much more fine-grained depth predictions than V1 and is also highly robust to transparent objects.

Depth Anything V1 vs V2 qualitative comparison (source: paper)

The figure below shows a qualitative comparison between Depth Anything V2 and ZoeDepth, which was trained on real datasets.

Depth Anything V2 vs ZoeDepth qualitative comparison (source: paper)


Depth Anything V2 surpasses its predecessors by delivering more robust and fine-grained depth predictions. Its capabilities open new possibilities in various fields, including robotics, autonomous vehicles, 3D reconstruction, and augmented reality.
