Meta’s VGGT reconstructs 3D scenes in seconds [CVPR 2025]

VGGT (Visual Geometry Grounded Transformer) is an AI model that reconstructs a 3D scene directly from one or more of its views in a fast, single-step process.

Meta AI, in collaboration with the Visual Geometry Group at the University of Oxford, released VGGT, a feed-forward neural network designed to quickly extract key 3D attributes from images and videos, including camera parameters, point maps, depth maps, and 3D point tracks. VGGT has multiple applications in 3D reconstruction, point tracking, and novel view synthesis.

The source code and pre-trained models are publicly available on GitHub, facilitating research and development in 3D scene understanding.

Contents

  1. Key points
  2. Overview of VGGT design and training
  3. Evaluation
  4. How to use VGGT
  5. Conclusion
  6. References

Key points

  • Standard transformer architecture: VGGT uses a standard transformer with alternating frame-wise and global attention for efficient processing.
  • Fast 3D inference: No geometry optimization needed – VGGT predicts camera pose, depth, and 3D points in a single step – completing the process in seconds.
  • Cutting-edge accuracy: Achieves top results on multiple 3D applications, including camera pose estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking.

Overview of VGGT design and training

VGGT is built on a large, standard transformer architecture of the kind originally designed for language tasks (e.g., BERT, GPT) and later adapted to vision (e.g., ViT). Unlike traditional 3D models that rely on convolutional layers or predefined depth information, VGGT relies solely on attention mechanisms.

The picture below illustrates the processing steps:

VGGT architecture overview (source: project page)
  1. Input processing (DINO patching and tokenization): VGGT is fed one or more images (up to hundreds) of a scene from different angles – like photos of a room or a street. These can come from a camera, a phone, or video frames. The model uses DINO (a self-supervised visual feature extractor) to divide each image into smaller, non-overlapping patches, which are turned into tokens. Camera tokens are then added, carrying information about each camera’s position and orientation.
  2. Attention mechanisms: The tokens (image + camera) go into a standard transformer (not custom-built for 3D). Inside the transformer, VGGT alternates between two types of attention: frame-wise attention (to capture details within each image independently) and global attention (to capture relationships across all images) – see the sketch after this list.
  3. Output generation: The processed tokens are fed into two output heads: the Camera Head, which predicts each camera’s position and orientation, and the DPT Head, which generates the depth maps (how far things are), point maps (dense 3D points), and the features used for point tracks (the same physical points followed across frames).
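
To make the alternating-attention idea from step 2 concrete, here is a minimal PyTorch sketch of one frame-wise plus global attention block. It illustrates the pattern only and is not VGGT’s actual implementation: the layer names, normalization placement, and the omitted MLP sub-layers are simplifying assumptions.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-wise + one global self-attention step over multi-view tokens.

    Illustrative sketch only: sizes and sub-layers are simplified compared
    to the real VGGT backbone.
    """
    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_frames, tokens_per_frame, dim)
        B, S, N, D = tokens.shape

        # Frame-wise attention: each frame attends only to its own tokens.
        x = tokens.reshape(B * S, N, D)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: all tokens from all frames attend to each other.
        x = x.reshape(B, S * N, D)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]

        return x.reshape(B, S, N, D)
```

Stacking many such blocks lets per-frame detail and cross-frame geometry be refined in turn, without any 3D-specific inductive bias.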

This entire procedure is carried out in a single forward pass, without post-processing, and completes within seconds on a high-performance GPU. For instance, processing 32 input views takes just 0.51 seconds. The table below presents VGGT’s runtime and GPU memory consumption for various numbers of input frames.

Input frames       1      2      4      8      10     20     50      100     200
Time (s)           0.04   0.05   0.07   0.11   0.14   0.31   1.04    3.12    8.75
Peak memory (GB)   1.88   2.07   2.45   3.23   3.63   5.58   11.41   21.15   40.63
Runtime and peak GPU memory usage across different numbers of input frames (source: paper)
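
If you want to reproduce similar runtime and peak-memory numbers on your own hardware, a small timing helper along these lines can be used. This is my own sketch, not the authors’ benchmarking code; `model` and `images` stand for a loaded VGGT model and a preprocessed stack of frames on the GPU (see the usage example later in the post).

```python
import time
import torch

def benchmark(model: torch.nn.Module, images: torch.Tensor) -> tuple[float, float]:
    """Time one forward pass and report peak GPU memory in GB."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model(images)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return elapsed, peak_gb
```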

What makes VGGT powerful for 3D reconstruction is its training on large publicly available datasets with 3D annotations. This extensive training enables accurate 3D scene understanding.

The training was conducted on 64 A100 GPUs over nine days.

Evaluation

VGGT was evaluated on several public datasets, including ScanNet, KITTI, CO3Dv2, and RealEstate10K. The model demonstrated state-of-the-art results on multiple 3D tasks, such as:

Camera pose estimation (the position and orientation of a camera relative to a specific reference frame or environment): VGGT outperformed competing methods across all metrics on the CO3Dv2 and RealEstate10K datasets.

The table below shows the evaluation results for camera pose estimation. The test was conducted on the RealEstate10K and CO3Dv2 datasets using 10 randomly selected frames; none of the tested methods were trained on the RealEstate10K (Re10K) dataset. Runtime was measured on a single H100 GPU. Some of the compared methods are concurrent (parallel) research efforts.

Methods                Re10K (unseen)   CO3Dv2      Time
                       AUC@30 ↑         AUC@30 ↑
Colmap+SPSG            45.2             25.3        ∼ 15s
PixSfM                 49.4             30.1        > 20s
PoseDiff               48.0             66.5        ∼ 7s
DUSt3R                 67.7             76.7        ∼ 7s
MASt3R                 76.4             81.8        ∼ 9s
VGGSfM v2              78.9             83.4        ∼ 10s
MV-DUSt3R              71.3             69.5        ∼ 0.6s
CUT3R                  75.3             82.8        ∼ 0.6s
FLARE                  78.8             83.3        ∼ 0.5s
Fast3R                 72.7             82.5        0.2s
VGGT (Feed-Forward)    85.3             88.2        0.2s
VGGT (with BA)         93.5             91.8        ∼ 1.8s
Camera Pose Estimation on RealEstate10K and CO3Dv2 with 10 random frames; higher values indicate better performance (source: paper)
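
As a rough aid to reading the AUC@30 columns: the score is the area under the accuracy curve min(RRA@τ, RTA@τ) for angular thresholds τ up to 30 degrees, where RRA and RTA are relative rotation and translation accuracies. Here is a small NumPy sketch of that computation, written as my own helper following the protocol commonly used for these benchmarks; it is not code from the paper and the exact threshold handling there may differ.

```python
import numpy as np

def auc_at_30(rot_err_deg: np.ndarray, trans_err_deg: np.ndarray) -> float:
    """Area under the min(RRA@t, RTA@t) curve for thresholds t = 1..30 degrees.

    rot_err_deg / trans_err_deg: per-pair angular errors of the predicted
    relative rotation and translation, in degrees.
    """
    thresholds = np.arange(1, 31)
    rra = np.array([(rot_err_deg < t).mean() for t in thresholds])    # rotation accuracy
    rta = np.array([(trans_err_deg < t).mean() for t in thresholds])  # translation accuracy
    return float(np.minimum(rra, rta).mean())
```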

Multi-view depth estimation and dense point cloud reconstruction from multiple photos taken from different angles: VGGT was compared with other models such as DUSt3R and MASt3R. VGGT and DUSt3R work without knowing the exact camera positions (ground-truth camera poses), while MASt3R uses camera position data to help estimate depth. The results show that VGGT significantly outperformed these models: it not only produced better 3D reconstructions, but also did so much faster.
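
For background on how a depth map and camera parameters combine into a dense point cloud, here is a short NumPy sketch of standard pinhole-camera unprojection. This is generic geometry with hypothetical function and argument names, not VGGT’s code (VGGT can also predict point maps directly).

```python
import numpy as np

def depth_to_world_points(depth: np.ndarray, K: np.ndarray, cam_to_world: np.ndarray) -> np.ndarray:
    """Unproject an HxW depth map into world-space 3D points (pinhole model).

    depth:        (H, W) per-pixel z-depth
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T              # camera-space rays with z = 1
    points_cam = rays * depth.reshape(-1, 1)        # scale each ray by its depth
    points_h = np.concatenate([points_cam, np.ones((points_cam.shape[0], 1))], axis=1)
    return (points_h @ cam_to_world.T)[:, :3]       # world-space XYZ
```

Merging the unprojected points from all views, using the per-view camera poses the model predicts, yields the dense reconstruction.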

The image below provides more examples of VGGT’s point map estimation. For a clearer and more interactive experience, check out the model’s interactive demo.

The 3D information generated by VGGT from an aerial movie of
the Colosseum in Rome, Italy (source: project page)

3D point tracking (following specific points or objects through three-dimensional space over time): VGGT’s tracking module successfully generated keypoint tracks for unordered sets of images depicting static scenes, outperforming DUSt3R (see the picture below).

Qualitative comparison of VGGT’s predicted 3D points to DUSt3R on in-the-wild images (source: project page)

Downstream task enhancement: Using VGGT as a pre-trained feature backbone significantly enhanced the performance of downstream tasks, including feed-forward novel view synthesis.

Qualitative examples of novel view synthesis. The top row shows the input images, the middle row displays the ground truth images from target viewpoints, and the bottom row presents the synthesized images by VGGT (source: paper)

Overall, during evaluation VGGT demonstrated superior performance over transformer-based models like DUSt3R and MASt3R, as well as traditional geometry-based methods such as COLMAP and Structure-from-Motion (SfM) pipelines, in both accuracy and efficiency. This success suggests that a simple design, when trained on large datasets, can outperform more complex models.

It’s worth noting that while VGGT demonstrated superior performance in various tasks, specific comparisons with models like NeRF and ViTs were not detailed in the available sources.

For deeper insights into the advancements in visual synthesis through generative models, explore our posts on Nvidia’s Video LDMs, CAT4D – a diffusion model for multi-angle 3D video generation, Depth Anything V2 – a powerful depth estimation model, and TripoSR – an open-source 3D modeling tool that reconstructs detailed 3D objects from single images in seconds.

How to use VGGT

The research team has made the code and models for VGGT publicly available. You can access them in the following ways:

  • Access the official GitHub repository to find the source code, pre-trained models, and comprehensive documentation. Clone the repository, install the dependencies (Python, PyTorch, torchvision, NumPy), and initialize the model (a minimal usage sketch follows this list). With these steps you can freely use VGGT for your own 3D computer vision tasks.
  • Try the demo: a demo of VGGT is available as a Hugging Face Space, where you can upload photos or videos and generate a 3D model. Example videos and photos are also provided.
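
Below is a minimal usage sketch modeled on the pattern in the repository README. The module paths, the load_and_preprocess_images helper, and the facebook/VGGT-1B checkpoint name reflect the README at the time of writing and may change; the image paths are placeholders.

```python
import torch
from vggt.models.vggt import VGGT
from vggt.utils.load_fn import load_and_preprocess_images

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the pre-trained model from the Hugging Face Hub.
model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

# Any number of views of the same scene (paths are placeholders).
image_names = ["scene/frame_000.png", "scene/frame_001.png", "scene/frame_002.png"]
images = load_and_preprocess_images(image_names).to(device)

# One feed-forward pass returns camera parameters, depth maps,
# point maps, and related outputs.
with torch.no_grad():
    predictions = model(images)
```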

Note: VGGT runs fastest on high-end GPUs, especially when FlashAttention-3 is enabled; this optimization significantly accelerates inference. For instance, on a single A100 GPU with FlashAttention-3, processing 20 frames can take approximately 0.3 seconds. Without FlashAttention-3, VGGT still operates efficiently but with somewhat longer processing times: with FlashAttention-2, the same task takes around 0.49 seconds on an A100 GPU.

Conclusion

VGGT is an advanced AI model that combines machine learning and visual geometry in a single step to perform complex 3D reconstruction tasks. It is able to predict key 3D attributes such as camera poses, depth maps, point clouds, and point tracks with high accuracy and speed, often in under a second on a high-end GPU.

Future research might focus on extending VGGT for dynamic scenes or adapting it to larger datasets.

References
