TripoSR creates detailed 3D objects from single images in a split second

TripoSR is a new open-source 3D modeling tool that reconstructs 3D objects from a single image in under 0.5 seconds. It is designed to be available to all users for commercial, academic, or personal projects and does not require a GPU.

This fast image-to-3D model addresses the complex demands of professionals in the entertainment, gaming, industrial design, and architecture industries, providing detailed 3D visualizations of objects.

TripoSR, a 3D reconstruction model that reconstructs high-quality 3D from single images in under 0.5 seconds (source: paper)

TripoSR is released under the MIT license. You can freely use it for your projects, whether they’re for work, study, or just for fun. You can download the model weights and source code here, as well as try out a demo.

Model overview

TripoSR is a rapid 3D generation model that follows the LRM network architecture (see the next picture). LRM stands for Large Reconstruction Model and was created by Adobe Research; it predicts the 3D model of an object from a single input image within just 5 seconds. TripoSR improves on LRM's data handling, architectural design, and training. The model comprises three primary elements:

  1. Image encoder that uses a vision-transformer-based architecture to convert an RGB image into a series of latent vectors. These vectors encapsulate the global and local properties of the image.
  2. Image-to-triplane decoder that converts the latent vectors into a triplane representation. The triplane is a compact way to represent 3D space using three axis-aligned 2D feature planes.
  3. Triplane-based neural radiance field (NeRF) that uses the triplane representation to predict color and density at various points in the 3D space. These predictions are then used for volumetric rendering, which ultimately produces the 3D mesh from the initial 2D image input.
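The three stages above can be sketched in code. The following is a toy PyTorch illustration of the pipeline's shape only; every layer choice, dimension, and module name here is an assumption made for clarity, not TripoSR's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_triplane(planes, points):
    # Project each 3D point (in [-1, 1]^3) onto the XY, XZ, and YZ planes,
    # bilinearly sample features from each plane, and sum the three results.
    B = planes.shape[0]
    coords = [points[..., [0, 1]], points[..., [0, 2]], points[..., [1, 2]]]
    feats = 0
    for i, uv in enumerate(coords):
        grid = uv.view(B, -1, 1, 2)                      # (B, P, 1, 2)
        f = F.grid_sample(planes[:, i], grid, align_corners=True)
        feats = feats + f.squeeze(-1).transpose(1, 2)    # (B, P, C)
    return feats

class TriplanePipelineSketch(nn.Module):
    """Illustrative sketch of the three stages; all sizes are assumptions."""

    def __init__(self, latent_dim=128, plane_res=16, plane_ch=8):
        super().__init__()
        self.plane_res, self.plane_ch = plane_res, plane_ch
        # (1) Image encoder: patchify RGB into latent tokens (a real ViT
        # adds transformer blocks; a conv patch embedding stands in here).
        self.patch_embed = nn.Conv2d(3, latent_dim, kernel_size=16, stride=16)
        # (2) Image-to-triplane decoder: tokens -> three 2D feature planes.
        self.to_planes = nn.Linear(latent_dim, 3 * plane_ch * plane_res ** 2)
        # (3) Triplane NeRF: features at a 3D point -> (RGB, density).
        self.nerf = nn.Sequential(nn.Linear(plane_ch, 64), nn.ReLU(),
                                  nn.Linear(64, 4))

    def forward(self, image, points):
        tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        planes = self.to_planes(tokens.mean(dim=1)).view(
            -1, 3, self.plane_ch, self.plane_res, self.plane_res)
        out = self.nerf(sample_triplane(planes, points))             # (B, P, 4)
        return out[..., :3].sigmoid(), out[..., 3:].relu()           # rgb, sigma
```

The color and density predicted per 3D point are what volumetric rendering then integrates along camera rays to produce images and, ultimately, the mesh.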
The overall architecture of LRM (source: LRM paper)

A triplane representation lifts the features of a 2D image into 3D space. This projection is typically conditioned on camera parameters, which describe the position and orientation of the camera relative to the scene, as well as the intrinsic properties of the camera, such as its focal length and lens distortion.
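To make the role of camera parameters concrete, here is a minimal pinhole-camera sketch showing how the extrinsics (rotation `R`, translation `t`) and intrinsics (focal lengths `fx`, `fy`, principal point `cx`, `cy`) relate a 3D point to a pixel location. This is standard textbook geometry, not code from TripoSR.

```python
import numpy as np

def project(point_world, R, t, fx, fy, cx, cy):
    """Map a 3D world point to 2D pixel coordinates with a pinhole camera."""
    p_cam = R @ point_world + t          # extrinsics: world -> camera frame
    u = fx * p_cam[0] / p_cam[2] + cx    # intrinsics: camera -> pixel coords
    v = fy * p_cam[1] / p_cam[2] + cy
    return np.array([u, v])
```

These are exactly the quantities the model is not given and must implicitly infer, as described next.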

Notably, the model is not given the camera parameters explicitly. Instead, it is allowed to “guess” them during training and inference. The model is trained to learn the relationship between the image and the triplane projection, and to infer the camera parameters based on that relationship. This approach allows the model to be more flexible and capable of handling a wide range of real-world scenarios without the need for precise camera information.

TripoSR's main improvements over LRM include (1) fine-tuning the number of channels for better information processing, (2) mask supervision to provide extra training guidance, and (3) an upgraded crop-rendering approach. This latter improvement enables the model to more effectively interpret and learn from incomplete views of objects, a common occurrence in real-world images where the full object isn't always captured.
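Mask supervision can be sketched as an extra loss term: alongside the usual pixel loss on the rendered RGB, the opacity accumulated during volumetric rendering is penalized against the ground-truth foreground mask. The binary cross-entropy choice and the weighting below are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def training_loss(pred_rgb, gt_rgb, pred_alpha, gt_mask, mask_weight=0.1):
    """Reconstruction loss with mask supervision (weights are assumptions)."""
    rgb_loss = F.mse_loss(pred_rgb, gt_rgb)
    # Rendered alpha should match the binary foreground mask; clamping
    # keeps the BCE numerically stable at exact 0 or 1 opacities.
    mask_loss = F.binary_cross_entropy(
        pred_alpha.clamp(1e-5, 1 - 1e-5), gt_mask)
    return rgb_loss + mask_weight * mask_loss
```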

Training

TripoSR was trained for 5 days on 22 GPU nodes, each equipped with 8 A100 40GB GPUs. To enhance training efficiency, the authors rendered 128×128 patches from the larger 512×512 original images, with an emphasis on the foreground areas. This approach allows the model to concentrate on the most important regions for reconstruction, offering a balance between computational efficiency and the retention of detailed surface features.

The training data consists of a carefully chosen subset from the Objaverse dataset, using diverse rendering techniques that better simulate the range of images seen in everyday life.

Evaluation

TripoSR was evaluated using quantitative and qualitative metrics and compared to previous state-of-the-art methods. The authors used two different datasets for 3D reconstruction: GSO and OmniObject3D.

TripoSR quantitative comparison with other state-of-the-art methods (source: paper)
Quantitative comparison of different techniques on GSO validation set, where CD and FS refer to Chamfer Distance and F-score respectively (data source: paper)
Quantitative comparison of different techniques on OmniObject3D validation set (data source: paper)
Qualitative results of TripoSR output meshes compared to other state-of-the-art methods on GSO and OmniObject3D (source: paper)
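For reference, the two metrics in the tables can be computed between sampled point clouds as follows. These are the common symmetric definitions of Chamfer Distance and F-score; the distance threshold `tau` varies between papers, so the value below is only a placeholder.

```python
import numpy as np

def _nn_dists(a, b):
    # For each point in a, distance to its nearest neighbor in b
    # (brute force; fine for small clouds used in illustration).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between point clouds a and b."""
    return _nn_dists(a, b).mean() + _nn_dists(b, a).mean()

def f_score(a, b, tau=0.05):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = (_nn_dists(a, b) < tau).mean()
    recall = (_nn_dists(b, a) < tau).mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```

Lower Chamfer Distance and higher F-score indicate a closer match between the reconstructed and ground-truth surfaces.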

TripoSR demonstrates its ability to generate detailed 3D models in approximately 0.5 seconds, surpassing the capabilities of other open-source image-to-3D models, such as OpenLRM.

Conclusion

TripoSR’s high performance in generating high-fidelity 3D models from single images with exceptional speed and accuracy opens up new possibilities across various industries. The model offers four main advantages: it is open-source, fast, accurate, and does not require a GPU.

The developers encourage other programmers, designers, and innovators to participate in its ongoing development.
