CAT4D is a new AI model for creating 4D scenes, i.e., dynamic 3D scenes, from single-camera videos. It turns a standard video into an interactive experience in which viewers can explore the scene from any angle and at any moment in time.
CAT4D was proposed by a research team from Google DeepMind and Columbia University. It uses a multi-view video diffusion model to generate novel views of a scene at any desired viewpoint and moment in time, including angles the original camera never captured.
Visit the project page for sample videos and interactive demos. The interactive features require Chrome 130+ with WebGPU.
The model generates highly realistic, dynamic multi-view videos, outperforming leading baselines such as 4DiM.
The image below illustrates CAT4D’s ability to create 4D scenes from just three input images or video frames, whether real or synthetic.

What are multi-view video models?
Multi-view video models generate synchronized videos of the same scene from multiple perspectives, adding a time axis to 3D, hence “4D”. This makes them especially useful for applications like virtual reality, gaming, and film production.

The four squares above illustrate different types of output sequences (orange) generated from one or several input images (grey):
- Video models generate a sequence of frames that change over time, but you can’t control the camera’s position.
- Multi-view models generate images from different angles at single moments in time.
- Camera-controlled video models let you steer the camera position over time, but they produce only a single camera trajectory rather than all viewpoints at once.
- Multi-view video models generate views from all angles at all times, letting you see the scene from every angle simultaneously, as sketched in the code below.
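To make the taxonomy concrete, here is a minimal Python sketch that treats each model class as a different pattern of (camera, time) indices; `generate_frame` is a hypothetical stand-in for whatever generative model produces a single frame:

```python
from itertools import product

def generate_frame(camera: int, time: int) -> str:
    """Hypothetical stand-in: produce one frame for a (camera, time) pair."""
    return f"frame(cam={camera}, t={time})"

cameras, times = range(4), range(4)  # a 4x4 grid of viewpoints x timestamps

# Video model: time advances, but the viewpoint is fixed to the input camera.
video = [generate_frame(camera=0, time=t) for t in times]

# Multi-view model: the viewpoint varies, but time is frozen at one moment.
multi_view = [generate_frame(camera=c, time=0) for c in cameras]

# Camera-controlled video model: one trajectory, one camera per timestamp.
trajectory = [generate_frame(camera=c, time=t) for c, t in zip(cameras, times)]

# Multi-view video model (4D): the full grid, every camera at every time.
grid = [generate_frame(camera=c, time=t) for c, t in product(cameras, times)]
```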
Obstacles in 3D and 4D scene reconstruction
The media we use to capture the real world – images and videos – only provide partial snapshots of specific moments. To fully represent the depth, structure, or temporal evolution of a scene, we need 3D and 4D reconstruction.
- 3D reconstruction involves creating static 3D models of a scene or object from 2D inputs, such as photographs or videos. This requires multiple images of the scene taken from different angles, and the quality of the result depends on careful capture, such as consistent lighting and sufficient overlap between views.
- 4D reconstruction extends 3D reconstruction by adding the time dimension, resulting in a dynamic 3D model. Examples include 4K4D, L4GM, and Stable Video 4D. This process is much harder because it demands synchronized multi-view video capture.
While data-driven methods have significantly advanced 3D reconstruction of static scenes, extending these techniques to 4D remains challenging because of the difficulty of obtaining the extensive training data required. CAT4D overcomes this limitation with a multi-view video diffusion model that generates high-quality, photorealistic 4D content from a limited set of input views.
How CAT4D works
CAT4D follows a two-stage process (see the picture below):
- Stage 1: A monocular video is fed into a multi-view video diffusion model, which generates multiple views from different angles and time points, effectively simulating a multi-camera setup.
- Stage 2: The synthesized views are then processed by a deformable 3D Gaussian model to reconstruct a 4D scene, from which highly realistic novel views can be rendered at any desired angle and time point.

Starting with a video captured by a single moving camera, additional video frames are created (orange frames) as if they were captured by multiple stationary cameras placed strategically around the scene. These virtual cameras can be positioned at the same spots as the original camera (gray circles) or in entirely new locations (blue circles). The generated frames are used to reconstruct a 4D scene with deforming 3D Gaussians. While the input trajectory is shown with changing viewpoints, this method also supports fixed-viewpoint videos.
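In code, the two-stage pipeline might look like the following. This is a minimal sketch under assumed interfaces, not the authors’ implementation; `MultiViewVideoDiffusion` and `fit_deformable_gaussians` are hypothetical placeholders for the paper’s diffusion model and Gaussian optimizer:

```python
import numpy as np

class MultiViewVideoDiffusion:
    """Hypothetical stand-in for CAT4D's multi-view video diffusion model."""

    def sample(self, cond_frames, cond_poses, target_pose, target_time):
        # A real model would run iterative denoising conditioned on the input
        # frames and poses; this stub just returns a blank frame.
        return np.zeros((256, 256, 3))

def fit_deformable_gaussians(frames_by_view_and_time):
    # Hypothetical stand-in for Stage 2: optimize a set of 3D Gaussians plus
    # a deformation field against the generated (viewpoint, time) grid.
    return {"gaussians": None, "deformation_field": None}

def cat4d_pipeline(input_frames, input_poses, virtual_poses, timestamps):
    # Stage 1: synthesize the frames that a grid of stationary virtual
    # cameras would have captured at every timestamp.
    diffusion = MultiViewVideoDiffusion()
    generated = {
        (v, t): diffusion.sample(input_frames, input_poses, pose, t)
        for v, pose in enumerate(virtual_poses)
        for t in timestamps
    }
    # Stage 2: reconstruct a renderable 4D scene from the dense grid.
    return fit_deformable_gaussians(generated)

# Usage with dummy inputs: 8 input frames, 5 virtual cameras, 8 timestamps.
frames = [np.zeros((256, 256, 3)) for _ in range(8)]
poses = [np.eye(4) for _ in range(8)]
scene = cat4d_pipeline(frames, poses, [np.eye(4)] * 5, range(8))
```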
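Stage 2’s representation, deforming 3D Gaussians, can be pictured as a static set of Gaussians whose centers are displaced by a time-conditioned deformation field. A toy sketch, assuming a simple offset-based design (the paper’s actual parameterization may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# A static set of 3D Gaussians: centers, scales, and colors.
centers = rng.normal(size=(1000, 3))
scales = np.full((1000, 3), 0.01)
colors = rng.uniform(size=(1000, 3))

def deform(centers: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical deformation field: offsets each Gaussian center at time t.
    A trained model would predict these offsets with a small network; here a
    fixed sinusoidal motion stands in purely for illustration."""
    dx = 0.05 * np.sin(2 * np.pi * t + centers[:, :1])  # (N, 1) x-offsets
    return centers + np.concatenate([dx, np.zeros((len(centers), 2))], axis=1)

# Rendering at a chosen (viewpoint, time): move the Gaussians to time t, then
# splat them from the chosen camera (the splatting itself is omitted here).
centers_at_t = deform(centers, t=0.25)
```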
Training datasets: CAT4D was trained on a diverse mixture of real and synthetic data: synthetic 4D scenes, real-world multi-view images for learning camera motion, and real-world monocular videos captured from static viewpoints.
Experiments
The model’s ability to independently control camera viewpoint and time was evaluated on the NSFF dataset, with 4DiM as the baseline. Starting from three input images (row 1 in the next figure), CAT4D generates three sequences of images with the following properties:
- varying viewpoints while keeping time fixed (row 2)
- fixed viewpoint while varying time (row 3)
- both changing viewpoints and time (row 4)
Comparing 4DiM (column 1), CAT4D (column 2), and the ground truth (column 3) shows that CAT4D’s outputs are closer to the ground truth, with better control and higher image quality.
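The figure makes this comparison qualitatively; closeness to ground truth is commonly quantified with per-frame metrics such as PSNR. A minimal sketch of such a measurement (not the authors’ evaluation code):

```python
import numpy as np

def psnr(generated: np.ndarray, ground_truth: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, 1]."""
    mse = np.mean((generated - ground_truth) ** 2)
    return float("inf") if mse == 0 else -10.0 * np.log10(mse)

# Example: compare a generated frame against its ground-truth counterpart.
rng = np.random.default_rng(0)
gt = rng.uniform(size=(128, 128, 3))
gen = np.clip(gt + rng.normal(scale=0.05, size=gt.shape), 0.0, 1.0)
print(f"PSNR: {psnr(gen, gt):.2f} dB")  # higher means closer to ground truth
```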

The model was also tested on creating 3D scenes from a few images with a bullet-time effect, where time slows down while the camera moves around the scene. As shown below, CAT4D produces reconstructions that are more accurate and closer to the ground truth (GT) than CAT3D-1cond and CAT3D-2cond.

Visit the project page for more experimental results.
Limitations and future work
CAT4D has limitations in both accuracy and versatility. The model sometimes confuses changes in camera angle with changes in time, particularly when moving objects occlude each other. And while the generated scenes look realistic from different viewpoints, the recovered motion of objects in 3D space is not always accurate.
The researchers propose overcoming these limitations by training larger multi-view video models and incorporating additional supervision signals, such as depth and motion estimates. Depth aids in understanding the scene’s spatial layout, while motion estimates capture object movements.
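A common way to incorporate such signals is as auxiliary loss terms that penalize disagreement between the reconstruction’s rendered depth and motion and off-the-shelf monocular estimates. A hedged sketch of what that objective might look like, with all inputs hypothetical:

```python
import numpy as np

def l1(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean(np.abs(a - b)))

def auxiliary_loss(rendered_depth, estimated_depth,
                   rendered_flow, estimated_flow,
                   w_depth: float = 0.1, w_flow: float = 0.1) -> float:
    """Hypothetical auxiliary objective: penalize disagreement between the
    reconstruction's depth/motion and off-the-shelf monocular estimates."""
    return (w_depth * l1(rendered_depth, estimated_depth)
            + w_flow * l1(rendered_flow, estimated_flow))

# Dummy example with random maps standing in for real renders and estimates.
rng = np.random.default_rng(0)
d_r, d_e = rng.uniform(size=(64, 64)), rng.uniform(size=(64, 64))
f_r, f_e = rng.normal(size=(64, 64, 2)), rng.normal(size=(64, 64, 2))
print(auxiliary_loss(d_r, d_e, f_r, f_e))
```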
Conclusion
CAT4D is a new technology designed to transform single-view videos into immersive 4D experiences. It employs a multi-view video diffusion model to generate multiple perspectives of the same scene, which are then used to reconstruct dynamic 3D scenes.
The model has potential real-world applications across fields such as robotics, medical imaging, augmented reality, and scientific research.