4K4D uses AI to create high-fidelity 4D views of dynamic scenes

October 23, 2023

4K4D is a new AI method for real-time, high-quality rendering of dynamic 3D scenes (4D) at 4K resolution. Given multiple input videos of the same scene, it can render that scene from any viewpoint. Notably, 4K4D is more than 30 times faster than previous methods.

It was proposed by a research team from Zhejiang University, Image Derivative Inc. and Ant Group.

Visit the project page here. The source code has not been released yet, but the authors said it will be available soon.

Main tools: 4K4D uses a space-carving algorithm to extract point clouds of the scene from multiple videos, a neural network that learns the appearance of the scene from those point clouds, and GPUs (Nvidia’s RTX 3090 and RTX 4090) for hardware-accelerated rendering.

Here are some examples of images created by 4K4D. It generates highly detailed and clear images at a resolution of 1125×1536 pixels. It can also render these images very fast, achieving over 200 frames per second (FPS) on an RTX 3090 GPU, while maintaining state-of-the-art quality on the DNA-Rendering dataset.

4K4D generates photorealistic and real-time dynamic 3D scenes. (source: paper)

Imagine being able to watch a 3D movie from any angle you want, or to explore a virtual world that looks and feels as real as the one you live in. This is the vision of dynamic view synthesis, a field of computer graphics that aims to create realistic and high-quality images of moving 3D scenes from different perspectives.

In practice, this means using multiple videos of a scene to build a 3D model of it, and then rotating and zooming the model to view the scene from any angle. The technology could power immersive 3D movies, video games, and other virtual experiences.

However, achieving this vision is not easy: current methods are slow and struggle to render high-resolution images.

4K4D uses two main ideas to address these challenges:

A 4D point cloud representation that supports hardware rasterization and enables unprecedented rendering speed. This is a way of representing 3D scenes as a collection of points that also have a time dimension.
A novel hybrid appearance model that improves the rendering quality and helps produce realistic and accurate images.

4K4D pipeline

The method follows 4 main steps (see the next figure):

(a) Point Cloud Sequence: The method takes multiple videos of the same scene, captured from different angles, and applies a space-carving algorithm to extract the initial point cloud sequence (x, t) of the target scene, that is, the points that form the scene at each time step. These points change their position and appearance over time, just like the objects in the scene. The method then uses a 4D feature grid to assign a feature vector to each point, and the feature vectors are fed into multilayer perceptron networks (MLPs).
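The carving idea in step (a) can be illustrated with a minimal sketch. It assumes toy orthographic cameras and tiny binary foreground masks, and the helper names (`inside_mask`, `carve`, `carve_sequence`) are hypothetical rather than taken from the 4K4D code. A candidate 3D point is kept only if it projects onto the foreground in every view.

```python
from itertools import product


def inside_mask(pixel, mask):
    """Return True if the pixel (u, v) lies on a foreground cell of the mask."""
    u, v = pixel
    return 0 <= v < len(mask) and 0 <= u < len(mask[0]) and mask[v][u] == 1


def carve(candidates, views):
    """Keep the candidate points that project onto the foreground in every view.

    views is a list of (project, mask) pairs, where project maps a 3D point
    to a pixel (u, v). Both are illustrative stand-ins for real cameras.
    """
    return [
        point
        for point in candidates
        if all(inside_mask(project(point), mask) for project, mask in views)
    ]


def carve_sequence(candidates, views_per_frame):
    """Carve every frame and tag each surviving point with its frame index t."""
    return [
        (point, t)
        for t, views in enumerate(views_per_frame)
        for point in carve(candidates, views)
    ]


# Toy scene: a 3x3x3 lattice of candidate points seen by two orthographic cameras.
candidates = list(product(range(3), repeat=3))
front = (lambda p: (p[0], p[1]), [[0, 1, 0], [0, 1, 0], [0, 1, 0]])  # looks along z
side = (lambda p: (p[2], p[1]), [[0, 0, 0], [1, 1, 1], [0, 0, 0]])   # looks along x

points = carve(candidates, [front, side])
sequence = carve_sequence(candidates, [[front, side], [front, side]])
```

In this toy setup only the three lattice points with x = 1 and y = 1 survive. A real pipeline would use calibrated perspective cameras and segmentation masks extracted from the input videos.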

(b) Geometry: The geometry model uses the point location, radius, and density to create a semi-transparent point cloud that represents the shape of the scene.
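To make "semi-transparent" concrete, here is a hedged sketch of how a single point could contribute opacity to a pixel. The falloff used here (density scaled by a quadratic term that reaches zero at the point radius) is an assumption chosen for illustration, since the article does not spell out the exact formula, and `point_alpha` is a hypothetical name.

```python
def point_alpha(pixel, center, radius, density):
    """Opacity that one projected point contributes to one pixel.

    pixel and center are 2D image coordinates; radius is in pixels.
    The quadratic falloff is an illustrative assumption.
    """
    du = pixel[0] - center[0]
    dv = pixel[1] - center[1]
    dist_sq = du * du + dv * dv
    # 1 at the point center, 0 at or beyond the radius.
    falloff = max(1.0 - dist_sq / (radius * radius), 0.0)
    return density * falloff


alpha_center = point_alpha((0, 0), (0, 0), radius=2.0, density=0.8)
alpha_inside = point_alpha((1, 0), (0, 0), radius=2.0, density=0.8)
alpha_outside = point_alpha((3, 0), (0, 0), radius=2.0, density=0.8)
```

The point is fully weighted at its center, fades toward its edge, and contributes nothing outside its radius, which is what lets overlapping points blend instead of producing hard edges.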

(c) Appearance: The 4K4D system uses a hybrid appearance model that combines two techniques to represent the visual features of a scene. The first technique is image-based rendering (IBR), which captures the high-frequency details and shadows of the scene. The second technique is spherical harmonics (SH), which captures the smooth, continuous low-frequency lighting variations of the scene. By combining IBR and SH, the 4K4D system can create realistic and dynamic representations of complex scenes.

(d) Differentiable Depth Peeling: This algorithm exploits the hardware rasterizer to render the dynamic scene representation from different angles at very high speed and quality. Because the algorithm is differentiable, the model can be trained directly from RGB video inputs.
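The idea behind depth peeling can be sketched for a single pixel on the CPU. This is an illustration under simplifying assumptions, not the GPU implementation: each pass "peels" the nearest fragment that lies behind the previous layer, and the peeled layers are then composited front to back. The names `peel_layers` and `composite` are hypothetical.

```python
def peel_layers(fragments, num_passes):
    """Return up to num_passes fragments for one pixel, nearest first.

    fragments is a list of (depth, alpha, rgb) tuples. Each pass keeps the
    nearest fragment strictly behind the previous layer, so fragments that
    share an identical depth are collapsed into one layer in this sketch.
    """
    layers = []
    last_depth = float("-inf")
    for _ in range(num_passes):
        behind = [f for f in fragments if f[0] > last_depth]
        if not behind:
            break
        nearest = min(behind, key=lambda f: f[0])
        layers.append(nearest)
        last_depth = nearest[0]
    return layers


def composite(layers):
    """Front-to-back alpha compositing of the peeled layers."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for _, alpha, rgb in layers:
        for ch in range(3):
            color[ch] += transmittance * alpha * rgb[ch]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance


# A half-transparent red point in front of an opaque blue point.
fragments = [(2.0, 1.0, (0.0, 0.0, 1.0)), (1.0, 0.5, (1.0, 0.0, 0.0))]
two_pass = composite(peel_layers(fragments, num_passes=2))
one_pass = composite(peel_layers(fragments, num_passes=1))
```

The compositing step consists only of multiplications and additions, which is why the same computation can be differentiated when written in an automatic differentiation framework. The number of passes trades speed for how many overlapping layers are captured.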

In short, the 4K4D system first obtains a sequence of point clouds from the dynamic scene. It then uses the 4D feature grid to assign a feature vector to each point in the scene.

MLPs decode this feature vector into the point’s radius and density, which together with its position describe the geometry, and into the spherical harmonics coefficients used for appearance. 4K4D then applies the hybrid appearance model to describe how objects in the scene look.

Finally, it renders images of the scene from different viewpoints with the differentiable depth peeling algorithm, which exploits the hardware rasterizer to achieve unprecedented rendering speed.

Datasets

4K4D was trained and evaluated on multi-view datasets including DNA-Rendering, ENeRF-Outdoor, NHR, and Neural3DV.

Training

The model was trained by using a supervised learning approach, where it compared the images it rendered from 4D point clouds to the ground-truth images. During the training stage the model learned to create realistic 4K images from 4D point clouds by minimizing three types of errors:

The color error measures how close the rendered pixel colors are to those of the ground-truth images.
The perceptual error measures how similar the image features of the rendered images are to those of the ground-truth images, which helps the renders look realistic.
The mask error measures how well the dynamic regions in the rendered images match those in the ground-truth images.
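The three errors above can be combined into a single training objective. The sketch below is a simplified illustration: images and masks are flat lists of floats, every term uses mean squared error, the weights `w_perc` and `w_mask` are hypothetical values, and `toy_features` is a stand-in for the pretrained network that a real perceptual loss would use.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_features(image):
    """Stand-in feature extractor: differences between neighboring pixels."""
    return [b - a for a, b in zip(image, image[1:])]


def total_loss(pred_rgb, gt_rgb, pred_mask, gt_mask,
               features=toy_features, w_perc=0.01, w_mask=0.01):
    """Weighted sum of color, perceptual, and mask errors (weights assumed)."""
    color = mse(pred_rgb, gt_rgb)
    perceptual = mse(features(pred_rgb), features(gt_rgb))
    mask = mse(pred_mask, gt_mask)
    return color + w_perc * perceptual + w_mask * mask


gt_rgb = [0.0, 0.0, 0.0, 0.0]
gt_mask = [0.0, 1.0, 1.0, 0.0]
perfect = total_loss(gt_rgb, gt_rgb, gt_mask, gt_mask)
imperfect = total_loss([0.0, 1.0, 0.0, 0.0], gt_rgb, gt_mask, gt_mask)
```

A perfect render yields zero loss, and any deviation in color, features, or mask increases it, which gives the optimizer a single number to minimize.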

Evaluation

4K4D was evaluated on several public datasets, such as DNA-Rendering, ENeRF-Outdoor and Neural3DV, and compared with previous methods. It outperformed them in rendering speed while remaining competitive in rendering quality.

4K4D achieved a rendering rate of more than 400 FPS when working with the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution, utilizing an RTX 4090 GPU. This is significantly faster than previous methods, which typically render images at around 10 FPS or less.
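To put these frame rates in perspective, the short calculation below converts them into per-frame time budgets and pixel throughput. It assumes that 1080p means 1920×1080 and that 4K means 3840×2160; the exact resolutions used for the benchmarks may differ.

```python
def frame_time_ms(fps):
    """Time available to render one frame, in milliseconds."""
    return 1000.0 / fps


def pixels_per_second(width, height, fps):
    """Number of pixels rendered per second at a given resolution and rate."""
    return width * height * fps


budget_1080p = frame_time_ms(400)   # 2.5 ms per frame
budget_4k = frame_time_ms(80)       # 12.5 ms per frame
budget_10fps = frame_time_ms(10)    # 100.0 ms per frame

throughput_1080p = pixels_per_second(1920, 1080, 400)  # 829,440,000 pixels/s
throughput_4k = pixels_per_second(3840, 2160, 80)      # 663,552,000 pixels/s
```

At 400 FPS the renderer has only 2.5 ms per frame, compared with 100 ms for a method running at 10 FPS.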

Here’s a qualitative comparison of different methods on the ENeRF-Outdoor and Neural3DV datasets:

Qualitative comparison of different methods on the ENeRF-Outdoor dataset (which contains 960 × 540 images). 4K4D produces much higher quality images than ENeRF and can render them 24 times faster. (source: paper)

Qualitative comparison of different methods on the Neural3DV dataset (which contains 1352×1224 images). 4K4D preserves the fine details of moving objects and keeps occlusion boundaries sharp. (source: paper)

Limitations

The model has some drawbacks that need to be addressed in future work. It cannot track the motion of points across frames and it also needs a lot of storage space for long videos.

Conclusion

4K4D is a very fast and powerful new AI tool for generating high-quality images of dynamic 3D scenes. It has many potential applications, such as creating virtual experiences, developing new video games, and improving autonomous driving systems.