Nvidia’s new high-resolution text-to-video synthesis with Latent Diffusion Models

Nvidia’s AI Lab in Toronto has introduced a text-to-video generation model based on Latent Diffusion Models (LDMs). The research team integrated a temporal dimension into LDMs and produced high-quality videos while keeping computational costs low.

LDMs can produce outstanding images without requiring large computational resources because they are trained in a compressed latent space. By building on the LDM approach, the research team was able to synthesize videos of state-of-the-art quality.
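To make the compression idea concrete, here is a minimal sketch in PyTorch-style Python. The shapes are illustrative assumptions (an autoencoder with 8x spatial downsampling and 4 latent channels, as in Stable Diffusion), not Nvidia’s exact configuration; the point is simply how much smaller the latent is than the pixel frame the diffusion model would otherwise have to denoise.

```python
import torch

# Illustrative shapes only: an LDM runs diffusion in a compressed latent space
# rather than in pixel space, which is what keeps the compute manageable.
# Assumed autoencoder: 8x spatial downsampling, 4 latent channels (as in Stable Diffusion).
image = torch.randn(1, 3, 512, 1024)   # one RGB frame in pixel space
latent = torch.randn(1, 4, 64, 128)    # its latent representation

print(image.numel())                   # 1,572,864 values per frame
print(latent.numel())                  # 32,768 values per frame
print(image.numel() / latent.numel())  # -> 48.0: far fewer values to denoise per step
```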

Here is an impressive example of personalized text-to-video generation, achieved by fine-tuning the model with DreamBooth on a small collection of images. From the provided set of images, the model learned to depict a frog playing the guitar in a band.

An LDM fine-tuned with DreamBooth on a small set of images for personalized text-to-video generation

Generative models have been receiving tremendous attention, with impressive breakthroughs in text, image, and text-to-image generation.

The most widely used generative models today are autoregressive transformers, generative adversarial networks (GANs), and diffusion models (DMs). DMs offer several advantages over the other types: they use relatively few parameters, scale well, and are computationally efficient.

Videos are more challenging than images because their frames must be correlated over time: models need to capture both spatial and temporal dependencies. At the same time, the scarcity of publicly available video datasets makes it hard to develop and evaluate video modeling algorithms.

The Nvidia research team has proposed a solution to these issues, known as Video LDMs: an effective technique for producing videos by taking pre-trained image DMs and turning them into video generators.

In this method, “latent” refers to the compressed latent space in which the diffusion process runs; the new ingredient is a time dimension added to the model. The crucial aspect of the approach is the set of temporal layers, which learn to align the individual frames in a coherent manner, ultimately generating a video.
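As an illustration of what such a temporal layer might look like, here is a minimal PyTorch-style sketch. The class name, shapes, and design are assumptions for illustration, not the paper’s exact architecture: each spatial position attends across the frames of a clip, so frames that an image backbone would otherwise process independently can be aligned over time.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Hypothetical temporal mixing layer: every spatial position attends
    across the T frames of a clip, which lets the model align frames in time."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (B*T, C, H, W) -- video frames stacked along the batch axis,
        # exactly how a frozen image backbone would process them.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Put time on the sequence axis: (B*H*W, T, C)
        seq = x.reshape(b, num_frames, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, num_frames, c)
        normed = self.norm(seq)
        seq = seq + self.attn(normed, normed, normed)[0]   # residual temporal attention
        # Back to the per-frame layout: (B*T, C, H, W)
        return seq.reshape(b, h * w, num_frames, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)

# Example: 2 clips of 8 frames with 64-channel feature maps of size 32x64
layer = TemporalAttention(channels=64)
features = torch.randn(2 * 8, 64, 32, 64)
out = layer(features, num_frames=8)     # same shape as the input
```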

During their investigation, the team focused on two important real-world video generation problems: high-resolution video synthesis of real-world driving data and text-guided video synthesis for creative content generation.

Methodology

The training process has three main steps:

  1. Pre-train on images only (or use a readily available pre-trained image LDM). The model learns to encode and decode static images, which gives it a strong foundation for learning to generate video data.
  2. Introduce a temporal dimension into the latent space of the diffusion model. Only the temporal layers are trained, on encoded frame sequences, while the pre-trained spatial layers are kept frozen (see the parameter-freezing sketch after this list).
  3. Fine-tune the LDM’s decoder to achieve temporal consistency in pixel space.
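Here is a minimal sketch of step 2’s parameter freezing, assuming a PyTorch model whose newly added temporal modules carry “temporal” in their parameter names. The function name and the naming convention are assumptions for illustration, not the authors’ code.

```python
import torch
import torch.nn as nn

def split_temporal_parameters(model: nn.Module):
    """Freeze the pre-trained spatial layers and return only the newly added
    temporal parameters, which are the ones optimized in step 2."""
    trainable = []
    for name, param in model.named_parameters():
        if "temporal" in name:            # new temporal layers: keep trainable
            param.requires_grad = True
            trainable.append(param)
        else:                             # pre-trained image (spatial) layers: frozen
            param.requires_grad = False
    return trainable

# Usage with a hypothetical video U-Net:
# optimizer = torch.optim.AdamW(split_temporal_parameters(unet), lr=1e-4)
```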

The figure below shows some details of the Video LDM framework.

At the top, it shows the video fine-tuning of the decoder: the per-frame encoder is frozen, while the decoder is fine-tuned for video generation.

At the bottom is the latent diffusion model, which is trained to generate high-quality latent variables that the decoder can transform into realistic images. The training set typically consists of high-quality images.

Latent diffusion model framework and video fine-tuning of decoder
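A simplified sketch of what this decoder fine-tuning could look like in PyTorch follows. The function, shapes, and plain per-frame reconstruction loss are illustrative assumptions standing in for the full training objective; the key point from the figure is that the encoder is frozen and only the decoder receives gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def decoder_finetune_loss(encoder: nn.Module, decoder: nn.Module,
                          video: torch.Tensor) -> torch.Tensor:
    """Illustrative decoder fine-tuning step: the per-frame encoder stays
    frozen, only the decoder is updated. `video` has shape (B, T, C, H, W);
    a simple reconstruction term stands in for the full objective."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)   # process the clip as a batch of frames
    with torch.no_grad():                    # frozen per-frame encoder
        latents = encoder(frames)
    recon = decoder(latents)                 # trainable decoder
    return F.mse_loss(recon, frames)
```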

Results

The method was tested on real-world driving videos at 512 x 1024 pixels and achieved state-of-the-art video quality. The authors also video fine-tuned a publicly available text-to-image LDM and turned it into an efficient and powerful text-to-video generator with a resolution of up to 1280 x 2048 pixels (datasets: WebVid-10M & Mountain Bike).

On their project page, the researchers show several examples that look sharp and high quality. Hovering the mouse over a video reveals the prompt used to generate it.

An example of text-to-video generation can be seen below. It comprises 113 frames at a resolution of 1280 x 2048 pixels, rendered at 24 frames per second, resulting in a clip about 4.7 seconds long.

Conclusion

The researchers propose an efficient method for video generation using pre-trained image DMs. They introduce temporal layers that align images in a consistent way and create sharp, high-quality videos.

Through this approach, the publicly available Stable Diffusion text-to-image LDM was transformed into a powerful and expressive text-to-video LDM.

Learn more:
