Text2Video-Zero: text-to-video diffusion models without any training or optimization

Text2Video-Zero is a new low-cost approach that produces videos from text prompts without optimization or fine-tuning. It leverages existing text-to-image synthesis models, such as Stable Diffusion, and adapts them to video generation without any additional training.

Text-to-Video generation: “a horse galloping on a street”
Text-to-Video generation: “a panda is playing guitar on times square”

The model

In order to create videos instead of still images, Stable Diffusion must process sequences of latent codes. Sampling a latent code for each frame independently, however, produces frames that are entirely arbitrary, with no coherence in either object appearance or motion.

To address this issue, the researchers proposed enriching the latent codes with motion dynamics, in two steps:

First, they add motion information to the latent codes of the generated frames, keeping the global scene and the background temporally consistent: each frame's latent code is derived from the first frame's latent by a global translation, so the whole sequence shares one coherent scene.
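The motion-dynamics step can be sketched in a few lines of numpy. This is an illustrative toy, not the authors' implementation: the function name, the `lam` strength parameter, and the circular shift (standing in for the dense warping used in practice) are all assumptions.

```python
import numpy as np

def apply_motion_dynamics(first_latent, num_frames, delta=(1, 0), lam=1.0):
    """Build a sequence of latent codes by warping the first frame's latent
    with a cumulative global translation (hypothetical sketch).

    first_latent: (C, H, W) latent code of frame 1
    delta:        per-frame translation (dy, dx) in latent pixels (assumed)
    lam:          motion-strength hyperparameter (assumed name)
    """
    latents = [first_latent]
    for k in range(1, num_frames):
        dy = int(round(lam * k * delta[0]))
        dx = int(round(lam * k * delta[1]))
        # Circular shift stands in for proper warping; the point is that
        # every frame is a translated copy of the SAME initial latent.
        latents.append(np.roll(first_latent, shift=(dy, dx), axis=(1, 2)))
    return np.stack(latents)  # (num_frames, C, H, W)
```

Because every frame starts from the same latent, the denoised frames depict one scene rather than a set of unrelated images.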

Second, they employ a cross-frame attention mechanism in which every frame attends to the first frame, retaining the identity and appearance of the foreground object across the sequence.
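Cross-frame attention is a small change to ordinary self-attention: each frame keeps its own queries, but the keys and values are taken from the first frame. A minimal numpy sketch, assuming single-head attention and made-up tensor shapes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q_frames, k_frames, v_frames):
    """Attention where every frame's queries attend to the FIRST frame's
    keys/values, tying object appearance to frame 1 (illustrative sketch).

    q_frames, k_frames, v_frames: (F, N, D) — frames, tokens, channels
    """
    k0, v0 = k_frames[0], v_frames[0]  # anchor all frames on frame 1
    d = q_frames.shape[-1]
    outputs = []
    for q in q_frames:
        weights = softmax(q @ k0.T / np.sqrt(d))  # (N, N)
        outputs.append(weights @ v0)
    return np.stack(outputs)  # (F, N, D)
```

In the full model this substitution happens inside the U-Net's attention layers; since keys and values are shared, the foreground object cannot drift in appearance from frame to frame.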

Overview of Text2Video-Zero combined with ControlNet

Results

By combining Text2Video-Zero with the ControlNet and DreamBooth diffusion models, the authors were able to produce high-quality, consistent videos at minimal cost.

Despite not being trained on additional video data, the approach performs comparably to or better than existing methods. Its effectiveness was also demonstrated on instruction-guided video editing with Instruct-Pix2Pix.

Conclusion

The Text2Video-Zero pipeline opens up novel possibilities for generating and editing videos. By eliminating the need for optimization or fine-tuning, it makes video generation accessible and cost-effective for everyone.

Furthermore, it can be applied to various tasks beyond text-to-video generation, including content-specialized video generation, as well as instruction-guided video editing.
