Make-An-Animation is a new text-to-motion generation model that creates realistic and diverse 3D human motion sequences from text descriptions. For example, if you type “a person is chopping down a tree with an axe”, the model generates 3D poses that can be rendered into a video of a person chopping down a tree with an axe.
The model was proposed by a research team from Meta AI and marks a significant advance in text-conditioned human motion generation.
It outperforms existing state-of-the-art models, particularly on diverse and challenging real-world text prompts.
It leverages a U-Net architecture, incorporating temporal convolution and attention layers, and was trained on a large-scale Text Pseudo-Pose (TPP) dataset.

Recently, text-conditioned generative models, especially diffusion models, have become able to create high-quality, realistic images and videos.
However, existing text-to-motion models are trained on small, limited datasets of human movement, which makes them perform poorly in real-world scenarios. These datasets lack the variety and naturalness of real human movement. For example, given the in-the-wild prompt “a person is flying like a bird”, such models cannot produce convincing motion because they have never seen similar examples before.
To overcome this limitation, the research team built the large-scale TPP dataset: 35M (text, static pose) pairs extracted from diverse image-text datasets, keeping only the images that feature humans.
Each (text, pose) pair consists of the text (a sentence describing what the person in the image is doing or wearing) and the pose (the person’s 3D body posture extracted from that single image).
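To make the data format concrete, here is a minimal sketch of what a single TPP record could look like; the class and field names are illustrative assumptions, not the paper’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional, List

import numpy as np

@dataclass
class TextPseudoPosePair:
    """One hypothetical TPP record; field names and shapes are illustrative,
    not the paper's exact schema."""
    text: str                           # caption describing the person in the image
    body_pose: Optional[np.ndarray]     # 3D body pose estimated from the image (None if no person found)
    root_orient: Optional[np.ndarray]   # global orientation of the body
    root_trans: Optional[np.ndarray]    # global position of the body in the frame

def keep_human_images(records: List[TextPseudoPosePair]) -> List[TextPseudoPosePair]:
    """Keep only image-text records where a person (and hence a pose) was detected."""
    return [r for r in records if r.body_pose is not None]
```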
The model architecture and pipeline
The model is based on a U-Net architecture (see the figure below).

It has three major components (a sketch of how they fit together follows the list):
- A text encoder based on the T5 architecture (it converts the input text into an embedding that captures its meaning)
- A text-to-3D-pose diffusion model trained on the TPP dataset (it produces a sequence of human poses that match the text)
- Additional temporal convolution and attention layers (they propagate information across frames to produce smooth, coherent motion)
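As a rough PyTorch sketch of how these three components could plug together at inference time: the wrapper class, the U-Net interface, and the use of a generic DDPM sampler below are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
from diffusers import DDPMScheduler
from transformers import AutoTokenizer, T5EncoderModel

class TextToMotionPipeline(nn.Module):
    """Hypothetical wiring of the three components; names and shapes are assumptions."""

    def __init__(self, unet: nn.Module, num_inference_steps: int = 50):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained("t5-base")
        self.text_encoder = T5EncoderModel.from_pretrained("t5-base")
        self.unet = unet  # U-Net with temporal convolution and attention layers
        self.scheduler = DDPMScheduler()
        self.num_inference_steps = num_inference_steps

    @torch.no_grad()
    def generate(self, prompt: str, num_frames: int, pose_dim: int) -> torch.Tensor:
        # 1) T5 text encoder: prompt -> conditioning embedding
        tokens = self.tokenizer(prompt, return_tensors="pt")
        cond = self.text_encoder(**tokens).last_hidden_state

        # 2) + 3) diffusion over a pose sequence, denoised by the temporal U-Net
        x = torch.randn(1, num_frames, pose_dim)  # start from Gaussian noise
        self.scheduler.set_timesteps(self.num_inference_steps)
        for t in self.scheduler.timesteps:
            noise_pred = self.unet(x, t, cond)                    # predict the noise at step t
            x = self.scheduler.step(noise_pred, t, x).prev_sample  # one denoising update
        return x  # (1, num_frames, pose_dim): one pose per frame
```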
The approach represents an avatar’s motion as a sequence of body poses P over N frames, [P1; P2; …; PN]. Each pose Pi contains information about 21 body joints, the root orientation, and the global position of the avatar in the frame.
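Concretely, a motion clip under this representation can be stored as a simple per-frame feature matrix. The frame count and the per-joint axis-angle parameterization below are assumptions for illustration, not necessarily the paper’s exact choices.

```python
import numpy as np

N_FRAMES = 64   # N: length of the motion sequence (example value)
N_JOINTS = 21   # body joints per pose, as described above
ROT_DIM = 3     # assuming one axis-angle rotation per joint (an assumed parameterization)

# One motion clip [P1; P2; ...; PN]: each row holds a single pose Pi.
joint_rotations  = np.zeros((N_FRAMES, N_JOINTS, ROT_DIM))  # 21 body-joint rotations
root_orientation = np.zeros((N_FRAMES, ROT_DIM))            # global orientation of the body
global_position  = np.zeros((N_FRAMES, 3))                  # translation of the avatar in the frame

# Flattened per-frame feature vector, as a diffusion model would typically consume it.
pose_dim = N_JOINTS * ROT_DIM + ROT_DIM + 3
motion = np.concatenate(
    [joint_rotations.reshape(N_FRAMES, -1), root_orientation, global_position], axis=1
)
assert motion.shape == (N_FRAMES, pose_dim)
```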

The network was first trained to generate static 3D poses from text and was then fine-tuned on motion data to learn the temporal dimension; the temporal layers were added in this motion fine-tuning stage.
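Below is a hedged sketch of what such an added temporal layer might look like in PyTorch; the layer composition, placement, and initialization are assumptions rather than the paper’s exact design.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """New layers added for motion fine-tuning (illustrative sketch, not the paper's exact design)."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels) -- features for the whole pose sequence
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # temporal convolution across frames
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]     # temporal self-attention across frames
        return x

# Stage 1: train the spatial U-Net on (text, static pose) pairs from TPP.
# Stage 2: insert a TemporalBlock after each spatial block and fine-tune on motion data.
```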
Evaluation
The model was evaluated with automatic metrics and human evaluation on 400 text prompts and compared against state-of-the-art baselines. The team also conducted ablation studies to analyze the impact of the model’s individual components and of the TPP dataset.
The Make-An-Animation method surpasses previous approaches in generating realistic poses and aligning them with the provided text.
The results also demonstrate the importance of the large-scale TPP dataset for human motion generation, helping the model move beyond the limitations of motion-capture datasets.
Conclusion
This work proposes a new U-Net architecture that enables a smooth transition between static pose pre-training and dynamic pose fine-tuning.
It opens up new possibilities for using large-scale image and video datasets to learn to generate 3D human pose parameters, overcoming the size limitations of existing motion datasets.
In future work, the model can be extended to generate human motion from other modalities, such as speech or video, enabling more natural and multimodal communication.