GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents (SIGGRAPH 2023 technical paper awards)

August 13, 2023

GestureDiffuCLIP is a new framework that creates realistic and expressive body movements matching the content of a speaker’s speech (co-speech gestures). The user can supply a style prompt, which can be a text description, a motion clip, or a video showing the desired gesture style.

GestureDiffuCLIP generates natural and diverse hand and arm motions that better express the emotions, attitudes, and intentions of the person speaking.

For example, you can use the following style prompts to generate different gestures for the same speech:

Text: “Be confident and assertive”
Motion clip: A clip of a TED talk speaker gesturing with enthusiasm and authority
Video: A video of a politician delivering a speech with strong hand gestures and facial expressions

You can also specify the style of individual body parts, such as “wave your left hand” or “nod your head”.

The picture below illustrates how different text prompts affect the co-speech gestures generated for the same speech.

For the same speech clip, GestureDiffuCLIP creates gestures with varying styles based on four different text prompts (Source: paper)

The model

GestureDiffuCLIP uses two main components:

A latent diffusion model (gesture generator). This is a neural network that takes speech audio and its transcript as input and outputs a sequence of joint angles that define a gesture.
A CLIP-based encoder (representation of style). The CLIP-guided mechanism extracts style representations from different input modalities, such as text, motion clips, or videos, and infuses them into the latent diffusion model via adaptive instance normalization (AdaIN) layers.

The two core components of GestureDiffuCLIP (Source: paper)

Together, the gesture generator and the CLIP representation of style generate stylized co-speech gestures that are both visually appealing and semantically correct.

The denoising network (part of the latent diffusion model)

The denoising network is responsible for generating realistic and diverse gestures from random noise, while incorporating the speech content information (see the figure below).

The denoising network is based on a multi-layer transformer that uses multi-head causal attention. Causal attention attends only to past and present information, never to the future. This is because co-speech gestures should stay synchronized with the speech and should not anticipate what will be said next.
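The causal masking idea can be illustrated with a minimal single-head sketch in plain Python. This is not the paper's implementation: the function name `causal_attention` and the list-of-lists layout are assumptions made for illustration, and a real model would use several heads, learned projections, and a tensor library.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of T vectors (lists of floats).
    Frame t may attend only to frames 0..t, never to later frames.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores against past and present frames only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the visible frames.
        peak = max(scores)
        exps = [math.exp(x - peak) for x in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

Because frame t never sees frames after t, changing a later frame leaves every earlier output unchanged, which is the property that keeps gestures from anticipating upcoming speech.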

The architecture of the denoising network (Source: paper)

The denoising network takes three inputs: the speech audio, the transcript of the speech, and the style prompt.

The speech audio and the transcripts of the speech are processed by two separate encoders that transform them into feature vectors (numerical representations that capture the meaning and characteristics of the inputs).
The style prompt is processed by one of three CLIP-based encoders (the text encoder, the motion encoder, or the video encoder). CLIP can associate images and texts that have similar meanings or styles. For example, it can recognize that a picture of a sunset and the word “romantic” have a similar style.
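CLIP-style encoders place inputs from different modalities in a shared embedding space, where related inputs end up close together. A minimal sketch of that comparison follows, assuming cosine similarity as the closeness measure. The helper names and the tiny two-dimensional vectors in the usage note are made up for illustration and are not real CLIP outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def closest_style(query, candidates):
    """Return the name of the candidate embedding most similar to the query.

    candidates: dict mapping a style name to its embedding vector.
    """
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))
```

For instance, with hypothetical embeddings `{"calm": [1.0, 0.0], "excited": [0.0, 1.0]}`, a query vector such as `[0.1, 0.9]` is matched to "excited" because it points in nearly the same direction.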

The multimodal features from these encoders (feature vectors from the speech audio, transcript, and style prompt) are then integrated into the denoising network at various stages through two types of layers:

The multi-head semantics-aware layers are used to combine the multimodal features in a way that is sensitive to the meaning of the speech. For example, the semantics-aware layers might be used to combine the audio and transcript features in a way that highlights the words that are being spoken.
The AdaIN layers are used to adjust the style of the generated gesture according to the style prompt. For example, if the style prompt is “excited,” the AdaIN layers might be used to make the generated gesture more energetic.
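The AdaIN operation itself is simple: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted with style-dependent values. The sketch below shows that computation in plain Python. It assumes the per-channel `style_scale` and `style_shift` have already been predicted from the style embedding by some learned mapping, and it normalizes over the time axis; both choices are assumptions for illustration rather than details confirmed by the paper.

```python
import math

def adain(features, style_scale, style_shift, eps=1e-5):
    """Adaptive Instance Normalization over a (frames x channels) sequence.

    features: list of T frames, each a list of C channel values.
    style_scale, style_shift: per-channel values, assumed to come from
    a learned mapping of the style embedding (hypothetical here).
    """
    num_frames = len(features)
    num_channels = len(features[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for c in range(num_channels):
        column = [frame[c] for frame in features]
        mean = sum(column) / num_frames
        var = sum((x - mean) ** 2 for x in column) / num_frames
        std = math.sqrt(var + eps)
        for t in range(num_frames):
            # Remove the content statistics, then apply the style statistics.
            out[t][c] = style_scale[c] * (column[t] - mean) / std + style_shift[c]
    return out
```

After the call, each output channel has a mean close to its `style_shift` and a standard deviation close to its `style_scale`, so the style prompt controls the overall statistics of the features while their frame-to-frame pattern is preserved.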

Finally, after passing through AdaIN, the feature vectors are fed into a decoder network that transforms them back into co-speech gestures. The decoder network outputs a 3D pose sequence that describes the positions and orientations of different body parts over time.

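The Feed-Forward Network (FFN) and Add & Norm blocks are the standard transformer building blocks: the FFN transforms each time step’s features independently, while Add & Norm applies a residual connection followed by layer normalization to keep training stable.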

Training

The system is trained on a dataset of co-speech gestures, along with their corresponding audio and transcript. The research team used two high-quality speech-gesture datasets: ZeroEGGS and BEAT. The GestureDiffuCLIP system is trained using a two-stage process:

The gesture generator is trained to generate realistic and diverse gestures from the speech audio and transcript.
The AdaIN layer is trained to modify the gesture features according to the style prompt.

Evaluation

The GestureDiffuCLIP system was evaluated on several criteria, including visual realism, semantic correctness, and style control.

For example, the character displays angry gestures when the text prompt is “the person is angry”. Given a Hip-hop music video as the prompt, the character mimics Hip-hop style gestures. The system can also pick up semantic information from a non-human video, such as trees swaying in the wind (see the picture below).

Gestures synthesized by GestureDiffuCLIP conditioned on three different types of style prompts (Source: paper)

The research team also conducted a user study to assess the human perception of the generated gestures. The results showed that GestureDiffuCLIP can effectively enhance co-speech gestures according to the style prompts for each speech sentence.

Based on various prompts, GestureDiffuCLIP can adjust the styles of different body parts individually, while keeping a natural harmony among the body parts (see the video below).

Video source: project page

Conclusion

GestureDiffuCLIP is a powerful tool that generates stylized co-speech gestures for a variety of purposes.

It can be used to improve the communication of speakers, to make presentations more engaging, or to create more expressive characters in virtual worlds.