AniPortrait is a new framework that creates dynamic and expressive animated portraits from audio inputs and a reference portrait image.
With AniPortrait, a static image can be animated into a lively, speaking character that mimics the nuances of human expression, all synchronized with an audio track.
Developed by researchers at Tencent, the tool has potential applications ranging from virtual reality and gaming to the broader digital media landscape. The model is open source: code and weights are available in the GitHub repository, and an online demo can be tried on Hugging Face.
Method
Generating animations from audio and static images is challenging because lip movements, facial expressions, and head poses must all be precisely aligned with the sound. Recent advances in diffusion models, particularly those enhanced with temporal modules, open up new possibilities for creating lifelike animated portraits.
AniPortrait adopts a network architecture derived from Animate Anyone, which builds on Stable Diffusion 1.5 to produce high-quality videos from a reference image and a sequence of body motions.
AniPortrait consists of 2 modules:
- Audio2Lmk (Audio to Facial Landmarks) extracts the speaker’s facial expressions and lip movements from the audio input as 3D intermediate representations, using the pre-trained wav2vec 2.0 model, and then projects them into a sequence of 2D facial landmarks. Whether you say a single word or laugh heartily, Audio2Lmk captures the corresponding changes in expression and lip movement frame by frame (a minimal sketch of this stage follows the list).
- Lmk2Video (Landmarks to Video) generates a high-quality portrait video from the 2D facial landmark sequence and the reference portrait image, using a diffusion model together with a motion module.
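To make the Audio2Lmk idea concrete, here is a minimal sketch of the first half of the pipeline: wav2vec 2.0 extracts frame-level audio features, and a small regression head maps them to per-frame landmark coordinates. For simplicity the sketch regresses 2D landmarks directly, whereas AniPortrait first predicts a 3D representation and then projects it to 2D; the `LandmarkHead` name, its architecture, and the landmark count are illustrative assumptions, not the project's actual code.

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Pre-trained wav2vec 2.0 as the audio feature extractor (frozen here).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
wav2vec = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

class LandmarkHead(nn.Module):
    """Hypothetical head mapping audio features to per-frame 2D landmarks."""
    def __init__(self, in_dim=768, n_landmarks=468):  # 468 = MediaPipe face mesh points
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),
            nn.Linear(512, n_landmarks * 2),  # (x, y) per landmark
        )

    def forward(self, feats):                 # feats: (B, T, in_dim)
        out = self.mlp(feats)                 # (B, T, n_landmarks * 2)
        return out.view(out.shape[0], out.shape[1], -1, 2)

# Dummy 1-second clip at 16 kHz; replace with real speech audio.
waveform = np.zeros(16000, dtype=np.float32)
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    audio_feats = wav2vec(inputs.input_values).last_hidden_state  # (1, T, 768)
    landmarks = LandmarkHead()(audio_feats)   # (1, T, 468, 2); head is untrained, shown for shape flow only
```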
The resulting portrait animation is a sequence of remarkably realistic and temporally smooth frames, free of abrupt transitions that could disrupt the viewer’s experience.
In addition to the landmark sequence derived from the audio, the team also incorporates landmarks extracted from the reference portrait itself. This additional input helps the network create animations that stay closely aligned with the subject’s natural movements.
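One way to picture how Lmk2Video consumes these landmarks is to rasterize each landmark frame (and the reference image's own landmarks) into an image that conditions the diffusion backbone through the pose guider. The sketch below shows such a rasterization; the drawing style (white dots on a black canvas) and the helper name `render_landmark_frames` are assumptions for illustration, not the project's actual rendering code.

```python
import numpy as np
import cv2

def render_landmark_frames(landmark_seq, size=512):
    """Rasterize normalized 2D landmarks (T, N, 2) into conditioning images.

    Each frame becomes a (size, size, 3) image with one dot per landmark;
    the sequence can then be fed to a pose-guider-style conditioning module.
    """
    frames = []
    for lmks in landmark_seq:                          # lmks: (N, 2) in [0, 1]
        canvas = np.zeros((size, size, 3), dtype=np.uint8)
        for x, y in lmks:
            cv2.circle(canvas, (int(x * size), int(y * size)), 2, (255, 255, 255), -1)
        frames.append(canvas)
    return np.stack(frames)                            # (T, size, size, 3)

# Example: 8 frames of 468 landmarks at random (dummy) positions.
pose_images = render_landmark_frames(np.random.rand(8, 468, 2))
print(pose_images.shape)  # (8, 512, 512, 3)
```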
Training
The model is trained in two stages:
1. In the Audio2Lmk stage, wav2vec 2.0 audio features are mapped to a 3D representation of the face, with MediaPipe supplying the corresponding facial geometry from the training footage (a sketch of this landmark extraction follows the list). Training data: an internal dataset comprising nearly an hour of high-quality speech from a single speaker, used for the Audio2Mesh component.
2. In the Lmk2Video stage, the system learns to turn the landmark sequences into realistic portrait videos, in 2 steps:
- train the 2D components of ReferenceNet and PoseGuider, excluding the motion module;
- train the motion module, with the previously trained components frozen. Training data for both steps: 2 high-quality facial video datasets, VFHQ and CelebV-HQ.
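As an illustration of the landmark side of the training data, the sketch below extracts MediaPipe face landmarks from a single video frame. The function name, the chosen options, and the file path are assumptions; the project's actual preprocessing may differ.

```python
import cv2
import mediapipe as mp
import numpy as np

def extract_face_landmarks(frame_bgr):
    """Return an (N, 3) array of normalized (x, y, z) face landmarks, or None."""
    with mp.solutions.face_mesh.FaceMesh(
        static_image_mode=True,        # treat each frame independently
        max_num_faces=1,
        refine_landmarks=True,         # adds iris landmarks
    ) as face_mesh:
        result = face_mesh.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if not result.multi_face_landmarks:
            return None
        points = result.multi_face_landmarks[0].landmark
        return np.array([[p.x, p.y, p.z] for p in points])

# Example: run on one frame read from a training video (hypothetical path).
frame = cv2.imread("frame_0001.png")
if frame is not None:
    landmarks = extract_face_landmarks(frame)
```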
To ensure consistency during training, all images are resized to a standard resolution of 512×512. Training is conducted on 4 A100 GPUs, with each stage taking about two days.
Experimental results
As illustrated in the next picture, AniPortrait produces animation sequences that are remarkable for their high quality and lifelike detail.
Conclusion
AniPortrait represents a significant contribution to the field of facial animation. Its ability to generate photorealistic animations from audio input opens doors for innovative applications in various domains.
Yet this approach requires extensive, high-quality 3D datasets, which are expensive to obtain. Also, the generated portrait videos may exhibit artifacts associated with the uncanny valley effect, where imperfections in realism can trigger a sense of discomfort in viewers.
Moving forward, the team aims to follow the approach of EMO and predict portrait videos directly from audio, without the need for intermediate 3D representations or facial landmarks.
Learn more:
- Paper on arXiv: “AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation”
- Repository (code and model weights)