LivePortrait is an AI-powered tool that creates lifelike animations from portraits. Simply provide a photo of a person and a reference video of someone moving their face. The model then maps the video’s expressions onto the photo, resulting in a seamless and realistic animation.
It allows you to precisely control the movements of the eyes and lips, while preserving the identity and likeness of the original image.
The model can process and animate a variety of image formats, including color and black-and-white photos, as well as paintings and animated styles. LivePortrait can even animate multiple faces within a single image and bring animal photos to life.
The model is free and open-source. You can find the official implementation in the repository. Try the online demo to animate your own images using just a photo and a video.
LivePortrait’s key features:
- Video-driven portrait animation
- Implicit-keypoint-based framework, which allows precise manipulation of facial features while keeping the computational cost low
- Stitching and retargeting modules: a stitching module, an eyes retargeting module, and a lip retargeting module, which together provide fine-grained control with minimal computational overhead
What is stitching?
Stitching is the step that seamlessly pastes the animated face back into the original picture. LivePortrait animates a cropped face region frame by frame; the stitching module then blends each animated crop back into the full source image so that no seams or pixel misalignments appear around the face and shoulders. This is also what makes it possible to work with larger images and to animate several faces in one picture.
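To make the paste-back idea concrete, here is a minimal NumPy sketch of blending an animated face crop into the full-resolution source image with a soft mask. The function and its arguments are illustrative only; LivePortrait's actual stitching module is, roughly speaking, a small learned component that also nudges the keypoints so the crop lines up before it is composited.

```python
import numpy as np

def paste_back(original, animated_crop, top, left, mask):
    """Blend an animated face crop back into the full source image.

    original:      (H, W, 3) float array, the untouched source portrait
    animated_crop: (h, w, 3) float array, the generated face region
    top, left:     corner where the crop was originally taken
    mask:          (h, w, 1) array in [0, 1], soft near the edges so the
                   paste leaves no visible seam
    """
    out = original.astype(np.float32).copy()
    h, w = animated_crop.shape[:2]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = mask * animated_crop + (1.0 - mask) * region
    return out
```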
What is retargeting?
Through retargeting, the facial movements are transferred from the input video to the static portrait. This process involves mapping corresponding points on both the input video and the static portrait to ensure an accurate and natural-looking animation.
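As a toy illustration of retargeting, the sketch below transfers only the relative change in eye openness from the driving face onto the source keypoints instead of copying the driving eye shape outright. The keypoint indices, function name, and arguments are hypothetical, not LivePortrait's actual API.

```python
import numpy as np

# Hypothetical eyelid keypoint indices, chosen only for illustration.
UPPER_LID, LOWER_LID = 11, 19

def retarget_eye_openness(source_kp, src_open, drv_open, strength=1.0):
    """Shift the source eyelids by the *change* in openness seen in the
    driving frame, so a small-eyed driver does not shrink large source eyes."""
    delta = (drv_open - src_open) * strength
    adjusted = np.array(source_kp, dtype=np.float32)  # work on a copy
    adjusted[UPPER_LID, 1] -= delta / 2.0             # raise the upper lid
    adjusted[LOWER_LID, 1] += delta / 2.0             # lower the bottom lid
    return adjusted
```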
LivePortrait uses key facial points to achieve high speed and better control
Rather than relying on pixel-level transformations, as diffusion-based models do, LivePortrait drives the animation with key facial points, producing smooth and natural facial expressions. This approach offers superior speed and control.
Thanks to this efficient keypoint-based architecture, it achieves a generation speed of 12.8 milliseconds per frame on an RTX 4090 GPU. Despite being so fast, the model produces slightly better animations than diffusion-based methods, which are known for being computationally heavy and considerably slower.
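To put 12.8 ms per frame in perspective, a quick back-of-the-envelope conversion shows the model runs well above common video frame rates:

```python
ms_per_frame = 12.8                       # reported speed on an RTX 4090
fps = 1000 / ms_per_frame
print(f"{fps:.0f} frames per second")     # ~78 FPS, vs. 24-30 FPS for typical video
```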
The model
LivePortrait builds on the face vid2vid framework (a neural talking-head video synthesis model developed by NVIDIA) and incorporates several significant enhancements that increase the generalization ability and expressiveness of animation.
- High-quality data curation: uses 69M video frames (filtered down from 92M) from about 18.9K identities, plus 60K static stylized portraits.
- Mixed image and video training strategy: combines realistic and stylized still images with existing video data to improve the model’s ability to animate various styles of faces.
- Upgraded network architecture: unifies several components into a single model (M) that directly predicts the keypoints, head pose, and expression deformation.
Other improvements include the use of a more powerful generator, the SPADE decoder, which surpasses the original decoder in face vid2vid. Additionally, a PixelShuffle layer has been incorporated to increase the resolution from 256×256 to 512×512.
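The exact layer configuration is not spelled out here, but the PixelShuffle operation itself is a standard building block that trades channels for spatial resolution. A minimal PyTorch example:

```python
import torch
import torch.nn as nn

# PixelShuffle rearranges (N, C*r^2, H, W) into (N, C, H*r, W*r).
upsample = nn.PixelShuffle(upscale_factor=2)

x = torch.randn(1, 3 * 2 ** 2, 256, 256)   # 12 channels at 256x256
y = upsample(x)
print(y.shape)                              # torch.Size([1, 3, 512, 512])
```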
The model has 4 main components (see the picture below); a toy sketch of the dataflow follows the list:
- Appearance Feature Extractor: captures the look of the still photo.
- Motion Extractor: analyses the driving video and predicts the facial keypoints, head pose, and expression deformation.
- Warping Field Estimator: calculates how to move the source's facial features to follow the driving motion.
- Generator: renders the final animated image.
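The sketch below wires toy stand-ins for these four components into an inference loop, purely to show the dataflow. None of the functions are LivePortrait's real modules; the motion and warping field they produce are dummies (an identity sampling grid instead of a learned displacement field).

```python
import torch
import torch.nn.functional as F

def extract_appearance(image):                    # Appearance Feature Extractor (stand-in)
    return image                                  # pretend the features are just the pixels

def extract_motion(image, n_kp=21):               # Motion Extractor (stand-in)
    return torch.zeros(image.shape[0], n_kp, 2)   # dummy keypoints; the real one also gives pose and expression

def estimate_flow(src_kp, drv_kp, size):          # Warping Field Estimator (stand-in)
    h, w = size                                   # a real model predicts displacements from the keypoints;
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    return torch.stack([xs, ys], dim=-1).unsqueeze(0)  # here: an identity sampling grid

def generate(warped):                             # Generator (stand-in)
    return warped.clamp(0, 1)

def animate(source, driving_frames):
    feats = extract_appearance(source)
    src_kp = extract_motion(source)
    frames = []
    for frame in driving_frames:
        drv_kp = extract_motion(frame)
        flow = estimate_flow(src_kp, drv_kp, feats.shape[-2:])
        warped = F.grid_sample(feats, flow, align_corners=True)
        frames.append(generate(warped))
    return frames

out = animate(torch.rand(1, 3, 256, 256), [torch.rand(1, 3, 256, 256)] * 3)
print(len(out), out[0].shape)                     # 3 torch.Size([1, 3, 256, 256])
```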
Training
The model’s training follows 2 stages:
Stage I: base model training. They train the entire model, including appearance, motion, warping, and decoding components, from scratch. This phase establishes the model’s fundamental understanding of facial images and videos.
Stage II: stitching and retargeting. They only train the stitching and retargeting modules while keeping other parameters frozen. The stitching module takes the animated face and places it back into the original image, ensuring there is no pixel misalignment, such as in the shoulder region. This also allows for handling larger images and animating multiple faces at once. The eyes and lip retargeting modules ensure that eyes and lip movements look natural. For example, if the person in the driving video has small eyes and the person in the original image has larger eyes, the eye retargeting module adjusts the eye sizes accordingly.
In short, the first stage builds a strong overall model, while the second stage specializes the model for precise adjustments, resulting in a more refined and accurate final output.
The training used 8 NVIDIA A100 GPUs. The first stage took approximately 10 days to train the model from scratch. The second training stage took approximately 2 days.
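To make the stage-II setup concrete, here is a minimal PyTorch sketch of freezing the stage-I weights and handing only the stitching and retargeting parameters to the optimizer. The module names are placeholders, not LivePortrait's actual classes.

```python
import torch
import torch.nn as nn

def stage_two_parameters(base_model: nn.Module, stitching: nn.Module,
                         eyes_retarget: nn.Module, lip_retarget: nn.Module):
    """Freeze the stage-I model and return only the stage-II parameters."""
    for p in base_model.parameters():
        p.requires_grad = False                   # stage-I weights stay frozen
    trainable = []
    for module in (stitching, eyes_retarget, lip_retarget):
        trainable += list(module.parameters())    # only these small modules get updated
    return trainable

# Usage sketch (all modules are placeholders):
# optimizer = torch.optim.Adam(
#     stage_two_parameters(base, stitching, eyes_retarget, lip_retarget), lr=1e-4)
```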
Evaluation
LivePortrait was compared with several non-diffusion-based methods (FOMM, Face vid2vid, DaGAN, MCNet, TPSM) and diffusion-based models (FADM, AniPortrait). Various benchmarks were used to measure the generation quality and motion accuracy of the portrait animation results.
The experiments show that LivePortrait slightly outperforms previous diffusion-based methods, such as FADM and AniPortrait, in generation quality and demonstrates better eye motion accuracy than other methods.
The following image illustrates a qualitative comparison of self-reenactment. This process involves animating a single photo by applying the movements from a video of the same person. LivePortrait effectively transfers facial expressions, including subtle movements like eye blinks and lip movements, while preserving the identity of the source portrait. It outperforms previous methods in both image quality and motion transfer accuracy.
The next image presents qualitative comparisons of cross-reenactment. This technique animates a person's face using movements from a video of a different individual. LivePortrait transfers lip movements and eye gaze from the driving person more faithfully, delivering high-quality results even under challenging conditions.
Limitations
The new model faces challenges in cross-reenactment scenarios with large pose variations. Additionally, significant shoulder movements in the driving video can sometimes cause jitter.
Conclusion
LivePortrait transforms static images into realistic animations using an improved version of the face vid2vid framework. Additionally, it features a user-friendly interface, making it accessible for creating professional-quality animated content across various domains, such as social media, entertainment, marketing, and education.