InstantID generates identity-preserving images in seconds

InstantID is a fast method for generating customized human faces with various poses or styles, using only one reference ID image and a text prompt.

In just a few seconds, it can transform your appearance, change your hairstyle, swap clothing styles, or even adjust your environment, all while retaining your distinctive identity.

InstantID is open source; the repository includes a guide for running it locally with Gradio, links to an online demo on Replicate, and integrations with the Stable Diffusion web UI and ComfyUI.

InstantID stylized synthesis (source: paper)

The model uses an ID embedding, which preserves the identity of the reference image while allowing the style to change easily. It also introduces a lightweight Image Adapter and an IdentityNet that improve how the cross-attention layers and the diffusion model incorporate this identity information.

InstantID outperforms existing methods that rely on CLIP embeddings or multiple reference images, demonstrating high fidelity, editability, and compatibility across scenarios. It achieves state-of-the-art results in image personalization tasks such as style transfer, age progression, and face animation, without requiring any fine-tuning.

It is compatible with popular pre-trained text-to-image diffusion models, serving as an adaptable plug-in in real-world applications.

Pipeline

The method consists of three key elements, as shown in the next figure:

  1. A face encoder (used instead of CLIP) that extracts semantic information (such as the eyes, nose, and mouth) from the reference ID image and turns it into text-like features (the face embedding); see the extraction sketch after the figure below.
  2. An adaptive module that uses decoupled cross-attention to allow images as style prompts. For instance, you can choose an image of a famous person, a comic character, or an artwork as a cue, and the module applies it to modify the style of your face. It enables image-based visual guidance.
  3. An IdentityNet that encodes the fine-grained features of the reference facial image with additional spatial control. This module relies solely on the face embedding, without any text input.
The overall pipeline of InstantID (source: paper)
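
To make the first element concrete, here is a minimal sketch of the embedding extraction using the insightface library, which the InstantID demo code builds on; the model pack name and file name here are illustrative, not prescriptive:

```python
import cv2
from insightface.app import FaceAnalysis

# Load a pre-trained face detection + recognition model pack
app = FaceAnalysis(name="antelopev2")
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("reference_face.jpg")  # placeholder file name (BGR image)
face = app.get(image)[0]                  # take the first detected face
id_embedding = face["embedding"]          # 512-dim identity vector
keypoints = face["kps"]                   # 5 facial landmarks (x, y)
```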

The pipeline of InstantID is as follows (a condensed code sketch follows the diagram):

  1. The user provides a single facial image and a textual prompt describing the desired style or attribute for the image generation.
  2. The facial image is passed through a pre-trained face recognition model to extract a vector representation of the identity of the person in the image (the ID embedding).
  3. The ID embedding is then fed into IdentityNet, a convolutional network that imposes strong semantic and weak spatial conditions on the generation process. IdentityNet also takes the facial landmarks as input and outputs a conditioned ID embedding that preserves the identity while incorporating the style or attribute information.
  4. The conditioned ID embedding is then concatenated with a noise vector sampled from a Gaussian distribution and passed through a pre-trained text-to-image diffusion model, such as SD1.5 or SDXL. The UNet acts as a feature extractor that helps InstantID preserve the identity of the given face image.
  5. The generated image is then returned to the user.
Simplified pipeline diagram
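
For illustration, the whole pipeline can be sketched in a few lines, loosely following the usage example in the InstantID repository (`StableDiffusionXLInstantIDPipeline` and `draw_kps` come from the repository's pipeline module; checkpoint paths and file names are placeholders):

```python
import cv2
import torch
import numpy as np
from PIL import Image
from diffusers.models import ControlNetModel
from insightface.app import FaceAnalysis
# Both names below ship with the InstantID repository
from pipeline_stable_diffusion_xl_instantid import (
    StableDiffusionXLInstantIDPipeline,
    draw_kps,
)

# Steps 1-2: extract the ID embedding from the reference image
app = FaceAnalysis(name="antelopev2")
app.prepare(ctx_id=0, det_size=(640, 640))
face_image = Image.open("reference_face.jpg")  # placeholder file name
face_info = app.get(cv2.cvtColor(np.array(face_image), cv2.COLOR_RGB2BGR))[0]
id_embedding = face_info["embedding"]
face_kps = draw_kps(face_image, face_info["kps"])  # step 3: landmark image for IdentityNet

# Steps 3-4: IdentityNet (a ControlNet) conditions a frozen SDXL UNet
controlnet = ControlNetModel.from_pretrained(
    "checkpoints/ControlNetModel", torch_dtype=torch.float16  # placeholder path
)
pipe = StableDiffusionXLInstantIDPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_ip_adapter_instantid("checkpoints/ip-adapter.bin")  # placeholder path

# Step 5: generate the stylized, identity-preserving image
result = pipe(
    "comic-style portrait of a person",  # example text prompt
    image_embeds=id_embedding,
    image=face_kps,
).images[0]
result.save("result.png")
```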

The cross-attention layers use the ID embedding instead of text prompts. This way, the network attends only to ID-related features and is not affected by vague descriptions of the face or background.
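
To illustrate, here is a toy, single-head version of the decoupled cross-attention idea (a hypothetical class, not the authors' code): text and ID features get separate key/value projections, and in IdentityNet the ID embedding is fed in where text features would normally go.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Toy single-head sketch: separate K/V projections for text and ID features."""
    def __init__(self, dim: int = 64, id_scale: float = 1.0):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k_text = nn.Linear(dim, dim)
        self.to_v_text = nn.Linear(dim, dim)
        self.to_k_id = nn.Linear(dim, dim)
        self.to_v_id = nn.Linear(dim, dim)
        self.id_scale = id_scale

    def attend(self, q, k, v):
        # Scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return scores.softmax(dim=-1) @ v

    def forward(self, hidden_states, text_embeds, id_embeds):
        q = self.to_q(hidden_states)
        # Text branch: ordinary prompt conditioning (IdentityNet would pass the
        # ID embedding here instead of text features)
        text_out = self.attend(q, self.to_k_text(text_embeds), self.to_v_text(text_embeds))
        # ID branch: attends only to identity-related features
        id_out = self.attend(q, self.to_k_id(id_embeds), self.to_v_id(id_embeds))
        return text_out + self.id_scale * id_out

# Smoke test with random tensors
attn = DecoupledCrossAttention()
out = attn(torch.randn(1, 16, 64), torch.randn(1, 8, 64), torch.randn(1, 4, 64))
print(out.shape)  # torch.Size([1, 16, 64])
```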

Training

InstantID builds upon a pre-trained text-to-image model by incorporating the three additional components mentioned above: the face encoder, the adaptive module, and the IdentityNet.

During training, only the Image Adapter and the IdentityNet parameters are updated. The pre-trained diffusion model parameters are frozen.
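
In PyTorch terms, that setup looks roughly like this (placeholder modules stand in for the real networks):

```python
import itertools
import torch
import torch.nn as nn

# Placeholders standing in for the real networks
unet = nn.Linear(8, 8)           # the pre-trained diffusion UNet (frozen)
image_adapter = nn.Linear(8, 8)  # the Image Adapter (trained)
identity_net = nn.Linear(8, 8)   # IdentityNet (trained)

# Freeze the pre-trained diffusion model
unet.requires_grad_(False)

# Optimize only the Image Adapter and IdentityNet parameters
params = itertools.chain(image_adapter.parameters(), identity_net.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)
```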

Training used the open-source LAION-Face dataset, plus 10 million high-quality human images collected from the Internet and annotated automatically with BLIP-2. The experiments were based on the SDXL-1.0 model and ran on 48 NVIDIA H800 GPUs (80 GB) with a batch size of 2 per GPU.

Results

The qualitative results of InstantID’s performance across different settings are shown in the next picture. Column 1 illustrates the outcome when no text prompt is provided. Columns 2-4 demonstrate how the text prompt improves the output. Columns 5-9 show InstantID’s ability to leverage additional ControlNets, such as canny (a multi-step edge-detection process, illustrated after the figure) and depth.

Demonstration of the robustness, editability, and compatibility of InstantID (source: paper)
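
For reference, producing a canny edge map for a ControlNet condition takes a couple of lines with OpenCV (the file names and thresholds here are illustrative):

```python
import cv2

# Canny chains Gaussian smoothing, gradient computation, non-maximum
# suppression, and hysteresis thresholding
image = cv2.imread("pose_reference.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder
edges = cv2.Canny(image, 100, 200)  # low/high hysteresis thresholds
cv2.imwrite("canny_condition.png", edges)
```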

The figure below showcases the impact of the number of reference images on InstantID’s performance. When multiple reference images are provided, the model averages their ID embeddings to form the image prompt (a minimal sketch follows the figure). Remarkably, InstantID excels even when operating with a single reference image.

Effect of the number of reference images (source: paper)
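
Averaging the embeddings is as simple as it sounds; here is a minimal sketch, where `encode_face` is a hypothetical stand-in for the face encoder:

```python
import numpy as np

def average_id_embedding(images, encode_face):
    """Mean ID embedding over several reference images of the same person."""
    embeddings = np.stack([encode_face(img) for img in images])
    return embeddings.mean(axis=0)
```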

The authors compared their work with other personalized-generation methods that use a single reference image, such as IP-Adapter, IP-Adapter-FaceID, and IP-Adapter-FaceID-Plus (see the figure below). InstantID has two advantages:

  1. It uses ID embedding, which captures rich semantic information of the face, such as identity, age, and gender. This leads to more accurate and detailed face preservation.
  2. It introduces ID embedding at both the cross-attention and the IdentityNet levels. This allows for better text control and style integration.
Comparison with previous works (source: paper)

The authors also tested InstantID against character LoRA models, which require multiple reference images and additional training. They found that InstantID preserves identity and transfers style better, achieving comparable results with just a single image and no extra training.

Comparison of InstantID with pre-trained character LoRAs (source: paper)

Conclusion

InstantID is a novel method for personalized image generation with a single reference image. Using a single image prompt, you can recreate your face in any pose or style, for example as a celebrity, a cartoon character, or a painting. You can also provide multiple reference images.
