InfiniteYou: photo customization with identity preservation

ByteDance introduced InfiniteYou (InfU), a powerful model that allows flexible photo modifications based on textual descriptions while preserving facial identity.

InfiniteYou: high-fidelity, aesthetic, aligned identity image generation (source: paper)

Standout features

  • Prompt-based editing with identity preservation – Enables users to modify photos using text prompts while ensuring high-fidelity image generation and maintaining identity consistency.
  • Diffusion Transformer (DiT) with InfuseNet – Enhances identity similarity and generation quality by integrating a powerful diffusion-based architecture with identity-specific feature injection.
  • Multi-stage training strategy – Improves text-image alignment and overall image aesthetics through a combination of pretraining and supervised fine-tuning with synthetic single-person-multiple-sample (SPMS) data.
  • Plug-and-play design for seamless integration – Ensures compatibility with various existing tools, enabling personalized tasks such as stylization and multi-concept generation.

InfiniteYou is available as open source on GitHub, and a demo is provided on Hugging Face. The code is licensed under Apache-2.0, while the models are released under the Creative Commons Attribution-NonCommercial 4.0 International Public License, limiting their use to non-commercial purposes.

Advancements in identity-preserved image generation

Identity-preserved image generation is a challenging task because human facial features are highly complex, and accurately capturing subtle variations in expressions, angles, and details is difficult. Even minor discrepancies can significantly alter the perceived identity of an individual. Ambiguities in natural language further complicate the process, requiring models to infer intended edits like “add a smile.”

Traditional approaches, such as GANs and VAEs, often produce limited outputs or artifacts, while manually labeled datasets introduce biases that hinder generalization across facial structures and ethnicities.

To address these challenges, Diffusion Transformer (DiT) models have been developed as a better alternative (e.g., InstantID). Unlike GANs, which generate images directly in a single step, diffusion models progressively refine noise into realistic images, making them better suited for detailed and controlled editing. However, current methods still struggle with certain limitations: they don't fully preserve identity, have trouble with text-image coherence and editability, and can produce visible face copy-paste artifacts.

InfU builds upon Diffusion Transformer (DiT) models by introducing enhanced mechanisms for identity preservation and personalized image generation. It leverages a multi-stage training process and integrates additional components such as ControlNets, LoRAs, and OminiControl to improve controllability, flexibility, and multi-concept personalization.

The model

As illustrated in the image below, InfU employs a frozen DiT base model (e.g., FLUX) as the central image generation branch. The DiT model receives three distinct inputs: a Gaussian noise map (1), features extracted from an identity image (2), and features derived from a text prompt (3). By performing iterative denoising, the model integrates the information from these inputs to generate an image that aligns with the textual description while maintaining the facial identity of the subject.

The main framework of InfiniteYou (InfU) and the detailed architecture of InfuseNet (source: paper)
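
Conceptually, InfuseNet injects the identity information into the frozen DiT as residuals added to its hidden states, so the base model itself never needs fine-tuning. Below is a minimal toy sketch of this flow; all module names, shapes, and the stand-in denoiser are illustrative placeholders, not the actual implementation:

```python
import torch
import torch.nn as nn

class InfuseNetSketch(nn.Module):
    """Illustrative stand-in: projects identity features into a residual
    that is added to the frozen DiT's hidden states."""
    def __init__(self, id_dim=512, hidden_dim=768):
        super().__init__()
        self.proj = nn.Linear(id_dim, hidden_dim)

    def forward(self, id_features):
        return self.proj(id_features)

def generate(dit, infusenet, id_features, text_features, steps=30):
    """Iterative denoising: start from Gaussian noise (input 1) and refine it
    under guidance from identity (input 2) and text (input 3) features."""
    latent = torch.randn(1, 768)          # (1) Gaussian noise map
    id_residual = infusenet(id_features)  # (2) identity feature injection
    for _ in range(steps):
        # The frozen DiT predicts the denoising update; identity enters
        # through the residual instead of by modifying the DiT's weights.
        latent = dit(latent + id_residual, text_features)  # (3) text guidance
    return latent

# Toy "denoiser" standing in for a frozen DiT base model such as FLUX:
dit = lambda x, txt: x - 0.1 * (x - txt)
out = generate(dit, InfuseNetSketch(), torch.randn(1, 512), torch.randn(1, 768))
```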

A multi-stage training strategy

InfiniteYou employs a multi-stage training strategy to overcome challenges in text-image alignment, aesthetic quality, and image fidelity in personalized photo generation. This strategy involves pre-training on real data followed by supervised fine-tuning (SFT) on synthetically enhanced data.

Multi-stage training strategy with synthetic single-person-multiple-sample data and supervised fine-tuning (source: paper)

Training steps:

  1. Pre-training on real data. The model is initially trained using a dataset of real single-person portrait images. It learns identity preservation by reconstructing images, using the same real portrait as both the source and target.
  2. Evaluation of the pre-trained model. The model is evaluated to assess its strengths and weaknesses. While identity preservation is strong, the model struggles with text-image alignment, aesthetics, and image quality. These shortcomings indicate the need for further refinement.
  3. Synthetic data generation. To improve the model, synthetic images are generated using external enhancement tools (e.g., aesthetic modules, LoRAs, and face swap tools). This transforms the training data from single-person-single-sample (SPSS) to single-person-multiple-sample (SPMS). The real image is used as the identity reference, while the synthetic images serve as the new training targets (see the sketch after this list).
  4. Supervised fine-tuning (SFT) on synthetic data. The model is fine-tuned using the SPMS dataset, allowing it to learn enhanced aesthetics and better text-image alignment. The fine-tuning process ensures that the model generates high-quality images while retaining identity consistency with the real source images.
  5. Final inference and deployment. The fine-tuned model is ready for real-world use.
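
To make the SPSS-to-SPMS transition in step 3 concrete, here is a minimal sketch of the data construction; the enhancement function is a hypothetical placeholder for the external aesthetic, LoRA, and face-swap tools the authors mention:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    identity_image: str  # real portrait, kept as the identity reference
    target_image: str    # image the model learns to generate

def enhance(image_path: str, tool: str) -> str:
    """Hypothetical placeholder for an external enhancement tool
    (aesthetic module, LoRA, face swap, ...)."""
    return f"{image_path}.{tool}.png"

def spss_to_spms(real_portrait: str, tools=("aesthetic", "lora", "faceswap")):
    # SPSS: the single real portrait serves as both source and target.
    # SPMS: the real image stays the identity reference, while synthetic
    # variants produced by external tools become the new training targets.
    return [Sample(real_portrait, enhance(real_portrait, t)) for t in tools]

print(spss_to_spms("person_001.png"))
```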

This strategy ensures high-quality personalized image generation with strong identity retention and improved visual appeal. Unlike previous approaches, it achieves these enhancements without requiring extra plugins or post-processing during inference.

Evaluation

Qualitative evaluation: Qualitative comparisons against state-of-the-art baselines (FLUX.1-dev IP-Adapter and PuLID-FLUX) show InfU's superior performance across all evaluated dimensions: identity similarity, text-image alignment, image quality, and aesthetic appeal. The baselines exhibit limitations such as inadequate identity preservation, poor text-image alignment, degraded quality, and face copy-paste artifacts (see the picture below).

Qualitative comparison results of InfU with the state-of-the-art models (source: paper)

Quantitative evaluation: The performance of InfU was assessed using three key metrics:

  • ID Loss: Measures the difference between the generated image and the reference identity image.
  • CLIPScore: Evaluates the alignment between the generated image and the input text prompt.
  • PickScore: Assesses overall image quality and aesthetics.

The comparative results are as follows:

Method         | ID Loss ↓ | CLIPScore ↑ | PickScore ↑
FLUX.1-dev IPA | 0.772     | 0.243       | 0.204
PuLID-FLUX     | 0.225     | 0.286       | 0.212
InfU           | 0.209     | 0.318       | 0.221

ID Loss (lower is better), CLIPScore (higher is better), and PickScore (higher is better) comparative results (source: paper)

InfU achieved the lowest ID Loss, indicating superior identity preservation, along with the highest CLIPScore and PickScore, reflecting better text-image alignment and overall image quality.
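
As an illustration of how the identity metric is typically computed, ID Loss is commonly defined as one minus the cosine similarity between face embeddings of the reference and generated images (a generic formulation; the exact embedding model used in the paper may differ). CLIPScore and PickScore are computed analogously from CLIP-based image-text and human-preference models.

```python
import numpy as np

def id_loss(ref_embedding: np.ndarray, gen_embedding: np.ndarray) -> float:
    """1 - cosine similarity between face embeddings (lower is better).
    In practice the embeddings come from a face recognition model
    such as ArcFace."""
    cos = ref_embedding @ gen_embedding / (
        np.linalg.norm(ref_embedding) * np.linalg.norm(gen_embedding)
    )
    return float(1.0 - cos)

# Toy example with random 512-dimensional embeddings:
rng = np.random.default_rng(0)
print(id_loss(rng.normal(size=512), rng.normal(size=512)))
```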

User study: A user study was conducted comparing InfU with PuLID-FLUX. Participants from diverse backgrounds evaluated the generated images based on identity similarity, text-image alignment, image quality, and aesthetics. InfU was preferred in 80% of the evaluations, suggesting a significant advantage over PuLID-FLUX in terms of human perception.

Flexible design

InfU has a simple, adaptable design that works with many existing systems. You can easily swap its core model with faster versions like FLUX.1-schnell to speed up image creation.

InfU also improves control and customization by working with tools like ControlNets, LoRAs, and OminiControl. This allows for personalized images with multiple concepts, including specific identities and objects. Furthermore, InfU supports IP-Adapter (IPA) for creating stylized personalized images based on reference styles, delivering high-quality results.

InfU is compatible with various existing methods (source: paper)
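
As a rough illustration of what swapping the frozen base model looks like in practice, here is a sketch using the generic Hugging Face diffusers API rather than InfU's own code (the LoRA path is a placeholder):

```python
import torch
from diffusers import FluxPipeline

# Swap the base model: FLUX.1-schnell trades some quality for speed
# and needs only a few denoising steps.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # reduce GPU memory usage

# Optional LoRA for extra stylization (placeholder path).
pipe.load_lora_weights("path/to/style_lora.safetensors")

image = pipe("a portrait in watercolor style", num_inference_steps=4).images[0]
image.save("portrait.png")
```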

Using InfiniteYou: a quick start guide

InfiniteYou can be accessed through platforms like ComfyUI or Hugging Face demos. Below is a general guide to help you get started:

  1. Choose your platform: Select where to use InfiniteYou based on your preferences:
    • Hugging Face demo: Access InfiniteYou online, such as the InfiniteYou-FLUX demo.
    • ComfyUI: Install the InfiniteYou workflow locally for greater control and customization.
    • GitHub code: Clone the repository (ByteDance/InfiniteYou) for a custom setup.
  2. Prepare your input: Upload a photo of the person whose image you want to personalize and write a clear text prompt describing the desired output.
  3. Set up the environment (for ComfyUI or local setups): install the dependencies and download the model variant that fits your needs (e.g., sim_stage1 for stronger identity similarity or aes_stage2 for better aesthetics).
  4. Configure settings: Adjust parameters to optimize identity preservation, aesthetics, or text-image alignment.
  5. Generate the image: Run the model and review the output. If necessary, refine the prompt or settings and generate a new image (a minimal usage sketch follows these steps).
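
For a local GitHub setup, the overall flow looks roughly like the sketch below. The import path, class name, and arguments are assumptions made for illustration; check the repository's README for the actual entry points:

```python
from PIL import Image
# Hypothetical import -- follow the actual examples in the
# ByteDance/InfiniteYou repository.
from pipelines.pipeline_infu_flux import InfUFluxPipeline

# Step 3: load a downloaded model variant (e.g., aes_stage2 for better
# aesthetics or sim_stage1 for stronger identity similarity).
pipe = InfUFluxPipeline(
    base_model_path="black-forest-labs/FLUX.1-dev",
    infu_model_path="ByteDance/InfiniteYou",  # placeholder paths
    model_version="aes_stage2",
)

# Step 2: the identity photo and a clear text prompt.
id_image = Image.open("my_portrait.jpg")
prompt = "a professional headshot in an office, soft lighting"

# Steps 4-5: generate, review, and refine the prompt or settings as needed.
result = pipe(id_image=id_image, prompt=prompt)
result.save("output.jpg")
```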

This guide provides a streamlined process for using InfiniteYou efficiently across different platforms. Refer to the GitHub project documentation for more details.

In short, InfiniteYou takes your photo and your text prompt and generates a new picture that follows your instructions while keeping the key parts of your appearance the same. Whether you use it online or locally, the main steps are the same: provide a photo, describe what you want, and let the model create the new image.

Conclusion

InfU from ByteDance is an AI tool for high-quality, identity-accurate image generation. It makes it easy to personalize images, express creativity, and enhance virtual experiences, whether that means changing a photo's style, creating visuals for games and movies, or designing realistic avatars for virtual worlds.
