TryOnDiffusion: try on virtual clothes with the power of two UNets


TryOnDiffusion is a new method that leverages diffusion models and cross-attention to generate realistic images of how a garment worn by one person might look on another person. Its diffusion-based architecture unifies two UNets, letting the model warp the garment and blend it with the person in a single network.

The paper comes from a research team at the University of Washington and Google Research and was presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2023.

TryOnDiffusion lets you generate realistic images of how clothes fit your body shape, size, and pose, and how they complement your skin tone, hair color, and accessories.

Currently, U.S. shoppers using Google Shopping can virtually try on tops for women from various brands such as Anthropologie, Everlane, H&M and LOFT. Just look for products with the “Try On” badge on Search. Men’s tops and “other apparel” will be available later this year.

TryOnDiffusion generates a visualization of how the garment might look on the target person

The model aims to make these images look realistic while preserving garment details such as texture, color, and pattern.

Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively. It produces detailed garments at 1024×1024 resolution while warping them in shape and size to match the person.

The model

TryOnDiffusion uses two UNets (Parallel-UNet Diffusion) with cross-attention to synthesize realistic images. This architecture lets the model preserve garment details while adjusting the garment for body shape and pose in a single network. The model takes a person image and a garment image as input and generates a 1024×1024 try-on image.
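The key idea of linking the two UNets is that features from the garment UNet condition the person UNet via cross-attention. The official code is not public, so the following is only a minimal sketch of that mechanism; the class and argument names are illustrative:

```python
import torch
import torch.nn as nn

class GarmentCrossAttention(nn.Module):
    """Person features attend to garment features (illustrative sketch)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person_feats, garment_feats):
        # person_feats:  (B, N_p, dim) flattened person-UNet activations
        # garment_feats: (B, N_g, dim) flattened garment-UNet activations
        out, _ = self.attn(query=person_feats,
                           key=garment_feats,
                           value=garment_feats)
        return person_feats + out  # residual connection

# toy usage with random activations
person = torch.randn(2, 256, 128)    # e.g. a 16x16 feature map, flattened
garment = torch.randn(2, 256, 128)
fused = GarmentCrossAttention(128)(person, garment)
print(fused.shape)  # torch.Size([2, 256, 128])
```

Because the garment features act only as keys and values, the person branch keeps its own spatial layout while pulling in texture and pattern information from the garment branch.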

TryOnDiffusion pipeline

The model’s pipeline consists of four steps: preprocessing, 128×128 Parallel-UNet, 256×256 Parallel-UNet, and super-resolution diffusion. The input images are a person and a garment worn by another person; the output image is the person wearing the garment.

  1. Preprocessing: This step prepares the input images for the later stages. It removes the background from the person image and the garment image, so that only the person and the garment are left.
  2. 128×128 Parallel-UNet: This is the core of the pipeline. It takes the clothing-agnostic RGB image, the segmented garment, and the computed poses of the person and garment as inputs (the Try-on Conditional Inputs) and generates a realistic 128×128 try-on image of the person wearing the target garment.
  3. 256×256 Parallel-UNet: This step is similar to the previous one, but at 256×256 resolution. It takes the 128×128 try-on image from the previous step as input, along with the Try-on Conditional Inputs, and outputs a 256×256 image of the person wearing the target garment.
  4. Super-resolution diffusion: This is the final stage of the pipeline. It upsamples the 256×256 image to the final 1024×1024 image of the person wearing the garment, with the diffusion network iteratively denoising the upsampled image to sharpen details.
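The cascade above can be sketched as a function of the conditional inputs. The stage implementations below are shape-only stand-ins (simple interpolation instead of trained diffusion models), and every name is a placeholder rather than the paper's API:

```python
import torch
import torch.nn.functional as F

# Stand-ins for the diffusion stages: each real stage is a trained
# diffusion model; these only mimic the tensor shapes in the cascade.
def parallel_unet(noisy, cond, size):
    # would denoise `noisy` conditioned on the Try-on Conditional Inputs
    return F.interpolate(noisy, size=(size, size), mode="bilinear")

def super_res_diffusion(x, target_size):
    # would refine the upsampled image with a diffusion sampler
    return F.interpolate(x, size=(target_size, target_size), mode="bilinear")

def try_on(person_rgb, garment_seg, person_pose, garment_pose):
    cond = (person_rgb, garment_seg, person_pose, garment_pose)
    x128 = parallel_unet(torch.randn(1, 3, 128, 128), cond, 128)  # base try-on
    x256 = parallel_unet(F.interpolate(x128, size=(256, 256)), cond, 256)
    return super_res_diffusion(x256, 1024)                        # final image

out = try_on(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128),
             None, None)
print(out.shape)  # torch.Size([1, 3, 1024, 1024])
```

Note how the conditional inputs are reused at both Parallel-UNet resolutions, while each stage only ever upsamples the previous stage's output.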

The image below illustrates the Parallel-UNet for 128×128 resolution. It consists of two UNets: one that processes the person image and one that processes the garment image.

The 128×128 Parallel-UNet

The clothing-agnostic RGB is an image of the person with the original garment masked out, preserving the person's identity and body shape. The output is a 128×128 image showing how the person would look wearing the garment.


Datasets: The authors collected 4 million image pairs for training, where each pair shows the same person wearing the same garment in different poses. For testing, they assembled 6K unpaired images to evaluate the model's performance on new, unseen data.

The datasets include both men and women captured in different poses, with different body shapes, skin tones, and wearing a wide variety of garments with diverse texture patterns. In addition, the model was tested on the VITON-HD dataset, which has higher resolution images.

Training: The authors trained three diffusion models based on the U-Net architecture: one for the 128×128 Parallel-UNet, one for the 256×256 Parallel-UNet, and one for super-resolution diffusion. Each model is trained separately with its own hyperparameters.

During training, the two input images show the same person wearing the same garment but in different poses. This way, the method learns how to warp the garment and blend it with the person.
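Each stage is trained with the standard diffusion denoising objective: predict the noise that was added to the clean target image. A minimal sketch of that loss (the noise schedule here is illustrative, not the paper's exact one):

```python
import torch

def diffusion_loss(model, x0, cond, T=1000):
    """Standard denoising objective: the model predicts the noise that
    was mixed into the clean image x0 at a random timestep t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    # simple cosine-style schedule (illustrative)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2
    a = alpha_bar.view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward-noised image
    pred = model(xt, t, cond)                     # UNet noise prediction
    return torch.mean((pred - noise) ** 2)

# toy check with a dummy "model" that always predicts zero noise
dummy = lambda xt, t, cond: torch.zeros_like(xt)
loss = diffusion_loss(dummy, torch.randn(2, 3, 16, 16), cond=None)
print(loss.item() >= 0)  # True
```

In this setup, the paired training images supply both the clean target `x0` (the person in the garment) and the conditioning signals `cond` (clothing-agnostic RGB, segmented garment, poses).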

Inference: At inference time, the input images show two different people wearing different clothes in diverse poses. The method generates a new image showing how the garment might look on the first person.

Evaluation: TryOnDiffusion handles large occlusions, pose changes, and body-shape changes while preserving garment details at 1024×1024 resolution. In a user study, 15 non-experts compared and ranked more than 2K random samples; the model's results were judged superior to three recent state-of-the-art methods in 92.72% of cases.

TryOnDiffusion on eight target people (columns) dressed by five garments (rows)


TryOnDiffusion is a technology that generates realistic images of how a garment worn by another person might look on you.

It has many applications in practice, such as enhancing the online shopping experience for customers or enhancing the creative design process for fashion designers who want to experiment with different styles, colors, and patterns.

The model can also help to provide personalized recommendations for customers based on their body shape, size, and preferences. This way, you can discover new clothes that flatter your figure and express your personality.
