FreeU is a new technique to improve the quality of images and videos that are generated by diffusion models, without any additional training or fine-tuning.
The method was proposed by a research team from S-Lab at Nanyang Technological University.
FreeU is open source, and the code is available on GitHub.
FreeU uses the U-Net in a different way, by finding the optimal balance between removing noise and preserving details. It does this by adjusting the weights of the two components of the U-Net: the main network and the shortcut connections, also known as the backbone and the skip connections.
The new technique was evaluated on a variety of image and video generation tasks. The results showed that it significantly improves the generation quality of diffusion models.
What are diffusion models?
Diffusion models are generative models that create realistic images or videos from random noise. They work by gradually adding noise (Gaussian noise) to the original data and removing the noise step by step.
They usually employ a time-conditional U-Net architecture to implement the denoising model, which removes noise from data samples step by step and thereby improves the quality of the generated samples. Compared with alternative backbones such as Transformer-based diffusion models, U-Net is faster and occupies less memory, and it can preserve more of the details and semantics of the original data. (DDPM and DDIM, by contrast, are not architectures but sampling frameworks, which can be paired with either backbone.)
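The gradual noising process described above has a convenient closed form in the standard DDPM framework: a sample at step t can be drawn directly from the clean data. The sketch below illustrates this with numpy; the function names and the toy 8×8 "image" are illustrative, not from the FreeU code.

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) by mixing clean data with Gaussian noise.

    DDPM closed form: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps,
    where a_bar_t is the cumulative product of (1 - beta) up to step t.
    """
    a_bar = alphas_cumprod[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

# Linear beta schedule over 1000 steps, as in the original DDPM paper
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))  # a toy "image"
xt, eps = forward_diffuse(x0, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
# At the final step, a_bar is tiny, so x_t is almost pure Gaussian noise
```

The denoising U-Net is trained to run this process in reverse, predicting the noise `eps` from `xt` at each step.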
The U-Net architecture
The U-Net architecture is a type of neural network that has a U-shaped structure with an encoder (the left part) and a decoder (the right part).
- The encoder takes the noisy image as input and progressively downsamples and simplifies it, while extracting its features, such as shape, color, texture, and edges.
- The decoder reconstructs the image from the extracted features.
There are two important components in the U-Net model:
- The backbone features, which contain the essential information of the image or video, such as its shape, color, texture, and edges. They mainly serve to remove the noise.
- The skip connections, which transfer information from the encoder to the decoder, enabling the decoder to use high-resolution features from the encoder. They help the decoder recover the fine details and textures of the image or video.
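The two pathways above can be illustrated with a toy forward pass, a minimal sketch rather than a real U-Net: the encoder downsamples (here, simple average pooling), while the skip connection carries the full-resolution features directly to the decoder, which upsamples and concatenates them. All names are illustrative.

```python
import numpy as np

def avg_pool2(x):
    """Downsample by 2 with average pooling (toy encoder stage)."""
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    """Nearest-neighbour upsampling by 2 (toy decoder stage)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet_pass(x):
    skip = x                       # skip connection: full-resolution features
    h = avg_pool2(x)               # backbone path: coarse, smoothed features
    h_up = upsample2(h)            # decoder upsamples the backbone features...
    return np.stack([h_up, skip])  # ...and concatenates the skip features

x = np.arange(16, dtype=float).reshape(4, 4)
out = toy_unet_pass(x)
# out[0] carries the smoothed backbone features, out[1] the detailed skip features
```

Notice that the skip path reaches the decoder untouched: the pooling in the backbone path removes high-frequency detail, and the skip connection is what puts it back.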
The researchers found that the skip connections introduce high-frequency features into the decoder. If these features are too strong, they can overpower the backbone features and make the network focus on details and noise instead of the essential information, causing the U-Net to miss important features from the backbone. This can lead to artifacts and errors in the generated images or videos.
FreeU boosts diffusion model’s performance
To improve the quality of image or video synthesis, the team introduces the FreeU technique.
FreeU re-weights the contributions from the skip connections and the backbone features in a way that leverages the strengths of both components. The skip features and backbone features are concatenated at each decoding stage of the U-Net, and the FreeU operations are applied during this concatenation (see the picture below).
In the picture above, the factor b strengthens the backbone features, making them more prominent, while the factor s weakens the skip features, making them less dominant.
By doing this, FreeU can balance the contributions of the skip connections and the backbone features, and thus generates better images and videos.
The research team performed a series of experiments to evaluate the performance of the proposed FreeU, applying it to state-of-the-art models such as Stable Diffusion, DreamBooth, ModelScope, and Rerender.
FreeU can make images look better by removing irregularities and rendering structures in more detail.
The next pictures show how FreeU can improve the visual details and narrative of the images generated from text input. More examples are available on the project page.
Text to image
Personalized Text to Image
The authors conducted a quantitative evaluation of the FreeU technique on a text-to-image diffusion model, Stable Diffusion (SD), and a text-to-video diffusion model, ModelScope.
They asked 35 participants to compare image and video quality, as well as alignment with the text prompts, between the baseline models and the models with FreeU. The majority of participants preferred the models with FreeU, indicating that it improved image and video synthesis performance.
FreeU is a new method that improves the quality of images and videos generated by diffusion models, without any additional costs. It does this by optimizing the contribution of features from the skip connections and backbone of the diffusion U-Net architecture.
This allows the model to focus on the features that are most important for generating high-quality images and videos, and to ignore the features that are less important.