Ultra Fast ControlNet with Hugging Face Diffusers

March 23, 2023

Ultra Fast ControlNet with Hugging Face Diffusers is a new pipeline that lets users control text-to-image generation by adding extra conditions.

In short, the approach combines the Hugging Face Diffusers library with the ControlNet neural network to give finer control over text-to-image generation.

The new pipeline can transform a cartoon drawing into a realistic photo with remarkable consistency, bring popular logos to life, or turn a rough sketch into a refined artwork.

Sketch scribble turned into an artistic drawing

What is ControlNet?

ControlNet is a neural network that controls diffusion models, including Stable Diffusion, by adding conditional inputs such as scribbles, edge maps, segmentation maps, or pose keypoints during text-to-image generation.

The research paper presents eight different conditioning models, all of which are compatible with Diffusers. Each new type of conditioning requires training a new set of ControlNet weights.
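As a rough illustration of how one of these conditioning models can be loaded in Diffusers, here is a minimal sketch. The helper `load_controlnet_pipeline`, the lookup table, and the default base model are assumptions made for this example. The Hub repository names are the ones the checkpoints were published under and may have moved since. The code also assumes a CUDA GPU and half-precision weights.

```python
# Hub IDs of the eight ControlNet checkpoints released for Stable Diffusion 1.5.
# Assumption: these repository names may be renamed or moved over time.
CONTROLNET_CHECKPOINTS = {
    "canny": "lllyasviel/sd-controlnet-canny",
    "depth": "lllyasviel/sd-controlnet-depth",
    "hed": "lllyasviel/sd-controlnet-hed",
    "mlsd": "lllyasviel/sd-controlnet-mlsd",
    "normal": "lllyasviel/sd-controlnet-normal",
    "openpose": "lllyasviel/sd-controlnet-openpose",
    "scribble": "lllyasviel/sd-controlnet-scribble",
    "seg": "lllyasviel/sd-controlnet-seg",
}

# Assumption: any Stable Diffusion 1.5 checkpoint in Diffusers format can be used here.
DEFAULT_BASE_MODEL = "runwayml/stable-diffusion-v1-5"


def load_controlnet_pipeline(condition, base_model=DEFAULT_BASE_MODEL):
    """Build a Stable Diffusion pipeline steered by one ControlNet (hypothetical helper)."""
    if condition not in CONTROLNET_CHECKPOINTS:
        raise ValueError(
            f"Unknown condition {condition!r}; expected one of {sorted(CONTROLNET_CHECKPOINTS)}"
        )

    # Heavy imports are deferred so the lookup table stays usable without them.
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Load the ControlNet weights trained for the requested condition.
    controlnet = ControlNetModel.from_pretrained(
        CONTROLNET_CHECKPOINTS[condition], torch_dtype=torch.float16
    )
    # Attach the ControlNet to the base text-to-image model.
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        base_model, controlnet=controlnet, torch_dtype=torch.float16
    )
    return pipe.to("cuda")  # assumption: a CUDA device is available
```

A call such as `load_controlnet_pipeline("scribble")` would return a pipeline that takes a prompt plus a conditioning image, for example `pipe("an owl, ink drawing", image=scribble_image).images[0]`, where `scribble_image` is a placeholder for your own PIL image.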

To train a ControlNet, the pre-trained parameters of a diffusion model (such as Stable Diffusion’s latent UNet) are first cloned into a “trainable copy”, while the original parameters are preserved in a separate “locked copy”.

The “locked copy” preserves the vast knowledge learned from a large dataset.

The “trainable copy” learns your condition. Learning is robust even when the training dataset is small, which makes it practical to train a ControlNet on a personal device.

The “locked” and “trainable” copies are then connected via “zero convolution” layers.
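In the ControlNet paper, a zero convolution is a 1×1 convolution whose weights and bias are initialized to zero. As a result, the trainable branch contributes nothing at the start of training, and the combined model behaves exactly like the original one. The toy sketch below illustrates this with plain Python lists. The `block` function, the parameter values, and the two-channel "images" are made up for illustration and stand in for a real UNet block.

```python
import copy


def conv1x1(pixels, weight, bias):
    """Apply a 1x1 convolution: an independent linear map over channels at each pixel."""
    return [
        [b + sum(w * x for w, x in zip(row, pixel)) for row, b in zip(weight, bias)]
        for pixel in pixels
    ]


def zero_conv_params(channels):
    """Weights and bias of a zero convolution: everything starts at zero."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels


def block(pixels, params):
    """Toy stand-in for one network block (hypothetical; a real UNet block is far larger)."""
    return conv1x1(pixels, params["weight"], params["bias"])


def controlnet_block(x, condition, locked, trainable, zero_in, zero_out):
    """Locked block output plus the trainable branch, joined through zero convolutions."""
    # The condition enters the trainable copy through the first zero convolution.
    injected = conv1x1(condition, *zero_in)
    trainable_input = [[a + b for a, b in zip(px, cx)] for px, cx in zip(x, injected)]
    # The trainable copy's output is added back through the second zero convolution.
    residual = conv1x1(block(trainable_input, trainable), *zero_out)
    base = block(x, locked)
    return [[a + b for a, b in zip(bp, rp)] for bp, rp in zip(base, residual)]


# Pretrained parameters of the toy block (made-up numbers for illustration).
locked = {"weight": [[1.0, 2.0], [0.5, -1.0]], "bias": [0.1, 0.2]}
trainable = copy.deepcopy(locked)  # the "trainable copy" starts as an exact clone

x = [[1.0, 2.0], [3.0, 4.0]]          # two pixels, two channels
condition = [[9.0, 9.0], [7.0, 7.0]]  # e.g. an encoded edge map

before_training = controlnet_block(
    x, condition, locked, trainable, zero_conv_params(2), zero_conv_params(2)
)
# With zero-initialized connections, the output equals the locked model's output.
print(before_training == block(x, locked))  # True
```

Once training updates the zero convolutions and the trainable copy, the residual becomes non-zero and the condition starts to influence the output, while the locked parameters stay untouched.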

It is possible to merge multiple ControlNet conditionings to generate a single image.
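In Diffusers, several ControlNets can be combined by passing a list of models and a matching list of conditioning images to the pipeline. To the best of our understanding, the residuals produced by each ControlNet are then scaled and summed before being added to the UNet. The sketch below illustrates only that weighted sum, using plain Python lists. The `combine_residuals` helper and all numbers are hypothetical.

```python
def combine_residuals(residuals, scales):
    """Weighted element-wise sum of per-ControlNet residuals (hypothetical helper)."""
    if len(residuals) != len(scales):
        raise ValueError("need exactly one scale per ControlNet residual")
    combined = [0.0] * len(residuals[0])
    for residual, scale in zip(residuals, scales):
        combined = [c + scale * r for c, r in zip(combined, residual)]
    return combined


pose_residual = [0.5, -1.0, 2.0]  # made-up output of a pose ControlNet
edge_residual = [1.0, 1.0, -2.0]  # made-up output of an edge-map ControlNet

# The pose condition is applied at full strength, the edge map at half strength.
combined = combine_residuals([pose_residual, edge_residual], scales=[1.0, 0.5])
print(combined)  # [1.0, -0.5, 1.0]
```

In the actual pipeline, the per-model weights correspond to the `controlnet_conditioning_scale` argument, which accepts one value per ControlNet. Lowering a scale weakens that condition's influence on the final image.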

The experimental results

ControlNets have been tested with various image-based conditions. The experiments indicate that, for relatively simple objects, the model controls details accurately.

Controlling Stable Diffusion (anime weights) with cartoon line drawings

Controlling Stable Diffusion with human pose (“Michael Jackson’s concert”)

However, if the model misinterprets the semantics of the input condition, it may struggle to generate the intended content. Addressing this limitation may be a topic for future research.

Conclusion

As large text-to-image models become increasingly common, users may find that a brief text description is all it takes to create visually stunning images.