SDXL: the next generation of Stable Diffusion models for text-to-image synthesis

Stable Diffusion XL (SDXL) is the latest text-to-image generation model developed by Stability AI, built on latent diffusion techniques. SDXL can create highly realistic images for media, entertainment, education, and industrial domains, opening up new practical uses of AI imagery.

It outperforms previous versions of the Stable Diffusion models and is competitive with the best image generators on the market.

The SDXL series offers more than simple text prompting. It can also manipulate images in various ways, such as image-to-image prompting (producing different versions of an input image), inpainting (completing the missing parts of an image), and outpainting (extending the boundaries of an existing image while maintaining the overall style and content).
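As an illustration, here is a minimal image-to-image sketch using the Hugging Face diffusers library. The checkpoint id, file names, and generation settings are assumptions for illustration, not the only way to run SDXL:

```python
# Minimal image-to-image sketch with Hugging Face diffusers (illustrative only).
import torch
from diffusers import StableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # assumed public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = load_image("input.png").resize((1024, 1024))  # hypothetical local file

# strength controls how far the output may drift from the input image
result = pipe(
    prompt="a watercolor painting of the same scene",
    image=init_image,
    strength=0.6,
    guidance_scale=7.5,
).images[0]
result.save("variation.png")
```

Inpainting and outpainting follow the same pattern, with an additional mask image that marks the region to be filled in or extended.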

SDXL 0.9 (released in June 2023) is the newest model in the Stable Diffusion text-to-image suite, following Stable Diffusion XL beta (released in April 2023). SDXL can be used on the Stability AI platforms ClipDrop and DreamStudio, along with other image-generation tools such as NightCafe. The diffusion model weights (base and refiner) are also available.

A full open release of SDXL 1.0 was planned for mid-July 2023.


Here are two examples of prompts tested on both SDXL beta (left) and SDXL 0.9 (right).

Prompt: ✨aesthetic✨ aliens walk among us in Las Vegas, scratchy found film photograph. Image source
Prompt: A wolf in Yosemite National Park, chilly nature documentary film photography 
Negative prompt: 3d render, smooth, plastic, blurry, grainy, low-resolution, anime, deep-fried, oversaturated. Image source

SDXL 0.9 greatly improves image and composition quality over the previous version and can generate hyper-realistic images for a wide range of domains and applications.

Prompt: beautiful scenery nature glass bottle landscape, purple galaxy bottle (SDXL 0.9 – 1024×1024). Image source

The SDXL model

Diffusion models are a type of generative model that produces high-quality images by learning to reverse a gradual noising process: during training, noise is progressively added to an image, and the model learns to remove it. Latent diffusion models operate in the latent space of powerful pretrained autoencoders, which reduces the complexity and size of the model while preserving the details and fidelity of the generated images.
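To make the idea concrete, the sketch below shows a schematic latent-diffusion training step: encode the image into latents, add noise for a random timestep, and train the network to predict that noise. The objects `vae`, `unet`, and `scheduler` are placeholders in the style of the diffusers library; this is not the actual SDXL training code.

```python
# Schematic latent-diffusion training step (illustrative sketch, not SDXL's real training loop).
import torch
import torch.nn.functional as F

def training_step(images, text_embeddings, vae, unet, scheduler):
    # 1. Encode images into the latent space of the pretrained autoencoder.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor

    # 2. Add noise according to a randomly drawn diffusion timestep.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps, (latents.shape[0],), device=latents.device
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    # 3. Train the UNet to predict the added noise, i.e. to invert the noising process.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_embeddings).sample
    return F.mse_loss(noise_pred, noise)
```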

The SDXL model is a new latent diffusion model that has been shown to achieve state-of-the-art results on a variety of image generation tasks. The SDXL model has several key innovations that make it more powerful than previous latent diffusion models.

  1. it uses a significantly larger UNet backbone and operates on a larger latent resolution than previous models, which allows it to generate more realistic and diverse images. 
  2. it uses a two-stage generation process (a base model followed by a refiner), which allows the model to produce images with more fine-grained detail. 
  3. the refinement stage improves the quality of the generated images by removing noise and artifacts in the final denoising steps.

The SDXL pipeline and training

The SDXL pipeline is a two-stage process for generating images:

  1. a base diffusion model that generates initial latents of size 128 × 128 from the input prompt; it is responsible for the overall structure of the image.
  2. a refinement model that adds finer details to the latents produced by the base model (see the picture and the code sketch below).
SDXL: the two-stage pipeline
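A minimal sketch of this two-stage pipeline with the Hugging Face diffusers library is shown below. The checkpoint ids and the 0.8 denoising split are assumptions for illustration:

```python
# Two-stage SDXL generation sketch: the base model produces latents, the refiner polishes them.
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

prompt = "A wolf in Yosemite National Park, chilly nature documentary film photography"

# Stage 1: the base model generates 128x128 latents (decoded later to 1024x1024 pixels).
latents = base(prompt=prompt, output_type="latent", denoising_end=0.8).images

# Stage 2: the refiner takes over the last denoising steps and adds fine detail.
image = refiner(prompt=prompt, image=latents, denoising_start=0.8).images[0]
image.save("wolf.png")
```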

SDXL was trained on a dataset of over 100 million images from various domains, such as animals, landscapes, people, food, etc. The dataset was split into different aspect ratios, such as 1:1, 4:3, 16:9, etc. The model was trained in three steps:

  1. pre-train a base model at a resolution of 256×256 pixels, using size and crop-conditioning, on an internal dataset.
  2. train it on larger 512×512 pixel images of the same (square) shape.
  3. in the final stage of training, use multi-aspect training in combination with offset noise, training on different aspect ratios with an area of approximately 1024 × 1024 pixels (a sketch of the bucketing idea follows below).
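The multi-aspect stage groups training images into resolution "buckets" whose pixel area stays close to 1024 × 1024. The sketch below shows one plausible way to build such buckets and assign images to them; the concrete bucket list used by Stability AI is not reproduced here, so the numbers are illustrative.

```python
# Illustrative aspect-ratio bucketing: choose training resolutions (multiples of 64)
# with an area close to 1024*1024, then assign each image to the closest-ratio bucket.
def make_buckets(target_area=1024 * 1024, step=64, min_dim=512, max_dim=2048):
    buckets = []
    width = min_dim
    while width <= max_dim:
        height = round(target_area / width / step) * step
        if min_dim <= height <= max_dim:
            buckets.append((width, height))
        width += step
    return buckets

def assign_bucket(img_width, img_height, buckets):
    ratio = img_width / img_height
    return min(buckets, key=lambda wh: abs(wh[0] / wh[1] - ratio))

buckets = make_buckets()
print(assign_bucket(1920, 1080, buckets))  # e.g. a wide bucket such as (1344, 768)
```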

Architecture & scale

The following table compares SDXL with previous Stable Diffusion models. SDXL uses a different arrangement of transformer blocks within the UNet, shifting the bulk of the transformer computation to lower-resolution feature levels.

It also employs a more powerful text-conditioning setup that combines OpenCLIP ViT-bigG with CLIP ViT-L (a small sketch of how their features combine follows the table). The UNet in SDXL has 2.6B parameters and the text encoders have 817M parameters.

| Model | SDXL | SD 1.4/1.5 | SD 2.0/2.1 |
|---|---|---|---|
| # of UNet params | 2.6B | 860M | 865M |
| Transformer blocks | [0, 2, 10] | [1, 1, 1, 1] | [1, 1, 1, 1] |
| Channel multiplier | [1, 2, 4] | [1, 2, 4, 4] | [1, 2, 4, 4] |
| Text encoder | CLIP ViT-L & OpenCLIP ViT-bigG | CLIP ViT-L | OpenCLIP ViT-H |
| Context dimension | 2048 | 768 | 1024 |
| Pooled text embedding | OpenCLIP ViT-bigG | N/A | N/A |
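The 2048-dimensional context in the table comes from concatenating the per-token features of the two text encoders along the channel axis (768 from CLIP ViT-L plus 1280 from OpenCLIP ViT-bigG). A small sketch of that idea, using random tensors as stand-ins for the encoder outputs:

```python
# Sketch: assembling SDXL's cross-attention context from two text encoders.
# Random tensors stand in for the encoders' per-token hidden states.
import torch

batch, tokens = 1, 77
feats_clip_l = torch.randn(batch, tokens, 768)      # CLIP ViT-L hidden states
feats_clip_bigg = torch.randn(batch, tokens, 1280)  # OpenCLIP ViT-bigG hidden states

# Concatenated along the channel axis: 768 + 1280 = 2048, the context dimension above.
context = torch.cat([feats_clip_l, feats_clip_bigg], dim=-1)
print(context.shape)  # torch.Size([1, 77, 2048])

# The pooled OpenCLIP ViT-bigG embedding additionally serves as the pooled text embedding.
pooled = torch.randn(batch, 1280)
```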

Micro-conditioning

Micro-conditioning is a technique used in the SDXL image generation model to improve the quality of the generated images. It works by providing the model with additional information about each training image and about the desired output image.

Concretely, the UNet is conditioned on extra scalar parameters, such as the original resolution of the training image and the crop coordinates applied during preprocessing. These values are embedded with Fourier features and added to the timestep embedding. This allows SDXL to train on images of many sizes without discarding small images, and at inference time the conditioning values can be set to request a high apparent resolution and a well-centered, uncropped composition.

Micro-conditioning adds almost no computational overhead, yet it noticeably improves the quality and framing of the generated images.
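A rough sketch of this mechanism is shown below: each conditioning scalar gets a sinusoidal (Fourier-style) embedding, the embeddings are concatenated, and the result is folded into the timestep embedding. The dimensions and the linear projection are illustrative placeholders, not SDXL's exact implementation.

```python
# Illustrative micro-conditioning: embed scalar conditions (original size, crop offsets,
# target size) with Fourier features and add them to the timestep embedding.
import math
import torch

def fourier_embed(value, dim=256, max_period=10000.0):
    # Sinusoidal embedding of a scalar, in the same spirit as timestep embeddings.
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    angles = value * freqs
    return torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)

def micro_condition(timestep_emb, original_size, crop_top_left, target_size):
    scalars = [*original_size, *crop_top_left, *target_size]
    emb = torch.cat([fourier_embed(torch.tensor([float(s)])) for s in scalars], dim=-1)
    # Placeholder projection onto the timestep-embedding width before adding.
    proj = torch.nn.Linear(emb.shape[-1], timestep_emb.shape[-1])
    return timestep_emb + proj(emb)

# Example: condition on a 1024x1024 original, no cropping, 1024x1024 target.
t_emb = torch.zeros(1, 1280)
conditioned = micro_condition(t_emb, (1024, 1024), (0, 0), (1024, 1024))
```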

Conditioning the model on cropping parameters

In previous SD models, objects in the generated images could appear cut off, a side effect of the random cropping applied to training images.

For the SDXL model, the research team found a different solution: the model is told how each training image was cropped (the crop coordinates), and at inference time these coordinates can simply be set to zero to request an uncropped composition. The results are much improved (see the picture and the sketch below).

A typical failure mode of previous SD models: Synthesized objects can be cropped. Image source
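The sketch below illustrates the training side of this idea: the random crop is applied as usual, but its top-left corner is recorded so it can be fed to the model as conditioning. The helper name is hypothetical; at inference time the crop coordinates are simply set to (0, 0) to ask for a well-centered, uncropped subject.

```python
# Hypothetical data-preparation helper illustrating crop-coordinate conditioning.
import random
from PIL import Image

def random_crop_with_coords(img: Image.Image, size: int):
    # Crop as usual, but also return the (top, left) offsets as conditioning values.
    c_left = random.randint(0, max(img.width - size, 0))
    c_top = random.randint(0, max(img.height - size, 0))
    crop = img.crop((c_left, c_top, c_left + size, c_top + size))
    return crop, (c_top, c_left)

# Training: the recorded offsets become part of the conditioning signal.
# image, (c_top, c_left) = random_crop_with_coords(Image.open("sample.jpg"), 1024)

# Inference: passing (c_top, c_left) = (0, 0) requests an uncropped, well-framed subject.
```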

Evaluation

SDXL was evaluated on various metrics and tasks, including image quality, diversity, fidelity, composition, inpainting, outpainting, and image-to-image prompting. SDXL was compared with previous versions of Stable Diffusion and other state-of-the-art image generators, such as DALL-E, VQGAN+CLIP, and BigGAN+CLIP.

SDXL was able to generate images that were more semantically similar to the text prompts than previous versions of Stable Diffusion, being competitive with those of black-box state-of-the-art image generators.

The authors tested SDXL with and without the refinement stage by asking users to choose their favorite image from among four models. The results showed that SDXL with the refinement stage was preferred most often, well ahead of the other models.

The figure below shows users' preferences between SDXL and Stable Diffusion. We can see that SDXL outperforms Stable Diffusion 1.5 & 2.1, and that the extra refinement stage improves it further.

Comparing user preferences between SDXL and Stable Diffusion 1.5 & 2.1

The training, refinement, and evaluation of SDXL are all ongoing processes. As the model continues to be developed, the team at Stability AI is constantly working to improve its performance.

Conclusion, future work

SDXL is a powerful text-to-image model that can generate realistic images from natural language prompts. It is one of the largest and most advanced models of its kind, and it offers a range of functionalities and features.

You can access and try SDXL whether you want to capture and edit images with your phone, create and share images online, or integrate image generation into your own projects.

The model has some limitations to be addressed in future work, including difficulty in synthesizing fine-grained details (like human hands), potential biases inherited from the training data, and high computational requirements.
