Stable Diffusion (SD) is a text-to-image diffusion model released in 2022 by Stability AI.

This deep learning model generates detailed images from text prompts, and it can also perform related tasks such as inpainting (filling in missing areas of an image), outpainting (extending an image beyond its original borders), and image-to-image translation (transforming an input image into an output image with changed characteristics such as style, colors, or texture, for example converting a daytime scene into a nighttime one), all guided by text prompts.

Stable Diffusion was developed through a joint effort of Stability AI, academic researchers, and non-profit organizations.

Images generated by SD

A fundamental characteristic of Stable Diffusion is that it is a generative model: it creates new images that resemble the data it was trained on. This capability enables it to generate a wide range of images consistent with the input textual description.

Unlike earlier text-to-image models such as DALL-E and Midjourney, which were accessible only through cloud services, Stable Diffusion is open source and can run on most consumer hardware with a GPU that has at least 8 GB of VRAM.

Diffusion Models

Diffusion models are generative models that create new data resembling the data they were trained on. They are trained by adding random (Gaussian) noise to the data and then learning to reverse that noising process.

A diffusion model involves two main processes:

  • The forward process (diffusion): the model gradually adds noise to the data (a minimal sketch of this step follows the list).
  • The backward process (denoising): the model learns to remove the added noise and thereby generate new, similar data.
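
To make the forward process concrete, here is a minimal PyTorch sketch of the closed-form DDPM-style noising step; the function name, beta schedule, and tensor shapes are illustrative assumptions rather than Stable Diffusion's actual training code.

```python
import torch

def forward_diffusion(x0, t, betas):
    """Closed-form DDPM forward process: noise clean data x0 up to timestep t."""
    alphas = 1.0 - betas
    alpha_bar_t = torch.cumprod(alphas, dim=0)[t]    # cumulative product of (1 - beta) up to step t
    noise = torch.randn_like(x0)                     # Gaussian noise epsilon ~ N(0, I)
    xt = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * noise
    return xt, noise                                 # noisy sample and the noise the model learns to predict

# Example: linear beta schedule with 1000 steps, as in the original DDPM paper
betas = torch.linspace(1e-4, 0.02, 1000)
x0 = torch.randn(1, 3, 64, 64)                       # stand-in for a clean (latent) image
xt, eps = forward_diffusion(x0, t=500, betas=betas)
```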

SD architecture

The architecture of Stable Diffusion consists of three main components: a text encoder, a denoising U-Net, and an image decoder. The text encoder encodes the input textual description into a vector representation that conditions the generation process. The U-Net carries out the diffusion (denoising) process in a compressed latent space, guided by that representation, and the decoder, the second half of a variational autoencoder (VAE), reconstructs the final high-quality image from the denoised latent.
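
As a rough illustration of these components, the Hugging Face diffusers implementation exposes them directly on its pipeline object; the model ID below is an illustrative choice, and loading it requires the diffusers and transformers packages.

```python
from diffusers import StableDiffusionPipeline

# Load a pretrained checkpoint and inspect its building blocks (model ID is illustrative)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

print(type(pipe.text_encoder).__name__)  # text encoder that embeds the prompt
print(type(pipe.unet).__name__)          # U-Net that performs the denoising in latent space
print(type(pipe.vae).__name__)           # variational autoencoder that decodes latents to pixels
```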

The model conditions the denoising process on the text through a cross-attention mechanism. Cross-attention operates on three tensors produced by learned projections: a query tensor Q, a key tensor K, and a value tensor V.

In machine learning, a tensor is a multi-dimensional array that can be used to represent data, such as images, videos, or text.

Given a query tensor Q, a key tensor K, and a value tensor V, the cross-attention mechanism computes a weighted sum of the values, where the weights are based on the similarity between the queries and the keys; this weighted sum is the output. In Stable Diffusion, the queries come from the image latents and the keys and values come from the encoded text prompt, which is how the prompt steers image generation.
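
Below is a minimal sketch of scaled dot-product cross-attention of this kind; the tensor shapes are illustrative, and the learned linear projections that produce Q, K, and V in the actual model are omitted for brevity.

```python
import torch

def cross_attention(q, k, v):
    """Scaled dot-product attention: weight the values V by the similarity between Q and K."""
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # similarity between queries and keys
    weights = torch.softmax(scores, dim=-1)       # normalized attention weights
    return weights @ v                            # weighted sum of the values

# Queries come from the image latents, keys/values from the encoded text prompt
latent_tokens = torch.randn(1, 64, 320)   # e.g. 8x8 latent positions with 320-dim features
text_tokens = torch.randn(1, 77, 320)     # e.g. 77 text tokens projected to 320 dims
out = cross_attention(latent_tokens, text_tokens, text_tokens)
print(out.shape)                          # torch.Size([1, 64, 320])
```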

The VAE's encoder takes an image x in RGB space and compresses it into a latent representation z = E(x), while the decoder reconstructs the image from that latent, giving x̃ = D(z) = D(E(x)).
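
Here is a minimal sketch of this encode/decode round trip using the diffusers implementation of the SD variational autoencoder; the model ID and the random stand-in image are illustrative assumptions.

```python
import torch
from diffusers import AutoencoderKL

# Load a pretrained SD VAE (model ID is an illustrative choice)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

x = torch.randn(1, 3, 512, 512)               # stand-in for an RGB image scaled to [-1, 1]
with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()    # z = E(x): a 4-channel 64x64 latent
    x_rec = vae.decode(z).sample              # x~ = D(z) = D(E(x))

print(z.shape, x_rec.shape)                   # (1, 4, 64, 64) and (1, 3, 512, 512)
```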

Training data

Stable Diffusion was trained on pairs of images and captions from LAION-5B, a dataset of 5 billion image-text pairs derived from Common Crawl data scraped from the web. The LAION-5B dataset was classified based on language and filtered into separate datasets by resolution, likelihood of containing a watermark, and predicted “aesthetic” score. Stable Diffusion’s generation outputs can be fine-tuned to match more specific use cases through three methods: embedding, hypernetwork, and DreamBooth.
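
As one example, the embedding (textual inversion) route can be applied on top of a base checkpoint with the diffusers library; the repository IDs and the <cat-toy> pseudo-token below are illustrative examples from the Hugging Face Hub.

```python
from diffusers import StableDiffusionPipeline

# Load a base checkpoint, then apply a learned textual-inversion embedding
# (both repository IDs are illustrative examples)
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")

# The embedding adds a new pseudo-token that can be used directly in prompts
image = pipe("a photo of a <cat-toy> on a wooden table").images[0]
image.save("cat_toy.png")
```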

Pipeline

The pipeline of Stable Diffusion (SD) involves several steps (an end-to-end code sketch follows the list):

  1. Input Prompt Encoding: The user inputs a text prompt, which is tokenized and encoded into a vector representation by a pre-trained text encoder (Stable Diffusion uses CLIP's text encoder for this step).
  2. Latent Diffusion: The encoded prompt conditions the latent diffusion process, which is the core of SD. The model starts from a field of pure noise in latent space and iteratively denoises it, with cross-attention steering each step toward shapes and concepts that match the words in the prompt.
  3. Image Generation: Once denoising is complete, the VAE decoder converts the final latent representation into a full-resolution image.
  4. Image Refinement: The generated image can optionally be refined or upscaled, for example with additional denoising passes or a separate super-resolution model, to improve its quality and resolution.
  5. Post-Processing: The final generated image may undergo post-processing to adjust its brightness, contrast, color balance, and other parameters.
  6. Output: The generated image is then displayed to the user or saved to a file.
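
For concreteness, here is a minimal end-to-end sketch of this pipeline using the Hugging Face diffusers library; the model ID, step count, and guidance scale are illustrative choices, and a CUDA GPU with sufficient VRAM is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a pretrained Stable Diffusion checkpoint (model ID is one common choice)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

prompt = "a photograph of an astronaut riding a horse"
image = pipe(
    prompt,
    num_inference_steps=30,   # number of denoising steps in latent space
    guidance_scale=7.5,       # how strongly the prompt conditions the denoising
).images[0]

image.save("astronaut.png")
```

The guidance_scale value controls how strongly the denoising is pushed toward the prompt (classifier-free guidance); higher values follow the prompt more literally at the cost of diversity.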

Applications of Stable Diffusion as a text-to-image model

Stable Diffusion has a wide range of applications in computer vision, including image synthesis, image translation, and image manipulation.

Another application of Stable Diffusion is in the field of virtual reality and augmented reality. By generating high-quality images from textual descriptions, it is possible to create immersive virtual environments that are consistent with the user’s input.

Stable Diffusion models trained from scratch

On March 24, 2023, a new fine-tuned Stable Diffusion model called Stable unCLIP 2.1 was released. It is based on SD2.1-768 and can handle image variations and mixing operations. Instructions for using the model are available on Hugging Face, and a public demo of SD-unCLIP is also available online.
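
A rough sketch of generating image variations with the diffusers Stable unCLIP pipeline is shown below; the model ID, file names, and exact call arguments are assumptions based on the diffusers documentation and may need adjusting.

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

# Load the Stable unCLIP image-variation pipeline (model ID and dtype are illustrative)
pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA GPU is available

init_image = load_image("input.png")            # any local image to produce variations of
variation = pipe(image=init_image).images[0]    # a text prompt can optionally be added
variation.save("variation.png")
```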
