PIXART-α: 10x faster text-to-image diffusion model with state-of-the-art results


PIXART-α is a new text-to-image (T2I) diffusion model based on the transformer architecture. It generates high-resolution images up to 1024 × 1024 pixels while being faster and cheaper to train than previous state-of-the-art models.

PIXART-α requires only 10.8% of Stable Diffusion v1.5's training time, and its training cost is only 0.85% of that of a larger state-of-the-art model like RAPHAEL.

It creates images with realistic and diverse details that match the text prompts, competing with the latest T2I generative models, such as Imagen, SDXL, and even Midjourney.

The next pictures show that the images generated by PIXART-α demonstrate outstanding quality and accurately match the textual descriptions provided.

A small cactus with a happy face in the Sahara desert. (source: paper)
An alpaca made of colorful building blocks, cyberpunk. (source: project page)
Half human, half robot, repaired human. (source: project page)

The following image compares the CO2 emissions and training costs of different T2I generators. PIXART-α has a very low training cost of $26,000: its CO2 emissions and training cost are only 1.1% and 0.85% of RAPHAEL's, respectively, and its data usage is less than 0.2%.

Comparisons of data usage, training time, CO2 emission, and training cost among T2I generators. (source: paper)

Model architecture

The model architecture of PIXART-α is based on the Diffusion Transformer (DiT), which is a transformer-based generative model that uses a diffusion process to produce images from noise (see the next picture).

Model architecture of PIXART-α (source: paper)
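At the heart of any diffusion model is the forward (noising) process that the network learns to reverse. Below is a minimal numpy sketch of sampling x_t ~ q(x_t | x_0) = N(√ᾱ_t·x_0, (1 − ᾱ_t)·I); the linear beta schedule and tiny latent shape are illustrative stand-ins, not PIXART-α's actual settings.

```python
import numpy as np

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Sample x_t ~ q(x_t | x_0) for a given timestep t."""
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

# Illustrative linear beta schedule (not PIXART-α's actual schedule).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))            # stand-in for a latent image
xT = forward_diffuse(x0, T - 1, alphas_cumprod, rng)
# At t = T-1, alpha-bar is nearly zero, so x_T is close to pure noise;
# generation runs this process in reverse, starting from noise.
```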

The authors used DiT-XL/2 as their base network architecture, combined with a T5 text encoder that extracts conditional features from the input text and a pre-trained variational autoencoder (VAE) taken from the Latent Diffusion Model (LDM). Images are resized and center-cropped to a fixed size before being fed into the VAE.
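The resize-then-center-crop step can be sketched in a few lines of numpy. Here a nearest-neighbor resize and a 256-pixel target are illustrative assumptions; the actual pipeline's resampling filter and resolution depend on the training stage.

```python
import numpy as np

def center_crop(img, size):
    """Crop a size x size window from the center of an H x W x C array."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def preprocess(img, size=256):
    """Resize so the short side equals `size` (nearest neighbor here),
    then center-crop to size x size, mimicking VAE input preparation."""
    h, w = img.shape[:2]
    new_h = round(h * size / min(h, w))
    new_w = round(w * size / min(h, w))
    rows = np.arange(new_h) * h // new_h   # nearest-neighbor source rows
    cols = np.arange(new_w) * w // new_w   # nearest-neighbor source cols
    return center_crop(img[rows][:, cols], size)

img = np.zeros((300, 500, 3), dtype=np.uint8)   # dummy landscape image
out = preprocess(img)                           # -> square 256 x 256 x 3
```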

To adapt the DiT model for text-to-image synthesis, PIXART-α incorporates three core designs:

  1. Training strategy decomposition: the model learns to generate images from text in three steps:
    1. Pixel dependency learning. This step teaches the model to identify the different types of objects and scenes that can appear in images. For example, if the text is “a cat”, the model learns to create pixels that look like a cat.
    2. Text-image alignment learning. This step teaches the model to match the words in the text with the corresponding parts of the image. For example, if the text says “a blue car”, the network should learn to generate an image that has a blue car in it.
    3. High-resolution and aesthetic image generation. This step improves the image’s style and color.
  2. Efficient T2I Transformer: the model uses a modified Diffusion Transformer (DiT) that injects the text conditioning through cross-attention layers, making text-to-image synthesis more efficient.
  3. High-informative data: PIXART-α uses high-quality training data that is rich in concepts. For example, instead of using a simple caption like “a dog”, the model uses a more detailed caption like “a brown dog with a blue collar sitting on a sofa”. The model also uses a large Vision-Language model to create these detailed captions automatically.
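The modulation mechanism behind DiT's adaptive layer norm, which PIXART-α's adaLN-single design streamlines by sharing one time-conditioned MLP across all blocks, can be sketched in numpy. All names and dimensions below are illustrative assumptions, not the model's real shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """LayerNorm over the last dimension, without learnable affine params."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(x, shift, scale):
    """Adaptive layer norm: modulate normalized tokens with
    time-conditioned shift/scale values (the core of DiT's adaLN)."""
    return layer_norm(x) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))        # (sequence, hidden)

# In adaLN-single, one shared MLP maps the time embedding to a single
# global set of shift/scale values; each block only adds a small
# learnable per-layer offset instead of owning its own large MLP.
global_shift = rng.standard_normal(64) * 0.1  # from the shared time MLP
global_scale = rng.standard_normal(64) * 0.1
layer_shift = np.zeros(64)                    # per-layer learnable offset
layer_scale = np.zeros(64)

out = adaln_modulate(tokens,
                     global_shift + layer_shift,
                     global_scale + layer_scale)
```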

Overall, these three core designs work together and enable the PIXART-α model to generate photorealistic images from text prompts much faster than previous diffusion models.

Data construction

The authors used three different datasets for their training process:

  • LAION: This dataset has images of products from shopping websites, such as clothes, shoes, and bags.
  • SAM: This dataset has images of scenes with many objects, such as cars, people, animals, and buildings.
  • Internal: This dataset has images that are high-quality and beautiful, such as landscapes, portraits, and artworks.

As the original captions for these datasets were not very relevant or aesthetic, the authors used LLaVA to generate new captions with higher concept density.

They claimed that using high concept density datasets helped their network learn better text-image alignment and produce more realistic and diverse images.


Performance evaluation

PIXART-α was evaluated using three main assessments and an ablation study:

  • Fidelity Assessment. PIXART-α was compared with other methods for T2I generation, using the FID metric. PIXART-α achieved a low FID score on the COCO dataset, which means high image quality, while using much less training resources than other methods.
  • Alignment Assessment. The researchers evaluated the performance of PIXART-α on T2I-Compbench, a benchmark that measures the compositional T2I generation capability. PIXART-α performed very well on most (5/6) of the evaluation metrics.
  • User study: It involved 50 individuals who ranked the models based on the images they generated from 300 prompts. PIXART-α surpassed the other models, such as DALLE-2, SDv2, SDXL, and DeepFloyd, in both visual quality and text-image alignment.
  • Ablation study: The researchers compared four variants of the model for generating images from text based on their structure, re-parameterization design, and training. They showed that the final design (PIXART-α with the adaptive normalization layers: adaLN-single-L) has the best performance in terms of image quality, FID score, memory consumption, and parameter efficiency.
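For reference, FID measures the Fréchet distance between Gaussian fits of image-feature statistics: FID = ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A minimal numpy sketch on synthetic features (real evaluations use Inception-network activations) could look like this:

```python
import numpy as np

def sqrtm_psd(mat):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def fid(feats1, feats2):
    """Frechet distance between Gaussian fits of two feature sets.
    Uses Tr((S1 S2)^1/2) = Tr((S1^1/2 S2 S1^1/2)^1/2) so only
    symmetric matrix square roots are needed."""
    mu1, mu2 = feats1.mean(0), feats2.mean(0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    s1_half = sqrtm_psd(s1)
    covmean = sqrtm_psd(s1_half @ s2 @ s1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 8))
b = rng.standard_normal((2000, 8)) + 0.5   # shifted distribution
# fid(a, a) is ~0 (identical statistics); fid(a, b) is clearly larger.
```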

PIXART-α can be combined with Dreambooth (see the pictures below).

You can use Dreambooth + PIXART-α to generate images aligned with your text prompts. (source: paper)
You can change the color of a specific object like Wenjie M5 with Dreambooth + PIXART-α. (source: paper)


PIXART-α is a new T2I model that can generate photorealistic images from text descriptions with a fast training time and a significantly lower cost than previous state-of-the-art models.
