Transfusion, a multi-modal model for text and image generation

Transfusion is a multi-modal AI model designed to handle both text and images within a unified framework. It uses a single transformer to process both modalities, combining language modeling for text with diffusion for images.

The model’s unified architecture enables it to generate images and text with quality on par with dedicated diffusion and language models of similar scale. By handling images as patch vectors and text as tokens within a single framework, it reduces the time and computational resources required compared with running a separate model for each modality.

The figure below shows how the transformer processes both text and images within the same framework.

Transformer high-level overview (source: paper)

It employs distinct mechanisms for each modality. Discrete tokens (text) are processed autoregressively and trained with the next-token prediction objective. Continuous vectors (image patches) are processed in parallel and trained with the diffusion objective. Special BOI (Begin of Image) and EOI (End of Image) tokens mark where images start and end within the text stream.
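
As an illustration, a single caption-and-image training sequence could be laid out as follows (the tokens and patch count are hypothetical, shown only to make the structure concrete):

```python
# Illustrative layout of one mixed-modality training sequence (not the actual vocabulary).
sequence = [
    "A", "cat", "on", "a", "mat",  # discrete text tokens, predicted autoregressively (LM loss)
    "<BOI>",                       # Begin of Image: switch to continuous mode
    # ... a fixed number of continuous patch vectors (e.g., 256 latent patches),
    #     trained with the diffusion objective and processed in parallel ...
    "<EOI>",                       # End of Image: switch back to text mode
    "<EOS>",
]
```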

How to use it

The model is open-source and can be accessed here. First, install the package with pip, then import the necessary modules and create a model instance for basic usage. To handle multiple different modalities (e.g., text and images), you can specify multiple latent dimensions in the model configuration.
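
A minimal sketch of that workflow, assuming the community transfusion-pytorch package (the argument names follow its README at the time of writing and may differ in the version you install):

```python
# pip install transfusion-pytorch

from transfusion_pytorch import Transfusion

# A small model handling text tokens plus one continuous modality (image latents).
model = Transfusion(
    num_text_tokens = 256,  # size of the text vocabulary
    dim_latent = 384,       # dimension of the continuous (image) latent vectors
    transformer = dict(
        dim = 512,          # transformer width
        depth = 8,          # number of transformer layers
    ),
)

# For several continuous modalities, pass one latent dimension per modality,
# e.g. dim_latent = (384, 192).
```

Training then typically consists of feeding the model mixed sequences of integer text tokens and continuous latent tensors and backpropagating the returned loss.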

Key features

  1. Unified model: It allows a single model to handle both text (discrete data) and images (continuous data).
  2. Common dataset: The model is trained on a dataset that encompasses both text and images.
  3. Continuous image representation: Images are initially processed in their continuous form and then divided into patch vectors for efficient handling and integration with text data. This approach minimizes information loss while using the model’s unified architecture for both text and images.
  4. Shared parameters: The majority of the model’s parameters are shared between the two modalities.
  5. Separate losses: Distinct loss functions are used for text (language modeling) and images (diffusion).

Baseline methods vs Transfusion

A common way to train models to handle both text and images is to treat images like text. The image is broken down into small pieces and converted into discrete tokens, similar to the words in a sentence. These tokens are then fed into a language model, such as GPT. Models like the original DALL-E and Chameleon are trained in this way.
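
For intuition, this discrete tokenization is typically done with a vector-quantized codebook: each patch embedding is snapped to its nearest codebook entry, which is where detail gets lost. A simplified sketch (the codebook size and dimensions are made up):

```python
import torch

# Hypothetical VQ codebook: 1024 discrete codes, each a 64-dimensional vector.
codebook = torch.randn(1024, 64)

def quantize(patch_embeddings: torch.Tensor) -> torch.Tensor:
    """Map continuous patch embeddings (N, 64) to discrete token ids (N,)."""
    distances = torch.cdist(patch_embeddings, codebook)  # (N, 1024) pairwise distances
    return distances.argmin(dim=-1)                      # nearest codebook entry becomes the token id

tokens = quantize(torch.randn(256, 64))  # 256 image patches -> 256 integer tokens for the language model
```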

Transfusion uses a different approach. Instead of quantizing images into discrete tokens and losing detail in the process, it keeps images in their continuous form and trains the model on a common dataset of interleaved images and text. This lets the model learn from both text and image data at the same time, improving its understanding and generation capabilities for both modalities.

The Model

Transfusion is based on a transformer architecture, the same technology behind GPT-4 and BERT (for natural language processing tasks), combined with the diffusion approach used by image generators like Stable Diffusion. While it largely follows the Chameleon methodology, Transfusion keeps images in their continuous form instead of converting them into discrete tokens.

Main components:

  1. Transformer: A single transformer is used to process the entire sequence regardless of modality (text or image).
  2. Text encoding-decoding: Text tokens are mapped to embedding vectors before entering the transformer, and the transformer's output representations are projected back onto the vocabulary to predict the next token.
  3. Image encoding-decoding: Images are converted between latent representations and fixed-size patch vectors, using a pre-trained VAE together with either simple linear layers or U-Net down/up blocks.
  4. Hybrid attention mechanism: It uses causal attention for text, which prevents future information leakage, and bidirectional attention within each image, allowing patches to attend to each other. This lets the model handle both text and image data effectively within the same framework.

The conversion of images to and from their latent representation is depicted in the next figure.

Images conversion (source: paper)
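
To make the conversion concrete, here is a minimal sketch of the linear-layer variant: the VAE latent is cut into small patches, each flattened and projected to the transformer's dimension (all sizes below are illustrative assumptions, not values from the paper).

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not taken from the paper).
latent_channels, patch_size, model_dim = 8, 2, 512

to_patch = nn.Linear(latent_channels * patch_size * patch_size, model_dim)    # latent patch -> transformer input
from_patch = nn.Linear(model_dim, latent_channels * patch_size * patch_size)  # transformer output -> latent patch

def latent_to_patches(latent: torch.Tensor) -> torch.Tensor:
    """Split a VAE latent (C, H, W) into a sequence of flattened patch vectors."""
    c, h, w = latent.shape
    patches = latent.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch_size * patch_size)
    return to_patch(patches)  # (num_patches, model_dim)

# Example: an 8-channel 32x32 latent becomes a sequence of 16x16 = 256 patch vectors.
latent = torch.randn(latent_channels, 32, 32)
print(latent_to_patches(latent).shape)  # torch.Size([256, 512])
```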

When processing a specific patch, the model can consider information from all other patches within the same image, as they condition on each other.

Transfusion allows patches of the same image to condition on each other (source: paper)
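
A sketch of how such a hybrid attention mask could be constructed (the sequence layout and helper names are illustrative, not taken from the paper's code):

```python
import torch

def hybrid_attention_mask(image_id: torch.Tensor) -> torch.Tensor:
    """Causal mask overall, but bidirectional among patches of the same image.

    image_id: 1D long tensor; 0 for text positions, k > 0 for positions
              belonging to the k-th image in the sequence.
    Returns a boolean (seq, seq) mask where True means "may attend".
    """
    seq = image_id.shape[0]
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))            # standard causal attention
    same_image = (image_id[:, None] == image_id[None, :]) & (image_id[:, None] > 0)
    return causal | same_image                                             # patches of one image see each other

# Example: [text, text, img1 patch, img1 patch, img1 patch, text]
mask = hybrid_attention_mask(torch.tensor([0, 0, 1, 1, 1, 0]))
print(mask.int())
```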

Training

The Transfusion model was trained with two distinct objectives: next-token prediction for text and a denoising diffusion objective for images. It uses a combined loss function that adds the language modeling (LM) loss and the diffusion loss.

Transfusion uses a combined loss function (source: paper)
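
In code, the combined objective is simply a weighted sum of the two losses, with a coefficient λ balancing the diffusion term against the LM term (a simplified sketch, not the paper's implementation):

```python
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam: float = 1.0):
    """Combined objective: LM loss on text tokens plus lambda times the diffusion loss on image patches."""
    lm_loss = F.cross_entropy(text_logits, text_targets)  # next-token prediction on discrete text tokens
    ddpm_loss = F.mse_loss(noise_pred, noise)              # noise-prediction (DDPM) loss on continuous patches
    return lm_loss + lam * ddpm_loss
```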

The model was trained with discrete and continuous data, represented by text (tokenized into a sequence of integers) and images (encoded as latent patches).

During the experiments, 2T tokens were used, comprising 1T text tokens and 3.5B caption-image pairs. Additional data includes 220M publicly available images without people and 80M Shutterstock images with people. The dataset is further enriched with Conceptual 12M (CC12M) images, resulting in 692M image-caption pairs per epoch. All images are resized to 256×256 pixels.

Inference

During inference, the model switches between two modes, Language Modeling (LM) and Diffusion, to generate both text and images. The following diagram illustrates the main steps of the inference stage.

Inference pipeline
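
A high-level sketch of that mode switching (the function and token names below are illustrative placeholders, not the paper's code):

```python
from typing import Callable, Sequence

BOI, EOI, EOS = "<BOI>", "<EOI>", "<EOS>"  # hypothetical special tokens

def generate(sample_next_token: Callable[[list], str],
             sample_noise: Callable[[int], list],
             denoise_step: Callable[[list, int, list], list],
             prompt: Sequence,
             max_len: int = 128,
             num_patches: int = 256,
             diffusion_steps: int = 50) -> list:
    """Alternate between LM decoding and diffusion whenever a BOI token is sampled."""
    seq = list(prompt)
    while len(seq) < max_len:
        token = sample_next_token(seq)        # autoregressive text decoding
        seq.append(token)
        if token == BOI:
            # Diffusion mode: start from pure noise and denoise all patches in
            # parallel, conditioned on everything generated so far.
            patches = sample_noise(num_patches)
            for t in reversed(range(diffusion_steps)):
                patches = denoise_step(patches, t, seq)
            seq.extend(patches)
            seq.append(EOI)                   # switch back to LM mode
        elif token == EOS:
            break
    return seq
```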

Evaluation

Transfusion has demonstrated state-of-the-art performance on a variety of multi-modal benchmarks.

  • For text-based tasks, it matches language models of similar scale in accuracy and fluency when generating coherent text.
  • For image generation, Transfusion’s diffusion-based approach produces highly realistic images, comparable to those generated by leading image generation models.

The next figure showcases some generated images from a 7B Transfusion trained on 2T multi-modal tokens.

Images generated by 7B Transfusion (source: paper)

The table below compares the performance of the largest versions of Transfusion and Chameleon, each with 7B parameters. Both models were trained on the same amount of data, specifically 0.5T tokens.

Performance of Transfusion and Chameleon in a controlled setting (source: paper)

The Parity FLOP Ratio measures how much compute, in FLOPs (floating point operations), Transfusion needs to reach the same result as Chameleon; a ratio of 0.3, for example, means Transfusion matches Chameleon’s score with roughly 30% of the FLOPs. The table highlights Transfusion’s computational efficiency.

Conclusion

Transfusion is a unified multi-modal model that integrates text and images into a single framework, outperforming state-of-the-art models on various benchmarks.

This makes it highly versatile for applications such as content creation, autonomous systems, and other multi-modal tasks.
