Meta AI releases MEGABYTE, a novel AI architecture to predict million-byte sequences without tokenization

Meta AI releases MEGABYTE (Multiscale Encoder-Generator BYte Transformer), a powerful new AI model that can handle sequences of over 1 million bytes across various formats without any tokenization or preprocessing.

The new architecture addresses a key drawback of current Transformer models (the basis of GPT-3 and GPT-4): they become very slow and inefficient when processing long sequences of data.

MEGABYTE solves this problem by splitting long sequences into patches. Additionally, the model decreases the computational expenses by employing per-patch feedforward layers instead of per-position ones.
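
The patch-splitting idea can be illustrated with a minimal sketch. This is not Meta's implementation; the patch size and padding scheme here are arbitrary choices for the example:

```python
# Illustrative sketch (not Meta's code): splitting a byte sequence into
# fixed-size patches, the core idea behind MEGABYTE. Patch size of 8
# and null-byte padding are assumptions for this example.

def split_into_patches(data: bytes, patch_size: int = 8) -> list[bytes]:
    """Split a byte string into fixed-size patches, padding the last one."""
    padded_len = -(-len(data) // patch_size) * patch_size  # round up to a multiple
    data = data.ljust(padded_len, b"\x00")                 # pad with null bytes
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

patches = split_into_patches(b"attention is all you need", patch_size=8)
print(len(patches))   # 4 patches of 8 bytes each
print(patches[0])     # b'attentio'
```

Each patch can then be processed as a single unit by the model's global component, rather than byte by byte.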

This method also boosts the generation speed by processing patches in parallel, unlike conventional transformers that perform computations sequentially during generation.

The paper shows that MEGABYTE outperforms the existing byte-level models across a range of tasks and modalities, such as language modeling, image generation, and audio synthesis.

For example, a 1.5B-parameter MEGABYTE model generates sequences 40% faster than a standard 350M-parameter transformer using the same resources.

The motivation for MEGABYTE

Most current generative models rely on tokenization and struggle with the long byte sequences common in data such as images, music, or video. For example, a single song file might contain 5 million bytes.

These models rely on transformers, which use self-attention methods to process sequential data. Self-attention allows the model to learn how different parts of the data relate to each other and how important they are. For instance, a large transformer decoder can produce a summary of a text by using self-attention to identify the main ideas and details of the text.

Standard self-attention has a quadratic cost: it scales with the square of the input length, since a transformer language model must attend to every other position for each position it processes or generates. The more bytes the data has, the more time and memory the model needs to process them.
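
A back-of-the-envelope calculation makes the scaling concrete. The patched variant below is a simplified accounting, not the paper's exact cost model:

```python
# Rough look at the quadratic cost of self-attention, and how patching
# shrinks it. The accounting is simplified and the numbers illustrative.

def attention_scores(n: int) -> int:
    # Full self-attention compares every position with every other position.
    return n * n

def patched_scores(n: int, p: int) -> int:
    # Simplified MEGABYTE-style split: a global model attends over n/p
    # patches, and a local model attends within each patch of size p.
    num_patches = n // p
    return num_patches * num_patches + num_patches * (p * p)

n = 1_000_000
print(attention_scores(n))      # 1,000,000,000,000 pairwise scores
print(patched_scores(n, 1000))  # roughly a thousand times fewer
```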

The architecture

As shown in the image below, MEGABYTE has a local model and a global model.

MEGABYTE’s architecture

The local model (which is smaller) generates each patch byte-by-byte, using the output of the global model (which is larger) as a guide.

The global model captures the overall pattern of the sequence, while the local model fills in the fine details of each segment.

The model’s pipeline:

  1. Byte embedding: The input sequence of bytes is embedded into a vector representation. 
  2. Patch segmentation: The embedded sequence is split into patches of fixed size, which are similar to tokens.
  3. Global transformer: A large decoder-only transformer generates a contextualized representation for each patch by attending to the previous patches in the sequence.
  4. Local transformer: A smaller decoder-only transformer generates the next patch by attending to the bytes within the current patch and the contextualized representation from the global transformer.
  5. Patch generation: The output patch is produced from the local transformer's byte-level predictions.
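
The five steps above can be sketched as a toy, pure-Python data flow. The "transformers" here are stand-in stub functions, not real attention; all dimensions and the embedding scheme are illustrative assumptions:

```python
# Toy sketch of MEGABYTE's pipeline: bytes -> embeddings -> patches ->
# global context -> local per-patch prediction. Stubs, not real models.

EMBED_DIM = 4
PATCH = 4

def embed(byte_seq):
    # Step 1 (stub): map each byte to a toy embedding vector.
    return [[b / 255.0] * EMBED_DIM for b in byte_seq]

def segment(embeddings, patch=PATCH):
    # Step 2: group embeddings into fixed-size patches.
    return [embeddings[i:i + patch] for i in range(0, len(embeddings), patch)]

def global_model(patches):
    # Step 3 (stub): one context vector per patch; a real transformer
    # would attend over all previous patches here.
    return [[sum(dim) / len(patch) for dim in zip(*patch)] for patch in patches]

def local_model(context, patch):
    # Steps 4-5 (stub): predict the bytes of one patch conditioned on
    # its global context; a real local model decodes byte by byte.
    return [round(sum(vec) / len(vec) * 255) for vec in patch]

seq = bytes(range(16))
patches = segment(embed(seq))
contexts = global_model(patches)
out = [local_model(c, p) for c, p in zip(contexts, patches)]
print(len(patches), len(contexts))  # 4 patches, 4 context vectors
```

Because each patch's local decoding depends only on its own context vector, the per-patch calls in the final list comprehension could run in parallel, which is the source of MEGABYTE's generation speedup.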

Training & evaluation

MEGABYTE was trained on ImageNet 64×64 and five other datasets, with the Global and Local models sized at 2.7B and 350M parameters, respectively, trained on 1.4T tokens.

The team used the Metaseq code base, the PyTorch framework, fairscale (which shards training state to reduce memory usage), and mixed-precision training to speed up training and lower memory consumption.

They tested the model on image generation using ImageNet 64×64, and compared it with a standard decoder-only transformer and the PerceiverAR model.

The results showed that MEGABYTE produces high-quality, diverse outputs that are comparable to or better than existing models, while generating them faster and more cheaply.

Key advantages of the MEGABYTE model:

Compared to Transformers, the MEGABYTE architecture has three major improvements in modeling long sequences:

  1. It reduces self-attention costs by breaking long sequences into shorter patches; with the optimal patch size, attention cost drops from quadratic to sub-quadratic.
  2. It reduces computational costs by using per-patch instead of per-position feedforward layers. In large models like GPT-3, per-position feedforward layers account for the bulk of compute (over 98% of FLOPS).
  3. It improves the generation speed by computing the patches simultaneously.
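
The second advantage is simple arithmetic. The figures below are assumed toy numbers (not from the paper), but they show why running the feedforward layer once per patch rather than once per position divides that cost by the patch size:

```python
# Rough arithmetic (toy numbers, not from the paper) for the savings
# from per-patch feedforward layers: the layer runs once per position
# in a standard transformer, but once per patch in MEGABYTE.

SEQ_LEN = 1_000_000   # bytes in the sequence
PATCH = 8             # bytes per patch (illustrative)
FF_FLOPS = 1e9        # cost of one feedforward pass (illustrative)

per_position = SEQ_LEN * FF_FLOPS
per_patch = (SEQ_LEN // PATCH) * FF_FLOPS

print(per_position / per_patch)  # 8.0 -- an 8x reduction at patch size 8
```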

Conclusion

MEGABYTE is a breakthrough in generative modeling: a new multiscale decoder architecture that models long sequences efficiently.

It splits sequences into patches and can model sequences of over one million bytes without tokenization, opening up new possibilities for creating and manipulating long sequences of data.

Its architecture has some benefits over existing approaches, such as lower memory usage, higher parallelism, and better scalability.

Future work should explore scaling MEGABYTE to even larger models and datasets.
