Microsoft’s Differential Transformer, a new architecture for LLMs

Researchers from Microsoft and Tsinghua University have introduced the Differential Transformer, an improvement on the standard Transformer architecture for LLMs that cuts attention noise and amplifies attention to the relevant context.

The key innovation is a differential attention mechanism that cancels out noise and focuses attention on the most important information. The model is open source and available on GitHub as part of the “microsoft/unilm” repository (MIT license).

The experimental results show that the Differential Transformer (a.k.a. DIFF Transformer) outperforms the standard Transformer on various downstream tasks and scales more favorably with model size and training data. For instance, it requires only about 65% of the model size or training tokens needed by a standard Transformer to achieve comparable language modeling performance.

Moreover, it integrates FlashAttention, which optimizes memory usage and computational speed. FlashAttention reorganizes the attention computation so that intermediate results stay in the GPU’s fast on-chip SRAM, minimizing reads and writes to the much slower high-bandwidth memory (HBM). This results in a running speed 2-4 times faster than the standard attention implementation while requiring only 5%-20% of the memory.
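As a rough illustration of how such a fused attention kernel is typically invoked (this is not the DIFF Transformer’s actual integration), PyTorch exposes scaled_dot_product_attention, which can dispatch to a FlashAttention backend on supported GPUs:

```python
import torch
import torch.nn.functional as F

# Illustrative only: PyTorch's fused attention can dispatch to a FlashAttention
# kernel on supported GPUs, avoiding materializing the full seq_len x seq_len
# attention matrix in slow HBM.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal attention, as used in decoder-only LLMs.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```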

Traditional Transformers often pay too much attention to unimportant context

The decoder-only Transformer is currently the standard architecture for LLMs. Its core component is the attention mechanism, which uses the softmax function to assign each token in a sequence an attention score reflecting its importance.

The attention function in a standard Transformer (source: “Attention Is All You Need” paper)

Each query (Q) is compared with all the keys (K), the results are scaled, and a softmax function is applied to obtain the weights. The softmax function assigns normalized weights to each token, determining how much attention each token receives. The parameter d_k is the dimension of the query and key vectors in the input matrices Q and K.
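For reference, the attention function shown in the figure is the standard scaled dot-product attention:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V \]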

The next figure illustrates the performance of the standard Transformer and DIFF Transformer in retrieving an answer from a large set of documents.

Transformer often over-attends to irrelevant context (source: paper)
  • Left side: The standard Transformer assigns normalized attention scores to different parts of the context. It gives only a small amount of attention (0.03) to the correct answer, while focusing more on irrelevant context.
  • Middle: DIFF Transformer assigns much higher attention scores to the correct answer and much lower scores to irrelevant context compared to the standard Transformer. It increases attention to the correct answer from 0.03 to 0.31 and reduces attention to irrelevant information from 0.18 and 0.34 to just 0.01.
  • Right side: DIFF Transformer allocates 85% of its attention to relevant tokens, compared to only 55% in the standard Transformer.

The figure indicates that standard Transformers frequently allocate too much attention to irrelevant context, creating attention noise. This is particularly problematic when dealing with long sequences and tasks that require precise information retrieval.

DIFF Transformer eliminates the attention noise

DIFF Transformer eliminates attention noise through a differential attention mechanism. The method is similar to noise cancellation techniques used in headphones and electrical engineering, where the difference between two signals is used to eliminate common-mode noise.

The model generates two distinct softmax attention maps instead of the single map used in a standard Transformer. It projects the input into two separate pairs of query (Q) and key (K) vectors, each of which yields its own softmax attention map. One map is then subtracted from the other to produce the final attention scores. The subtraction cancels the noise that is common to both maps while preserving the attention paid to the most relevant input.

The differential attention operator DiffAttn(X) computes a weighted sum of value vectors (source: paper)
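Written out (following the paper’s description; λ is a learnable scalar that the paper reparameterizes through additional learnable vectors), the operator is roughly:

\[ [Q_1; Q_2] = X W^{Q}, \qquad [K_1; K_2] = X W^{K}, \qquad V = X W^{V} \]
\[ \mathrm{DiffAttn}(X) = \left( \mathrm{softmax}\!\left(\frac{Q_1 K_1^{\top}}{\sqrt{d}}\right) - \lambda \, \mathrm{softmax}\!\left(\frac{Q_2 K_2^{\top}}{\sqrt{d}}\right) \right) V \]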

The image below illustrates multi-head differential attention, where each head subtracts one softmax attention map from another to eliminate attention noise.

Multi-head differential attention. Each head takes the difference between two softmax attention maps to cancel out attention noise (source: paper)
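A minimal PyTorch sketch of a single differential attention head is given below. It is an illustration of the idea rather than the official implementation in microsoft/unilm: the class and parameter names here are made up, λ is kept as a single learnable scalar instead of the paper’s reparameterization, and the causal masking and per-head normalization used in the paper are omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    """Illustrative single head of differential attention (not the official code).

    Two query/key projections produce two softmax attention maps; their
    difference, weighted by a learnable scalar lambda, attends over shared values.
    """
    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)  # Q1 and Q2
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)  # K1 and K2
        self.v_proj = nn.Linear(d_model, d_head, bias=False)      # shared V
        self.lam = nn.Parameter(torch.tensor(lambda_init))        # simplified lambda
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = 1.0 / math.sqrt(self.d_head)

        # Two softmax attention maps over the same sequence.
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)

        # Differential attention: subtract the second map to cancel common noise.
        return (a1 - self.lam * a2) @ v


x = torch.randn(2, 16, 64)            # (batch, seq_len, d_model)
head = DiffAttentionHead(d_model=64, d_head=32)
print(head(x).shape)                  # torch.Size([2, 16, 32])
```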

This way, the differential approach helps the model concentrate on the most relevant parts of the input. This improves its ability to handle long-context dependencies, leading to better performance on tasks that require precise information retrieval and in-context learning.

DIFF Transformer’s evaluation

The proposed architecture was evaluated and compared with the standard Transformer on various downstream tasks, including long-sequence modeling, key information retrieval, and in-context learning.

The experimental results show that DIFF Transformer can achieve comparable performance to the standard Transformer while using 38% fewer parameters and 36% fewer tokens (see the picture below).

Language modeling loss of scaling up parameter count and training tokens (source: paper)

DIFF Transformer consistently outperforms the standard Transformer across different model sizes: 830M, 1.4B, 2.8B, 6.8B, and 13.1B. Notably, a 6.8B-size DIFF Transformer achieves the same performance as an 11B-size standard Transformer, using only 62.2% of the parameters. Similarly, a 7.8B-size DIFF Transformer matches the performance of a 13.1B-size Transformer, using only 59.5% of the parameters.

To demonstrate the model’s ability to learn from multiple examples (many-shot in-context learning), the researchers used four different datasets (see the picture below). Initially, the model was given just one example (1-shot), and then more examples were added until their combined length reached 64K tokens. The dashed lines on the graph show the average accuracy once the model has seen enough examples for its performance to stabilize, which indicates how well it can learn and generalize from a large number of examples.

Many-shot in-context learning accuracy on four datasets (source: paper)

Conclusion

Attention noise is an important issue in Transformers, reducing both efficiency and accuracy. DIFF Transformer is a new open-source model that addresses this issue by calculating the attention scores as the difference between two separate softmax attention maps. This subtraction cancels out noise in the attention mechanism and thus helps the model focus on relevant information.

Moreover, DIFF Transformer’s compatibility with FlashAttention increases its efficiency by optimizing memory usage and computational speed. This makes it a promising tool for developing LLMs that use resources more effectively and reduce attention noise, leading to better overall performance.
