LLaMA-Adapter with zero-init attention for efficient fine-tuning of language models

LLaMA-Adapter is a lightweight technique for fine-tuning LLaMA (a pre-trained language model) so that it follows instructions efficiently while preserving its pre-trained knowledge.

Despite training only 1.2 million parameters, compared with the 7 billion parameters that the Alpaca model fully fine-tunes, LLaMA-Adapter achieves similar instruction-following performance.

The LLaMA-Adapter approach

LLaMA-Adapter vs Alpaca

Both LLaMA-Adapter and Alpaca aim to turn a pre-trained large language model into an instruction-following model, but they fine-tune that model in very different ways.

  • LLaMA-Adapter uses learnable adaptation prompts and zero-init attention to achieve faster, more targeted fine-tuning with far fewer trainable parameters. It is a lightweight approach with small computational and time costs.
  • Alpaca instead fully fine-tunes all the weights of the pre-trained LLaMA 7B model on instruction-following data. Full fine-tuning is flexible, but it requires much more computational resources and training time.

The model 

The main idea behind LLaMA-Adapter is to use a set of learnable adaptation prompts together with zero-init attention to fine-tune the pre-trained LLaMA model on task-specific data more efficiently and effectively.

The technique prepends lightweight, learnable adaptation prompts to the tokens at a subset of the topmost transformer layers of the LLaMA model, allowing the prompts to progressively learn new instructional cues without disturbing the original pre-trained knowledge.

The zero-init attention mechanism is used to prevent interference between the pre-existing knowledge in the network and the task-specific information required for fine-tuning.
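
The gating idea can be sketched in a few lines of PyTorch. The snippet below is a simplified illustration rather than the authors' implementation: the class and parameter names (ZeroInitPromptAttention, n_prompts, d_model) are made up, and causal masking and rotary embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroInitPromptAttention(nn.Module):
    """Self-attention with learnable adaptation prompts whose contribution
    is scaled by a zero-initialized gating factor (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int, n_prompts: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Learnable adaptation prompts, shared across the batch.
        self.prompts = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        # Gating factor initialized to zero: at the start of training the
        # prompts contribute nothing, so pre-trained knowledge is untouched.
        self.gate = nn.Parameter(torch.zeros(n_heads))
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)     # (b, k, d)
        q, k_tok, v_tok = self.qkv(x).chunk(3, dim=-1)
        _, k_pr, v_pr = self.qkv(prompts).chunk(3, dim=-1)

        def heads(z):  # (b, n, d) -> (b, n_heads, n, d_head)
            return z.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        q, k_tok, v_tok, k_pr, v_pr = map(heads, (q, k_tok, v_tok, k_pr, v_pr))
        scale = self.d_head ** -0.5

        # Attention scores over ordinary tokens and over the adaptation prompts.
        s_tok = (q @ k_tok.transpose(-2, -1)) * scale             # (b, h, t, t)
        s_pr = (q @ k_pr.transpose(-2, -1)) * scale               # (b, h, t, k)

        # Softmax the two groups separately; the prompt part is multiplied by
        # the zero-initialized gate, so it is only gradually "switched on"
        # during fine-tuning. (Causal masking is omitted for brevity.)
        attn_tok = F.softmax(s_tok, dim=-1)
        attn_pr = F.softmax(s_pr, dim=-1) * self.gate.view(1, -1, 1, 1)

        out = attn_tok @ v_tok + attn_pr @ v_pr                   # (b, h, t, d_head)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.proj(out)
```

As a quick sanity check, ZeroInitPromptAttention(512, 8, 10)(torch.randn(2, 16, 512)) returns a tensor of shape (2, 16, 512), and because the gate starts at zero, the layer initially behaves like plain self-attention over the input tokens.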

Figure: loss curves of LLaMA-Adapter with and without zero-init attention

These loss curves show that zero-init attention converges faster and reaches a lower final loss than randomly initializing the attention over the adaptation prompts.

The training set consists of the 52K instruction-output pairs used to train the Alpaca model. The adaptation prompts learn the new instructional cues, while the zero-init attention mechanism with zero gating preserves LLaMA's pre-trained knowledge.
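
As a concrete illustration, one instruction-output pair from such a dataset might be rendered into a training prompt along the lines of the snippet below; the example pair and the template wording are assumptions borrowed from the public Alpaca recipe, not something specified here.

```python
# One hypothetical instruction-output pair and a prompt template in the
# style of the public Alpaca recipe (exact wording is an assumption).
example = {
    "instruction": "Give three tips for staying healthy.",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

# The model is trained to continue the prompt with the target output.
prompt = PROMPT_TEMPLATE.format(instruction=example["instruction"])
target = example["output"]
print(prompt + target)
```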

The method introduces only 1.2M learnable parameters on top of the frozen LLaMA 7B model, and fine-tuning takes less than one hour on 8 A100 GPUs.
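
In practice, this parameter efficiency comes from freezing the 7B backbone and training only the prompt and gate tensors. Below is a minimal sketch, assuming the adapter parameters can be identified by name; the helper names are hypothetical.

```python
import torch


def mark_adapter_trainable(model: torch.nn.Module,
                           adapter_keywords=("prompts", "gate")) -> None:
    """Freeze everything except parameters whose names look like adapter
    parameters (keyword matching is a simplification)."""
    for name, param in model.named_parameters():
        param.requires_grad = any(key in name for key in adapter_keywords)


def count_trainable(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Usage sketch, assuming `llama` is a loaded LLaMA-7B whose top layers use a
# prompt/gate module such as the one sketched earlier:
# mark_adapter_trainable(llama)
# optimizer = torch.optim.AdamW(
#     [p for p in llama.parameters() if p.requires_grad], lr=1e-3
# )
# print(f"trainable parameters: {count_trainable(llama):,}")  # ~1.2M expected
```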

The authors also show that LLaMA-Adapter can be extended to multi-modal input, such as images, to improve reasoning capacity on tasks such as ScienceQA (a benchmark dataset for scientific question answering).
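
A rough sketch of that multi-modal extension is given below, under the assumption that global image features from a frozen visual encoder are linearly projected to the model width and added to the adaptation prompts; the module name and dimensions are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn


class VisualPromptInjector(nn.Module):
    """Projects image features to the language model's hidden size and adds
    them to the learnable adaptation prompts (simplified, hypothetical)."""

    def __init__(self, image_feat_dim: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(image_feat_dim, d_model)

    def forward(self, prompts: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # prompts: (n_prompts, d_model); image_feats: (batch, image_feat_dim)
        visual = self.proj(image_feats).unsqueeze(1)   # (batch, 1, d_model)
        # Broadcast-add the visual signal to every prompt token.
        return prompts.unsqueeze(0) + visual           # (batch, n_prompts, d_model)
```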

Model evaluation

During the evaluation, the research team compared LLaMA-Adapter with other representative instruction-following models, including Alpaca, Alpaca-LoRA, and GPT-3. LLaMA-Adapter was also compared with LLaMA-I, which is LLaMA-65B fine-tuned on large-scale instructional data.

Results indicate that LLaMA-Adapter can generate reasonable responses comparable to fully fine-tuned models, demonstrating the effectiveness of adapters with zero-init attention.

By including image tokens in adaptation prompts, LLaMA-Adapter achieves competitive performance on the ScienceQA benchmark.

Main achievements of the model

  1. LLaMA-Adapter trains only 1.2 million parameters, the adaptation prompts and gating factors, rather than updating the entire set of 7 billion parameters in the pre-trained LLaMA model. Despite this far smaller trainable footprint, it performs comparably to the Alpaca model, which fully fine-tunes all 7B parameters.
  2. It can be fine-tuned in less than an hour on eight A100 GPUs, roughly three times faster than Alpaca, thanks to its lightweight learnable parameters and zero-init attention.
  3. It can handle image input for multi-modal reasoning.

Future research

In future work, LLaMA-Adapter could be enhanced by adopting larger LLaMA models, using more training data, and scaling up its learnable parameters.

The authors also plan to integrate wider multi-modal inputs, such as audio and video, and conduct experiments on diverse benchmarks.
