MoRA, a high-rank strategy for enhanced fine-tuning of LLMs

MoRA (Model Rank Adaptation) is a new method designed to improve the fine-tuning process of large language models. It maintains the same number of trainable parameters as LoRA (Low-Rank Adaptation), but uses high-rank updates to capture task-specific patterns more effectively.

A high-rank update is a weight change whose matrix rank is large, so it can alter the model along many independent directions at once, rather than the few directions a low-rank update is limited to.

This approach offers multiple benefits compared to the traditional fine-tuning techniques:

  1. Faster fine-tuning
  2. Reduced resource consumption
  3. Scalability and flexibility

The method was developed by a research team from Beihang University and Microsoft Corporation. The repository shows how to set it up for your project. MoRA is implemented in the apply_mora and get_delta_weight functions.

Fine-tuning efficiency increase from LoRA to MoRA

Fine-tuning is a powerful technique for adapting pre-trained models to specific tasks. It takes a model that has already been trained on a large dataset (pre-trained) and continues training it on a new dataset specific to a particular task. However, full fine-tuning adjusts all of the model's parameters, which can be computationally intensive.

Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, address this issue by updating only a few parameters. LoRA keeps the main model unchanged and learns the update as the product of two small matrices (low-rank matrices). This makes fine-tuning quicker and cheaper. However, the rank of such an update can never exceed the small inner dimension of those matrices, which can be limiting in specialized tasks that require more substantial changes to the model’s knowledge base.
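To make the rank limit concrete, here is a minimal sketch in plain Python. The matrix sizes, the values, and the helper names (matmul, matrix_rank) are made up for illustration and are not taken from the LoRA or MoRA code. It builds a LoRA-style update from two small matrices B and A and checks that the rank of their product is capped by the inner dimension r.

```python
from fractions import Fraction


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(m):
    """Rank via Gaussian elimination on exact fractions (hypothetical helper)."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        # Find a row at or below the current pivot position with a non-zero entry.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


# Toy sizes: a 4 x 4 weight matrix adapted with inner dimension r = 1.
d, k, r = 4, 4, 1
B = [[1], [2], [3], [4]]  # d x r
A = [[1, 0, 2, 1]]        # r x k

delta_w = matmul(B, A)            # the d x k update added to the frozen weights
lora_rank = matrix_rank(delta_w)  # cannot exceed r
lora_params = d * r + r * k       # trainable parameters in B and A

print(lora_rank, lora_params)
```

However the values in B and A are trained, the product stays within rank r. That cap is the limitation the high-rank approach targets.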

MoRA advances the concept of PEFT by using a square matrix for high-rank updates. This approach allows it to capture more complex patterns and dependencies, which can be particularly suitable for tasks that demand significant modifications to the pre-trained model’s knowledge base.

In summary, both LoRA and MoRA are valuable techniques in the field of machine learning for fine-tuning LLMs, with MoRA providing a more robust method for high-rank updating that can be beneficial for more complex tasks.

The method

MoRA aims to make updates more effective than LoRA, without increasing the number of trainable parameters. Instead of using two small matrices like LoRA does (A and B), MoRA uses one big square matrix (M) for its updates, allowing for higher-rank adjustments.

An overview of MoRA compared to LoRA under the same number of trainable parameters (source: paper)
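The "same number of trainable parameters" constraint can be checked with a little arithmetic. The sketch below assumes the sizing rule described in the paper, where the side of the square matrix is the integer part of the square root of (d + k) * r for a d x k weight matrix and LoRA rank r. The function names are hypothetical.

```python
import math


def lora_param_count(d, k, r):
    """Trainable parameters in LoRA's two matrices: B (d x r) and A (r x k)."""
    return d * r + r * k


def mora_square_size(d, k, r):
    """Side length of a square matrix that fits in the same parameter budget."""
    return math.isqrt((d + k) * r)


d = k = 4096  # example hidden size
r = 8         # LoRA rank

budget = lora_param_count(d, k, r)
r_hat = mora_square_size(d, k, r)

print(budget)         # parameters used by LoRA
print(r_hat)          # side length of the square matrix M
print(r_hat * r_hat)  # parameters used by M
```

With these example sizes, a rank-8 LoRA adapter and a 256 x 256 square matrix use the same 65,536 parameters, but the square matrix can reach rank 256 instead of 8.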

Since the dimensions of this matrix differ from those of the original model’s weight matrices, direct matrix multiplication with the original weights is not possible. To solve the problem, the researchers have come up with special functions that can compress and decompress data. These functions change the size of the inputs so they fit with the MoRA matrix M. After the updates are made, they then expand the outputs back to their original size.
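The compress, update, decompress flow can be sketched in a few lines of plain Python. This is a simplified stand-in, not the authors' implementation: the paper describes several compression and decompression variants, and the grouping rule used here (summing input entries into slots, then repeating output entries) is an assumption chosen for brevity. All function names are hypothetical.

```python
def compress(x, r_hat):
    """Fold a length-k input into r_hat slots by summing entries i, i + r_hat, ..."""
    out = [0.0] * r_hat
    for i, v in enumerate(x):
        out[i % r_hat] += v
    return out


def decompress(y, d):
    """Expand a length-r_hat vector to length d by repeating its entries."""
    return [y[i % len(y)] for i in range(d)]


def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def forward(x, W, M):
    """Frozen path W x plus the adapter path decompress(M compress(x))."""
    base = matvec(W, x)
    update = decompress(matvec(M, compress(x, len(M))), len(W))
    return [b + u for b, u in zip(base, update)]


k, d, r_hat = 6, 4, 2
# Frozen d x k weight matrix (toy values: ones on the diagonal).
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Assumption of this sketch: M starts at zero, so the adapter is a no-op at first.
M = [[0.0] * r_hat for _ in range(r_hat)]
y_before = forward(x, W, M)

# Pretend training moved M away from zero.
M = [[1.0, 0.0], [0.0, 2.0]]
y_after = forward(x, W, M)

print(y_before)
print(y_after)
```

Only M is trainable in this sketch. W is never modified, and the compress and decompress steps carry no parameters of their own.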

This flexibility allows MoRA to be easily integrated into various LLMs, regardless of their size and architecture, making it a versatile tool for fine-tuning.

MoRA integrates into the fine-tuning process in the following way:

  1. Identify the most critical parameters for the new task.
  2. Decompose the parameter matrices into high-rank components, allowing for more nuanced updates.
  3. Update the high-rank components during training while the rest of the model stays unchanged, to maximize the efficiency of the fine-tuning process.
  4. Recompose the high-rank components back into the model. This step ensures that the modifications are fully integrated into the model.

It’s important to notice that MoRA maintains the same number of trainable parameters as LoRA, but its high-rank updating mechanism allows for a more robust adaptation of the model.

Evaluation

MoRA was evaluated on various tasks and compared against existing methods such as LoRA, ReLoRA, and full fine-tuning (FFT). The evaluation was structured across 3 key areas: (I) memorizing UUID pairs, (II) fine-tuning tasks, and (III) pretraining.

(I) Memorizing UUID pairs: A UUID (universally unique identifier) is a 128-bit number used to uniquely identify items in computer systems. In the experiment, UUIDs are used as key-value pairs to test the memorization capabilities of various methods. At the same rank, MoRA needed fewer training steps than LoRA to memorize the UUID pairs. Additionally, with rank 256, MoRA matched FFT: both memorized all UUID pairs within 500 training steps.

Method  Rank  300   500   700   900
FFT     -     42.5  100   100   100
LoRA    8     9.9   10.0  10.7  54.2
MoRA    8     10.1  15.7  87.4  100
LoRA    256   9.9   70.6  100   100
MoRA    256   41.6  100   100   100
Character-level accuracy of memorizing UUID pairs by generating the value of corresponding key in 300, 500, 700 and 900 training steps (source: paper)
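For readers who want to picture the task, the sketch below generates random UUID key-value pairs and scores a generated value character by character. It is an illustration only: the paper's exact data format and scoring script are not reproduced here, and the helper names make_pairs and char_accuracy are hypothetical.

```python
import uuid


def make_pairs(n):
    """Create n random key-value pairs where both key and value are UUID strings."""
    return {str(uuid.uuid4()): str(uuid.uuid4()) for _ in range(n)}


def char_accuracy(predicted, target):
    """Percentage of target characters reproduced at the same position."""
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)


pairs = make_pairs(5)
key, value = next(iter(pairs.items()))

# A perfect recall of the value, and one with the last four characters wrong.
perfect = char_accuracy(value, value)
partial = char_accuracy(value[:-4] + "zzzz", value)

print(key, "->", value)
print(perfect, partial)
```

Because the values are random, a model cannot guess them from patterns in its pre-training data, which is what makes the task a test of pure memorization.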

(II) Fine-tuning tasks: MoRA was evaluated on three fine-tuning tasks for LLMs: instruction tuning, mathematical reasoning, and continual pretraining. The table below shows that MoRA performed on par with LoRA in instruction tuning and mathematical reasoning tasks. Notably, MoRA outperformed LoRA in continual pretraining on the biomedical and financial domains. This suggests that high-rank updating improves the model's ability to memorize new knowledge.

Performance of MoRA, FFT, LoRA, and LoRA variants on instruction tuning, mathematical reasoning and continual pretraining tasks (source: paper)

(III) Pretraining: In the pretraining evaluation, the transformer models were trained from scratch on the C4 dataset using LLaMA-based models to assess the impact of high-rank updating. MoRA exhibited superior pretraining performance in comparison to LoRA and ReLoRA. The introduction of ReMoRA, which involves merging the matrix M back into the original parameters during training, resulted in even further enhancements.
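The merge step can be sketched as follows. This is an assumption-heavy illustration: it uses a simple index-sharing rule to expand M to the shape of the weight matrix, and it resets M to zero after merging, in the style of ReLoRA. Neither detail is taken from the authors' code, and the function names are hypothetical.

```python
def delta_weight(M, d, k):
    """Expand the square matrix M to a d x k update by sharing entries across indices."""
    r_hat = len(M)
    return [[M[i % r_hat][j % r_hat] for j in range(k)] for i in range(d)]


def merge_and_reset(W, M):
    """Fold the current update into W and hand back a zeroed M for further training."""
    d, k = len(W), len(W[0])
    update = delta_weight(M, d, k)
    merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, update)]
    fresh_M = [[0.0] * len(M) for _ in M]
    return merged, fresh_M


W = [[0.0] * 4 for _ in range(3)]  # toy 3 x 4 weight matrix
M = [[1.0, 2.0], [3.0, 4.0]]       # pretend this was learned so far

W_merged, M_reset = merge_and_reset(W, M)

print(W_merged)
print(M_reset)
```

Each merge moves what M has learned into the base weights, so the next round of training with a fresh M can add a further update on top.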

Conclusion

MoRA is a new approach to fine-tuning LLMs by introducing high-rank updates. These updates are more effective at capturing complex, task-specific patterns than the low-rank updates used by previous methods like LoRA.
