Tencent released Hunyuan-Large, the largest open-source Transformer-based Mixture of Experts model to date, with 389B total parameters and 52B activated parameters.
Mixture of Experts (MoE) is an advanced neural network architecture that incorporates specialized sub-models, known as experts, into specific layers of a larger model. Each expert is trained on a distinct subset of the data, thereby becoming an expert in that particular area.
Unlike traditional neural networks that activate all their parameters for every computation, MoE models dynamically select and activate a small subset of these experts for each input. A gating network decides which experts are most relevant for a given task, and the final output is a weighted combination of the outputs from the selected experts.
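To make the routing idea concrete, here is a minimal, self-contained sketch of an MoE layer with top-k gating in PyTorch. The expert count, hidden sizes, and value of k are illustrative defaults, not Hunyuan-Large’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Minimal Mixture-of-Experts layer: a gating network scores the experts,
    the top-k experts process each token, and their outputs are mixed by
    the gate weights."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=16, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)              # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx, w = topk_idx[:, slot], topk_scores[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e                                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] = out[mask] + w[mask] * expert(x[mask])
        return out

tokens = torch.randn(10, 512)       # 10 token embeddings
print(MoELayer()(tokens).shape)     # torch.Size([10, 512])
```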
Hunyuan-Large includes 1 shared expert for the common knowledge required by all tokens and 16 specialized experts that are dynamically allocated for domain-specific knowledge. The model supports a relatively large context window of 256K tokens and offers specialized LLM capabilities such as mathematics, coding, multi-turn dialogue, and multilingual support, along with standard NLP functionalities like question answering, reasoning, and reading comprehension. In comparison, OpenAI’s GPT-4o supports a maximum context window of 128K tokens.
You can check out the model on Hugging Face and explore the code on GitHub.
According to the evaluation results, Hunyuan-Large significantly outperforms Llama 3.1-70B across various benchmarks and matches the performance of the much larger Llama 3.1-405B model.
Hunyuan-Large’s performance is driven by the following technical innovations:
- Large-scale synthetic data: The model is pre-trained on 7T tokens, nearly 1.5T of which are high-quality synthetic data.
- Mixed routing strategy: Expert routing combines shared and specialized experts and uses a “recycle routing” mechanism for token allocation.
- Key-Value cache compression technique: This technique optimizes the model’s memory usage by compressing the key-value cache. This leads to faster inference times and lower memory requirements.
- Expert-specific learning rate scaling: Each expert has its own learning rate, improving the overall training process.
Tencent’s AI chatbot, Yuanbao, which was officially launched on May 30, 2024, is also powered by the Hunyuan-Large model. The chatbot handles diverse tasks such as practicing English, translating everyday scenarios, and generating avatars.
Most open-source language models, such as Llama, Mistral, Qwen, and DeepSeek, are primarily built using dense architectures. They process information through a single, large neural network. In contrast, MoE models distribute the tasks among multiple smaller expert networks. This allows for greater flexibility and scalability, as the model can dynamically allocate computational resources to the most relevant experts for a given input.
However, MoE models activate only a small subset of their total parameters for any given task. For instance, Switch Transformers, an MoE model developed by Google in 2021, has a total of 1.6T parameters, but only about 2.6B are active during each forward pass, because the top-k routing mechanism activates only a subset of experts for each token. While this approach reduces computational costs, it may also limit the model’s capacity to learn complex patterns and generate detailed responses.
Hunyuan-Large overcomes this challenge by activating a significantly larger number of parameters (52B) during each forward pass.
How to use
Access: The model’s code and weights are available in the GitHub repository and on Hugging Face. Both platforms provide detailed instructions on how to clone the repository, install dependencies, load the model, and run it.
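As a rough illustration, loading the checkpoint through the Hugging Face transformers library could look like the snippet below. The repository id is an assumption, so check the official Hugging Face page for the exact name, and note that a 389B-parameter model requires multiple high-memory GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repository id -- check the official Hugging Face page for the exact name.
model_id = "tencent/Tencent-Hunyuan-Large"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # load in half precision to reduce memory
    device_map="auto",            # shard the weights across all available GPUs
    trust_remote_code=True,       # needed if the repo ships custom modeling code
)

prompt = "Explain Mixture of Experts in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```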
License: Hunyuan-Large is available under the Tencent Hunyuan Community License. Developers can use it for free if their products have fewer than 100M monthly active users. If the number of users exceeds 100M per month, additional licensing from Tencent is required. The license is valid worldwide, except in the European Union, where usage is restricted.
The model
The model includes 1 shared expert and 16 specialized experts and was developed in 2 distinct stages: pre-training and post-training.
During training, only a subset of experts is activated to process a given input. Specifically, 1 shared expert and 1 specialized expert are activated for each token. This selective activation mechanism significantly reduces computational overhead while maintaining model performance.
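A minimal sketch of this “1 shared + 1 specialized” activation pattern is shown below; layer sizes are illustrative, and the paper’s exact gating formulation may differ.

```python
import torch
import torch.nn as nn

class SharedPlusSpecializedMoE(nn.Module):
    """Sketch of Hunyuan-Large-style activation: every token passes through the
    shared expert, plus the single highest-scoring specialized expert."""

    def __init__(self, d_model=512, d_ff=2048, num_specialized=16):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.shared_expert = ffn()                                   # always active
        self.specialized = nn.ModuleList(ffn() for _ in range(num_specialized))
        self.gate = nn.Linear(d_model, num_specialized)              # scores the specialized experts

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        best = scores.argmax(dim=-1)                                 # top-1 specialized expert per token
        spec_out = torch.zeros_like(x)
        for e, expert in enumerate(self.specialized):
            mask = best == e
            if mask.any():
                spec_out[mask] = scores[mask, e].unsqueeze(-1) * expert(x[mask])
        return self.shared_expert(x) + spec_out                      # shared + chosen specialized expert
```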
The token vocabulary consists of 100K tokens from OpenAI’s tiktoken tokenizer, supplemented by 28K additional tokens designed to improve support for the Chinese language.
Pre-training: During the pre-training stage, the model acquires the fundamental capabilities of an LLM. The pre-training data consists of a natural text corpus focused on the Chinese and English languages. To enhance the model’s knowledge in areas such as mathematics, coding, low-resource languages, and high-educational-value topics, a large amount of synthetic data is generated using a four-step process, as illustrated in the following picture and sketched in code after the list:
- Instruction generation: High-quality instructions are created from diverse sources like web pages, question-answering data, code repositories, and books.
- Instruction evolution: Instructions are refined to make them clearer, more informative, and more challenging.
- Response generation: Specialized models generate accurate and informative answers.
- Response filtering: Low-quality data is filtered out, keeping only the best training text. A specialized model is used to evaluate and “critique” the quality of generated instruction-response pairs.
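Purely as an illustration, the four steps could be strung together as in the following pseudocode; every object and method name is a hypothetical placeholder and does not correspond to Tencent’s actual tooling.

```python
def build_synthetic_corpus(seed_sources, instruct_llm, critic_llm, quality_threshold=0.8):
    """Hypothetical sketch of the four-step synthetic data pipeline.
    All objects and method names here are illustrative placeholders."""
    corpus = []
    for source in seed_sources:                                    # web pages, Q&A data, code, books
        instruction = instruct_llm.generate_instruction(source)    # 1. instruction generation
        instruction = instruct_llm.evolve(instruction)             # 2. instruction evolution
        response = instruct_llm.answer(instruction)                # 3. response generation
        score = critic_llm.rate(instruction, response)             # 4. critique and filter
        if score >= quality_threshold:
            corpus.append({"instruction": instruction, "response": response})
    return corpus
```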
To reduce memory consumption and inference costs, two KV cache compression techniques are employed: Grouped-Query Attention (groups the key-value (KV) heads into 8 groups, reducing redundancy) and Cross-Layer Attention (shares KV cache between adjacent layers).
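The sketch below shows the core of grouped-query attention: many query heads share a small set of KV heads, so the cache stores far fewer key-value tensors than standard multi-head attention. Cross-layer attention is only indicated in a comment, and head counts are illustrative rather than Hunyuan-Large’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Sketch of grouped-query attention (GQA): 16 query heads share 8 KV heads,
    so the KV cache is half the size of standard multi-head attention.
    Head counts and dimensions are illustrative, not Hunyuan-Large's exact ones."""

    def __init__(self, d_model=1024, num_heads=16, num_kv_heads=8):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.h, self.h_kv = num_heads, num_kv_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, num_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, num_kv_heads * self.d_head)  # fewer KV heads
        self.v_proj = nn.Linear(d_model, num_kv_heads * self.d_head)  # => smaller KV cache
        self.o_proj = nn.Linear(num_heads * self.d_head, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.h, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.h_kv, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.h_kv, self.d_head).transpose(1, 2)
        # During generation, only (k, v) above would be cached; with cross-layer
        # attention (CLA), adjacent layers reuse one cache instead of storing their own.
        k = k.repeat_interleave(self.h // self.h_kv, dim=1)  # broadcast KV groups to all query heads
        v = v.repeat_interleave(self.h // self.h_kv, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```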
The model employs advanced routing methods. While traditional routing selects the top-k scoring experts to process each token, Hunyuan-Large activates 1 shared expert for all tokens and assigns 1 specialized expert to each token based on the highest score.
Conventional top-k routing strategies use a capacity factor to set the maximum load an expert can handle. When an expert is overloaded, some tokens are discarded, potentially leading to loss of crucial information and affecting training stability. To address this issue, Hunyuan-Large introduces a recycle routing strategy, where tokens initially routed to overloaded experts are randomly reallocated to other specialized experts that have not reached their capacity.
The following picture illustrates the recycle routing strategy in Hunyuan-Large, where each expert has a maximum capacity of 2. Token D is reallocated from expert 1 to expert 4.
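Here is a small sketch of the recycle-routing idea under an assumed per-expert capacity; the paper’s exact reallocation rule may differ.

```python
import torch

def recycle_route(top1_expert, num_experts, capacity):
    """Sketch of recycle routing: a token whose chosen expert is already full is
    randomly re-assigned to an expert with spare capacity instead of being dropped."""
    load = torch.zeros(num_experts, dtype=torch.long)
    assignment = torch.empty_like(top1_expert)
    for t, e in enumerate(top1_expert.tolist()):
        if load[e] >= capacity:                               # expert overloaded: recycle the token
            free = (load < capacity).nonzero().flatten()      # experts that still have room
            e = free[torch.randint(len(free), (1,))].item()   # assumes at least one free expert exists
        assignment[t] = e
        load[e] += 1
    return assignment

# Tokens A, B, C, D with per-expert capacity 2: token D overflows expert 1
# and is recycled to a random expert with spare capacity (e.g., expert 4).
top1 = torch.tensor([1, 1, 3, 1])
print(recycle_route(top1, num_experts=5, capacity=2))
```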
Each expert is trained with its own learning rate, ensuring that each expert learns at an optimal pace.
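One way to wire this up, reusing the SharedPlusSpecializedMoE sketch from earlier, is to give each expert its own optimizer parameter group. The square-root scale factor is an illustrative assumption, not the paper’s exact rule.

```python
import torch

# Reuses the SharedPlusSpecializedMoE sketch from earlier. The square-root
# scale factor is an illustrative assumption: each specialized expert sees only
# a fraction of the tokens, so its learning rate is scaled down accordingly.
model = SharedPlusSpecializedMoE()
base_lr = 3e-4
param_groups = [
    {"params": model.shared_expert.parameters(), "lr": base_lr},
    {"params": model.gate.parameters(), "lr": base_lr},
]
for expert in model.specialized:
    expert_lr = base_lr * (1.0 / len(model.specialized)) ** 0.5
    param_groups.append({"params": expert.parameters(), "lr": expert_lr})

optimizer = torch.optim.AdamW(param_groups)
```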
Post-training: This stage is designed for task-specific instruction following, skill enhancement, and alignment with human preferences. It includes two steps: Supervised Fine-Tuning (fine-tuning on labeled datasets for specific tasks) and Reinforcement Learning from Human Feedback (aligning the model with human values and preferences through iterative feedback).
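As a generic illustration of the first step, the snippet below performs one supervised fine-tuning update on an instruction-response pair with a Hugging Face-style causal LM; it is not Tencent’s post-training code, and the RLHF step is omitted.

```python
import torch

def sft_step(model, tokenizer, instruction, response, optimizer):
    """One supervised fine-tuning update on an instruction-response pair using a
    Hugging Face-style causal LM. Generic sketch, not Tencent's training code."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100         # compute loss only on the response tokens
    out = model(input_ids=full_ids, labels=labels)  # the model returns the shifted cross-entropy loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```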
Evaluation
Both the pre-trained and post-trained versions of Hunyuan-Large were evaluated to measure their effectiveness across a diverse range of tasks. The evaluation covered both Chinese and English and included mathematics and reasoning, coding, reading comprehension, commonsense understanding, long-context processing, and aggregated tasks.
The next table highlights the high performance of the Hunyuan-Large pre-trained model compared to competitive pre-trained models, including both dense and MoE-based architectures with similar activated parameter sizes.
It achieves top results across various benchmarks. On aggregated tasks such as MMLU, it outperforms the Llama 3.1-405B model by 3.2%, despite using significantly fewer activated parameters.
Hunyuan-Large-Instruct’s performance on various public benchmarks is illustrated in the table below. On the MMLU dataset, the model achieves a 2.6% improvement over Llama 3.1-405B, reflecting superior understanding and reasoning across diverse language tasks. It also surpasses Llama 3.1-405B by 3.6% on the MATH dataset, highlighting its advanced mathematical reasoning capabilities.
Conclusion
Hunyuan-Large is currently the largest open-source Transformer-based MoE model. It outperforms Llama 3.1-70B and offers comparable performance to the much larger Llama 3.1-405B. The model’s success is attributed to high-quality pre-training data, an optimized pre-training process, an improved model design with a recycle routing strategy, and expert-specific learning rates.
Its open-source nature allows for flexibility in fine-tuning and customization according to your needs.