Alibaba released Qwen2.5 with more than 100 open-source AI models

Alibaba Cloud recently announced the release of over 100 open-source Qwen2.5 multimodal models, along with a text-to-video generation tool. The models cover a wide range of applications, including language, audio, vision, code, and math. Described as one of the largest open-source releases ever, they range in size from 0.5B to 72B parameters and support over 29 languages.

Qwen2.5 is the latest upgrade of Qwen, a series of large language and multimodal models developed by Alibaba Cloud, first released in April 2023. It is accompanied by specialized models for coding and mathematics.

Article sections:

  1. Getting started: how to use the models
  2. Qwen2.5-Coder: how well it can code
  3. Qwen2.5-Math: its capabilities for solving math
  4. Evaluation: benchmark results
  5. Conclusion
  6. Links

Alibaba Cloud has also upgraded its flagship model, Qwen2.5-Max. It now matches the performance of leading models in fields like language comprehension, reasoning, mathematics, and coding (see the table below).

Qwen2.5-Max demonstrates strong performance (source: blog)

Alibaba introduced a new text-to-video generator that creates high-quality videos from text prompts in English and Chinese. It can produce videos in various styles, from realistic scenes to 3D animations, and can turn static images into dynamic videos based on text descriptions.

The vision language model, Qwen2-VL, was also updated. This model can now understand and analyze videos over 20 minutes long and answer questions based on the video content. It’s designed for use in devices like smartphones, cars, and robots.

The table below shows a comparison of different models:

| Feature | Qwen | Qwen2 | Qwen2.5 |
| --- | --- | --- | --- |
| Model sizes | 1.8B, 7B, 14B | 0.5B, 1.5B, 7B, 72B, 57B-A14B (14B parameters active) | Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B; Qwen2.5-Coder: 1.5B, 7B (32B on the way); Qwen2.5-Math: 1.5B, 7B, and 72B |
| Pre-training dataset size | 3T tokens | 7T tokens | Up to 18T tokens |
| General knowledge | Good | Improved | Significantly improved |
| Coding | Good | Improved | Greatly improved; Qwen2.5-Coder trained on 5.5T tokens of code-related data |
| Math | Good | Improved | Greatly improved with Qwen2.5-Math |
| Open source | Not all sizes | Not all sizes | All sizes except 3B and 72B |
| Availability | API services via Alibaba Cloud Model Studio | API services via Alibaba Cloud Model Studio | APIs for the flagship language models Qwen-Plus and Qwen-Turbo via Model Studio |
| Supported languages | 2 | Over 29 | Over 29 |
| Context length (processing) | 32K tokens | 128K tokens | 128K tokens |
| Generation length | 8K tokens | 8K tokens | 8K tokens |
Main features of the Qwen models (source: Qwen’s documentation)

Key improvements of the latest version, Qwen2.5, over its predecessors:

  1. Significantly more knowledge (MMLU: 85+)
  2. Greatly improved coding capabilities (HumanEval: 85+)
  3. Improved mathematics (MATH: 80+)
  4. Better instruction following and long-text generation (over 8K tokens)
  5. Understands structured data (e.g., tables)
  6. Generates structured outputs, especially JSON
  7. More robust to diverse user prompts
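To use the structured-output capability in practice, a common pattern is to ask the model for JSON and then validate what comes back before trusting it. A minimal sketch of that pattern follows; the reply string is a fabricated stand-in for a model response, not actual model output:

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Extract and validate a JSON object from a model reply.

    Models often wrap JSON in Markdown code fences, so strip those
    before parsing; json.loads raises if the payload is malformed.
    """
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        lines = text.splitlines()
        text = "\n".join(lines[1:-1])
    return json.loads(text)

# Hypothetical reply, in the shape a Qwen2.5 model might return when asked for JSON:
reply = '```json\n{"city": "Hangzhou", "population_millions": 12.2}\n```'
record = parse_json_reply(reply)
```

Validating instead of trusting the raw string matters because even JSON-tuned models occasionally add surrounding prose or fences.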

Qwen2.5 offers high performance without the massive overhead of models like GPT-4 or Llama-3.1. Despite having fewer parameters than these larger models, Qwen2.5 is optimized to handle a wide range of tasks with high accuracy and speed.

For example, the base language model of the flagship open-source Qwen2.5-72B performs remarkably well even when compared to much larger models such as Llama-3.1-405B, demonstrating that it can compete with, and sometimes surpass, larger counterparts.

Getting started with Qwen2.5

There are several ways to use Qwen2.5, offering flexibility based on your technical setup:

  1. Hugging Face Transformers: This is one of the easiest methods for developers, where you can integrate the model using the Transformers library by Hugging Face. It involves a few lines of Python code to load and run Qwen2.5 locally.
  2. Ollama: This allows you to run Qwen2.5 models locally. After setting up the Ollama service, you can use commands to load specific model checkpoints and interact with the model.
  3. vLLM: This framework is designed for efficient model deployment. It lets you deploy Qwen2.5 as an OpenAI-compatible API service, useful for large-scale inference.
  4. ModelScope: This platform provides another way to access Qwen2.5 and is recommended for users in mainland China.

Read the full instructions in the Quickstart section of the GitHub repository. It lists other methods such as llama.cpp, MLX-LM (for Apple Silicon), LM Studio, and OpenVINO. You can also test the Qwen2.5 models online using this Hugging Face space.
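For the Transformers route, the snippet below sketches the usual load-and-generate loop following the chat-template pattern from the Qwen model cards. The checkpoint name and generation settings are illustrative, so adjust them to your hardware; the heavy imports are deferred inside the function so that merely defining the helpers triggers no download:

```python
def build_messages(prompt: str, system: str = "You are a helpful assistant.") -> list:
    """Assemble a chat in the role/content format the chat template expects."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]

def generate(prompt: str,
             model_id: str = "Qwen/Qwen2.5-7B-Instruct",
             max_new_tokens: int = 256) -> str:
    """Load a Qwen2.5 checkpoint and answer a single prompt.

    Imports are deferred so defining this helper does not require
    transformers/torch to be installed or fetch model weights.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    # Render the chat into the model's prompt format, ending with the
    # assistant turn so the model continues from there.
    text = tokenizer.apply_chat_template(
        build_messages(prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Smaller checkpoints such as the 0.5B or 1.5B instruct variants are a reasonable starting point on modest hardware.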

Qwen2.5-Coder

Qwen2.5-Coder is designed specifically for coding tasks and is an upgraded version of CodeQwen1.5. It comes with two models: Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B. Built on the Qwen2.5 architecture, it has been further pretrained on a vast corpus of over 5.5 trillion tokens.

The model can understand and generate code across 92 programming languages. It creates high-quality code snippets, functions, and even entire modules based on simple user prompts.

(source: technical report)

Qwen2.5-Coder continuously learns from user interactions, improving its performance over time. This adaptive learning helps the model align with individual coding styles and preferences.
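Beyond plain generation, code models in this family are commonly used for fill-in-the-middle (FIM) completion, where the model sees the code before and after a gap and fills the gap. The special token names below follow the convention documented for Qwen's coder models, but treat them as an assumption and check the Qwen2.5-Coder repository for the exact tokens:

```python
# Assumed FIM special tokens; verify against the Qwen2.5-Coder documentation.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out code-before and code-after so the model completes the gap.

    The model is expected to emit the missing middle after FIM_MIDDLE.
    """
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

# Ask the model to fill in the body of a function:
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n\nprint(add(2, 3))",
)
```

A prompt like this is then passed to the base (non-instruct) Coder checkpoint, since FIM is a completion-style task rather than a chat one.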

Qwen2.5-Math

The Qwen2.5-Math series is created for advanced math reasoning and problem-solving. It includes Qwen2.5-Math and Qwen2.5-Math-Instruct models (1.5B, 7B, and 72B). A key feature of these models is their self-improvement mechanism, which is used throughout their development stages: pre-training, post-training, and inference.

  1. During pre-training, Qwen2-Math-Instruct generates large-scale, high-quality mathematical datasets.
  2. In the post-training phase, a reward model (RM) is created through extensive sampling and iteratively refined through supervised fine-tuning (SFT). This RM helps evaluate model outputs, offering feedback that improves the quality of the solutions produced.
  3. During inference, the RM guides the sampling process, helping to improve the quality of responses, especially for complex mathematical reasoning tasks.

The Qwen2.5-Math-Instruct models support both English and Chinese, showcasing advanced reasoning capabilities such as Chain-of-Thought (CoT) reasoning, Program-of-Thought (PoT) reasoning, and Tool-Integrated Reasoning (TIR).

Qwen2.5-Math (source: technical report)
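The difference between these reasoning modes is easiest to see with Program-of-Thought: instead of working through arithmetic in prose, the model emits a short program whose execution yields the answer. The snippet below is a hypothetical example of the kind of program a PoT response might contain for a grade-school word problem, not actual model output:

```python
# Word problem: "A bakery sells 12 trays of 8 muffins each per day.
# How many muffins does it sell in a 30-day month?"

def solve() -> int:
    trays_per_day = 12
    muffins_per_tray = 8
    days = 30
    muffins_per_day = trays_per_day * muffins_per_tray  # 12 * 8 = 96
    return muffins_per_day * days                       # 96 * 30 = 2880

answer = solve()  # 2880
```

Because the arithmetic is delegated to the interpreter, PoT avoids the calculation slips that plague free-form chain-of-thought on multi-step problems.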

Qwen2.5 evaluation

Qwen2.5-72B-Instruct has been benchmarked against leading open-source models such as Mistral-Large2-Instruct, Llama-3.1-70B-Instruct, and Llama-3.1-405B-Instruct. The results in the next table indicate that Qwen2.5-72B is highly capable, competing with similar and even much larger models.

Qwen2.5-72B-Instruct scores (source: blog)

Remarkably, Qwen2.5-72B outperforms larger models such as Llama-3.1-405B on certain benchmarks, including MMLU (Massive Multitask Language Understanding), MBPP (Mostly Basic Python Problems), and GSM8K (Grade School Math 8K), as well as mathematical problems from various domains (see the table below).

Qwen2.5-72B scores (source: blog)

Qwen-Plus demonstrates competitive results against proprietary models like GPT-4o and Claude-3.5-Sonnet, though it still trails on some benchmarks, such as HumanEval and mathematical reasoning.

Qwen-Plus scores (source: blog)

The Qwen2.5-Math models were evaluated on 10 mathematics datasets, ranging from basic to competition-level problems, such as GSM8K, AMC23, and AIME24.

The flagship Qwen2.5-Math-72B model outperforms both open-source models and leading closed-source models like GPT-4 and Gemini Math-Specialized 1.5 Pro, especially on the challenging AMC23 benchmark.

Qwen2.5-Coder achieves state-of-the-art performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming other models of the same size.

Conclusion

Qwen2.5 balances performance and efficiency, delivering high capabilities without needing extensive infrastructure. As an open-source suite with over 100 models, it can be tailored to your needs and computational resources, making it suitable for applications like building chatbots, fine-tuning for specific tasks, and generating creative content.
