Released on April 28, 2025, Qwen3 is an open-source family of large language models (LLMs) that extends the Qwen series from Alibaba Cloud. It features hybrid reasoning capabilities, supporting both thinking and non-thinking modes, along with an extended context window of up to 128K tokens.
Key features
- Hybrid thinking modes, allowing it to switch between different reasoning approaches for complex tasks.
- Broad multilingual support, handling 119 languages and dialects across both natural-language and code inputs.
- Improved agentic capabilities, enabling advanced function-calling and integration with external tools.
This launch includes 6 dense models (0.6B, 1.7B, 4B, 8B, 14B, and 32B parameters) and 2 Mixture-of-Experts (MoE) models (30B with 3B active, and 235B with 22B active).
Qwen3 dense models offer an extended context window of up to 128K tokens (32K natively for the smaller variants), well beyond the 4K–32K limits typical of many earlier LLMs. This advancement enables Qwen3 to process and analyze very large inputs, such as entire books, lengthy legal contracts, multi-turn conversations, and extensive datasets.
Accessing the models
Qwen3 models are open-source and accessible on Hugging Face, ModelScope (for users in China), and Kaggle.
For efficient deployment, frameworks like SGLang and vLLM are recommended due to their scalability. For local integration, tools such as Ollama, LMStudio, MLX, llama.cpp, and KTransformers offer streamlined functionality.
You can try the model on Qwen Chat Web (chat.qwen.ai) and the mobile app.
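For self-hosted deployment, vLLM and SGLang both expose an OpenAI-compatible HTTP API out of the box. The sketch below builds a chat-completions request payload for such a server; the endpoint URL, port, and model name are assumptions for a local setup (e.g. one started with `vllm serve Qwen/Qwen3-8B`), not fixed values.

```python
import json

# Assumed local endpoint of a vLLM/SGLang OpenAI-compatible server.
API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-8B") -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
    }

payload = build_chat_request("Explain mixture-of-experts in one paragraph.")
print(json.dumps(payload, indent=2))
# To send it, POST this JSON body to API_URL with any HTTP client.
```

The same payload shape works against any of the recommended serving frameworks, since they all implement the OpenAI chat-completions schema.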
Hybrid thinking modes
Qwen3 introduces a hybrid reasoning mechanism, allowing the model to switch between:
- Thinking mode, for complex tasks like mathematics and coding. This is a slow, deliberate problem-solver, going through the steps methodically.
- Non-thinking mode, for general-purpose responses. This is a quick responder, giving answers almost instantly.
The integration of these two modes allows users to configure task-specific computational budgets with greater precision: extended reasoning can be employed for challenging problems, while simpler questions can be addressed with minimal latency by using the non-thinking mode.
The figure below illustrates Qwen3’s reasoning performance across multiple benchmarks, comparing Thinking Mode and Non-thinking Mode with varying thinking budgets (measured in thousands of tokens). Performance improves as the thinking budget increases, showing clear trends across different benchmarks:
- AIME24: Rises from ~40% to over 80%
- AIME25: Increases from ~30% to above 80%
- LiveCodeBench (v5): Improves from ~45% to 67%
- GPQA Diamond: Smaller increase, from ~64% to 72%
Thinking Mode consistently benefits from a larger budget, particularly in AIME24 and AIME25. While the increase is less pronounced in GPQA Diamond, it still shows a positive correlation.
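Conceptually, a thinking budget is enforced by cutting off the chain-of-thought once a token quota is spent and forcing the model to answer. The sketch below is purely illustrative of that idea (it approximates tokens by whitespace splitting); a real deployment would count tokenizer tokens and stop generation at the budget.

```python
def apply_thinking_budget(thinking_trace: str, budget_tokens: int) -> str:
    """Illustrative only: truncate a chain-of-thought to a token budget.

    Tokens are approximated by whitespace words here; production systems
    would meter actual tokenizer tokens during generation instead.
    """
    tokens = thinking_trace.split()
    if len(tokens) <= budget_tokens:
        return thinking_trace
    return " ".join(tokens[:budget_tokens]) + " ... [budget reached]"

trace = "First factor the polynomial then check each root by substitution"
print(apply_thinking_budget(trace, 4))
```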

Multilingual support
Qwen3 supports 119 languages and dialects, enabling users to implement advanced AI solutions in diverse regions, in both major world languages and underrepresented dialects.
Improved external interactions
Using the Qwen-Agent framework, the model can seamlessly interact with external tools and environments, particularly in coding tasks and complex agent-based scenarios. Qwen3 integrates the Model Context Protocol (MCP), an open standard that defines how the model communicates with external systems. Acting as a universal interface, MCP enables Qwen3 to use multiple tools in sequence during its reasoning process, using intermediate results between steps. This improves the model’s performance in multi-step, agent-based workflows.
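The core loop such frameworks implement is simple: the model emits a structured tool call, the runtime dispatches it to a registered function, and the result is fed back as a new message for the next reasoning step. The sketch below shows that dispatch step in generic form; it is not the Qwen-Agent API or the MCP wire format, and the `add` tool is a stand-in for a real external capability.

```python
import json

# Registry of callable tools; `add` is a hypothetical stand-in tool.
TOOLS = {
    "add": lambda a, b: a + b,
}

def dispatch_tool_call(raw: str) -> dict:
    """Parse a model-emitted JSON tool call and return a tool-result message
    that can be appended to the conversation for the next model step."""
    call = json.loads(raw)
    result = TOOLS[call["name"]](**call["arguments"])
    return {"role": "tool", "name": call["name"], "content": str(result)}

msg = dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}')
print(msg)  # {'role': 'tool', 'name': 'add', 'content': '5'}
```

MCP standardizes exactly the pieces this sketch hand-rolls: tool discovery, the call schema, and the result format.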
The following examples illustrate Qwen3’s reasoning process and its interaction with external tools or environments.
Pre-training
Qwen3’s pretraining involved a significantly larger and more diverse dataset than Qwen2.5, expanding from 18T to approximately 36T tokens across 119 languages. The data was sourced from the web and structured documents, with Qwen2.5 variants used for text extraction and quality enhancement. Synthetic math and code data were generated using Qwen2.5-Math and Qwen2.5-Coder.
The pretraining process had 3 stages:
- Stage 1: Over 30T tokens with a 4K context length to establish core language and knowledge capabilities.
- Stage 2: Added 5T tokens focused on STEM, coding, and reasoning to increase knowledge depth.
- Stage 3: Introduced high-quality long-context data to extend the model’s context length to 32K, enhancing its ability to process long inputs.
Qwen3 dense base models match the performance of larger Qwen2.5 models due to improvements in architecture, training data, and methods. Qwen3-1.7B/4B/8B/14B/32B-Base performs as well as Qwen2.5-3B/7B/14B/32B/72B-Base. In STEM, coding, and reasoning, Qwen3 even surpasses bigger Qwen2.5 models. Meanwhile, Qwen3-MoE models achieve similar results to Qwen2.5 dense models while using only 10% of the active parameters, significantly reducing training and inference costs.
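The "10% of the active parameters" figure follows directly from the model specs given earlier: each token is routed through only a small subset of experts, so the active parameter count is far below the total.

```python
# Active-parameter fractions for the two Qwen3 MoE models, in billions.
moe_models = {
    "Qwen3-30B-A3B": (3, 30),      # (active, total)
    "Qwen3-235B-A22B": (22, 235),
}
for name, (active, total) in moe_models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

Inference cost scales with active parameters, which is why a 235B-parameter MoE can be served at roughly the cost of a ~22B dense model.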
Post-training
The post-training pipeline focuses on fine-tuning and reinforcement learning to develop the hybrid reasoning and rapid response capabilities of Qwen3 (see the picture below).

The post-training includes 4 stages:
- Long chain-of-thought (CoT) cold start: The model was fine-tuned on diverse long CoT data across various domains such as mathematics, coding, logical reasoning, and STEM to build foundational reasoning skills.
- Reasoning-based reinforcement learning (RL): Computational resources were scaled up to apply RL with rule-based rewards, improving the model’s ability to explore and exploit reasoning strategies.
- Thinking mode fusion: Non-thinking capabilities were integrated into the thinking model, enabling both reasoning and quick response abilities. To achieve this, the model was fine-tuned on a mix of long CoT data and general instruction-tuning data. Notably, this training data was generated by the thinking model during the second stage.
- General reinforcement learning: RL was applied to over 20 general-domain tasks — including instruction following, format adherence, and agentic functions — to further improve overall capabilities and correct undesirable behaviors.
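A "rule-based reward", as used in the second stage, is a deterministic check rather than a learned reward model. The toy function below illustrates the idea for math-style answers with a verifiable final result; it is our own sketch, since the exact reward rules used for Qwen3 are not public in detail.

```python
def rule_based_reward(model_answer: str, reference: str) -> float:
    """Toy rule-based reward: 1.0 if the normalized final answer matches
    the reference exactly, else 0.0. No learned reward model involved."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

print(rule_based_reward("The answer is 42."[-3:], "42"))  # exact-match case
```

Because the reward is computed by a verifier rather than a neural critic, it cannot be gamed by stylistic tricks, which makes it well suited to math and coding RL.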
Evaluation
Qwen3 builds on its predecessors with significant improvements in reasoning, multilingual understanding, and tool use. To assess its real-world performance, the models were evaluated and compared to leading open-source and proprietary models using standardized benchmarks, public leaderboards, and human preference evaluations.
The following benchmarks were employed:
| Benchmark | What it measures |
|---|---|
| ArenaHard | Win rate (%) in direct comparisons on complex prompts. Shows general LLM quality in reasoning, helpfulness, and coherence on challenging real-world tasks. |
| AIME'24 / AIME'25 | Accuracy on math problems from the American Invitational Mathematics Exam. |
| LiveCodeBench | Code generation ability on real-world programming challenges. |
| CodeForces (Elo Rating) | Elo-style score based on solving competitive programming problems. Helps compare algorithmic coding ability between models. |
| Aider (Pass@2) | Success rate for code edits or completions within two tries. Shows how useful the model is in interactive coding workflows. |
| LiveBench | How well the model handles a wide range of practical challenges, using a mix of coding, math, and reasoning tasks. |
| BFCL (v3) | Accuracy in structured function calling and tool use. Important for evaluating multi-step API interactions and agentic tasks. |
| MultiIF | Performance on instruction-following tasks in 8 different languages. Checks the model's multilingual understanding and flexibility. |
| GPQA | Deep factual knowledge and reasoning on graduate-level multiple-choice questions in science and logic. |
The table below shows the evaluation results for Qwen3-235B-A22B and Qwen3-32B. Qwen3-235B-A22B delivers strong performance across coding, mathematics, and general reasoning tasks, matching or outperforming top models like DeepSeek-R1, Gemini-2.5-Pro, and Grok-3 on various benchmarks.

As we can see in the table below, the smaller MoE model, Qwen3-30B-A3B, achieves better results than QwQ-32B while using only 10% as many activated parameters during inference.
Additionally, the compact Qwen3-4B rivals the capabilities of the much larger Qwen2.5-72B-Instruct model. This efficiency stems from MoE architectures and optimized training, enabling high performance at reduced computational costs.

Conclusion
Alibaba's Qwen3 is a significant advancement in open-source AI development. The suite includes a range of models, from lightweight versions for local deployment to high-capacity models for enterprise applications. Qwen3 introduces hybrid reasoning, allowing it to switch between thinking and non-thinking modes depending on the task. It excels in coding, multilingual understanding, and long-context tasks, and offers agentic capabilities through advanced function calling and tool integration.