Meta introduces SWE-RL, marking the first time reinforcement learning has been used to improve the coding capabilities of LLMs by training them on real-world open-source code repositories.
SWE-RL is an open-source project available on GitHub. Its repository includes source code, documentation, and setup instructions. To get started, simply clone the repository and follow the provided guidelines.
Reinforcement learning (RL) improves general LLM reasoning, as demonstrated by DeepSeek-R1's impressive performance. Powered by RL and an extensive 671B parameters, DeepSeek-R1 excels at competitive coding and mathematical problem-solving. However, its effectiveness on real-world software engineering (SWE) tasks remains limited, because these tasks often involve dependencies, system interactions, long-term effects, and unclear correctness criteria.
SWE-RL addresses this limitation by combining software evolution data with rule-based rewards to create models that think like software developers: they not only show strong coding skills, but also improve by learning from past experience.
Specifically, Llama3-SWE-RL-70B, developed by training Llama-3.3-70B-Instruct with SWE-RL, demonstrates outstanding coding capabilities. With a 41.0% resolve rate on SWE-bench Verified, it surpasses all other medium-sized language models (<100B parameters) and competes with leading proprietary models like GPT-4o.
Key contributions
- SWE-RL trains LLMs on GitHub’s complete software history, including code snapshots, changes, issues, and pull requests (PRs).
- Llama3-SWE-RL-70B excels at solving real-world GitHub issues.
- SWE-RL enhances LLMs beyond coding, improving their math, reasoning, and language understanding capabilities, despite being trained exclusively on software development data and rule-based rewards.
SWE-RL pipeline
SWE-RL trains models with RL on data from GitHub pull requests, which record how developers modify and improve code. By applying simple rule-based rewards, SWE-RL prioritizes reasoning over plain code generation: the model first analyzes the problem described in the issue, works out the required code changes, and then generates the edit with an understanding of the surrounding context. The image below illustrates the four main steps of the SWE-RL process.

1. Create an initial RL dataset (the seed). First, a highly curated dataset is built from GitHub pull requests, capturing the full lifecycle of software development: issue descriptions, code snapshots, and the verified, human-written code changes (oracle patches).
2. Train the policy LLM. Once the initial dataset is prepared, the LLM generates code changes based on the provided issue descriptions and code context. The picture below shows the prompt template used to train Llama3-SWE-RL with SWE-RL.

3. Compute the reward. If the LLM's response is correctly formatted, the extracted code change receives a continuous reward between 0 and 1 based on its similarity to the human-written oracle patch; incorrectly formatted responses are penalized with a reward of -1. This way the model learns from both its successes and its mistakes, improving its coding capabilities over time (see the reward sketch after this list).
4. Refine model learning. The last step in the SWE-RL process is policy optimization with the Group Relative Policy Optimization (GRPO) algorithm. GRPO samples a group of candidate responses for each problem and reinforces those whose rewards exceed the group average, without requiring a separate value model. Through this iterative process, the model progressively enhances its ability to generate high-quality code changes (see the advantage sketch below).
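As a concrete illustration of step 3, here is a minimal Python sketch of the rule-based reward. The paper computes the similarity between the predicted and oracle patches with Python's difflib sequence matching; the patch-extraction helper and its marker format below are hypothetical stand-ins for the structured edit format used in the actual pipeline.

```python
import difflib
from typing import Optional


def extract_patch(response: str) -> Optional[str]:
    """Hypothetical parser: the real pipeline extracts a structured
    search/replace edit; here we simply split on a marker line."""
    marker = "### PATCH"  # illustrative only, not the real format
    if marker not in response:
        return None
    return response.split(marker, 1)[1].strip()


def compute_reward(response: str, oracle_patch: str) -> float:
    """Rule-based reward: -1 for malformed output, otherwise a
    similarity score in [0, 1] against the human-written oracle patch."""
    predicted = extract_patch(response)
    if predicted is None:  # wrong format -> fixed penalty
        return -1.0
    return difflib.SequenceMatcher(None, predicted, oracle_patch).ratio()
```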
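For step 4, the core idea behind GRPO can be sketched as follows: several responses are sampled for the same issue, scored with the reward above, and each response's advantage is its reward normalized by the group's mean and standard deviation. This is a simplified illustration of the group-relative advantage only; the clipped policy-gradient objective and KL regularization of the full algorithm are omitted.

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled response's reward
    by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]


# Example: four candidate patches sampled for the same issue.
# The malformed response (-1.0) gets a negative advantage and is pushed
# down, while patches closer to the oracle are reinforced.
print(group_relative_advantages([-1.0, 0.42, 0.87, 0.42]))
```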
“Aha moments” and generalized reasoning capabilities. The RL approach in SWE-RL enables the development of emergent reasoning skills, often referred to as “aha moments,” similar to those observed in DeepSeek-R1. These skills include self-reflection, exploring alternative solutions, and employing divide-and-conquer strategies, which contribute to the model’s strong performance in a variety of tasks. Notably, the model excels not only in in-domain tasks like issue-solving but also in out-of-domain tasks such as function implementation and mathematical reasoning.
Data curation process
The Llama3-SWE-RL model was trained using a curated dataset of PRs from the GitHub Archive. The figure below illustrates SWE-RL’s raw PR data curation process.

- GitHub events and clones: 4.6M repositories were cloned, including their complete commit history from January 1, 2015, to August 31, 2024. To ensure an unbiased evaluation of SWE-RL, the researchers excluded repositories used in existing SWE benchmarks.
- PR data aggregation: They aggregated all pertinent information associated with each PR individually. This includes issues, discussions, comments, initial code contents, and subsequent commits and code changes.
- Relevant files prediction: Including only the modified files in the PR data caused the model to over-edit, so Llama-3.1-70B-Instruct was used to identify relevant but unmodified files and add them to the context.
- Data filtering: The dataset was cleaned by removing duplicates, low-quality PRs, and incomplete data, as sketched below.
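To make the last step more concrete, here is a minimal sketch of what the deduplication and filtering pass might look like. The record keys and the specific quality heuristics (bot authors, missing fields) are assumptions for illustration, not the exact rules used by the authors.

```python
def filter_prs(prs: list[dict]) -> list[dict]:
    """Drop duplicate, incomplete, and low-quality PR records.

    Each record is assumed (hypothetically) to carry the keys
    "repo", "number", "issue_text", "patch", and "author".
    """
    seen: set[tuple[str, int]] = set()
    kept = []
    for pr in prs:
        key = (pr["repo"], pr["number"])
        if key in seen:  # duplicate PR
            continue
        seen.add(key)
        if not pr.get("issue_text") or not pr.get("patch"):
            continue  # incomplete record
        if pr.get("author", "").endswith("[bot]"):
            continue  # likely an automated, low-quality PR
        kept.append(pr)
    return kept
```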
Evaluation
Llama3-SWE-RL-70B resolves 41.0% of the issues in SWE-bench Verified (a subset of SWE-bench with 500 human-verified problems), with the continuous, similarity-based reward being a major factor in its strong reasoning abilities. The model performs comparably to GPT-4o while surpassing other medium-sized open-source models, despite being trained exclusively on public data with the SWE-RL framework. The table below provides a detailed overview of its performance.

Conclusion
SWE-RL represents the first RL method aimed at improving LLMs for real-world software engineering tasks. It demonstrates that open-source software evolution data, combined with rule-based rewards and RL, can build coding assistants that outperform other medium-sized open models on existing benchmarks.