Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel

TL;DR
This paper introduces a reinforcement learning approach for training large language models to perform multi-turn, interactive software engineering tasks, significantly improving their performance on relevant benchmarks.
Contribution
It presents a novel RL training pipeline with rejection fine-tuning and DAPO, enabling open-weight models to excel in multi-turn software engineering environments.
Findings
Pass@1 increased from 11% to 39% on SWE-bench Verified
Achieved 35% and 31% Pass@1 on SWE-rebench splits
Method is effective with open-weight models for complex tasks
Abstract
Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using…
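The rejection fine-tuning step mentioned in the abstract can be illustrated with a minimal sketch: sample rollouts from the current policy, keep only those that the environment's execution feedback marks as successful, and use the survivors as supervised fine-tuning data. The `Trajectory` structure, its field names, and the extra formatting filter are hypothetical and chosen for illustration; the abstract does not specify the paper's actual selection criteria.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    """One multi-turn agent rollout (hypothetical structure, not from the paper)."""
    task_id: str
    turns: List[str]     # alternating agent actions and environment observations
    tests_passed: bool   # execution feedback: did the final patch pass the task's tests?
    format_ok: bool      # assumption: every agent action parsed as a valid command


def select_for_rft(rollouts: List[Trajectory]) -> List[Trajectory]:
    """Keep only rollouts that succeeded and stayed well-formatted."""
    return [r for r in rollouts if r.tests_passed and r.format_ok]


# Toy rollouts: two attempts at one task, one malformed attempt at another.
rollouts = [
    Trajectory("task-1", ["ls", "README.md", "submit"], tests_passed=True, format_ok=True),
    Trajectory("task-1", ["ls", "README.md", "submit"], tests_passed=False, format_ok=True),
    Trajectory("task-2", ["<garbled>"], tests_passed=False, format_ok=False),
]

sft_data = select_for_rft(rollouts)
print(len(sft_data))  # number of trajectories kept for supervised fine-tuning
```

In this sketch the rejected rollouts are simply dropped; the accepted ones would then be used for ordinary supervised fine-tuning before the RL stage.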
Peer Reviews
Decision·Submitted to ICLR 2026
- This paper presents a clear, end-to-end agent training pipeline for multi-turn SWE and a clear problem formulation as a POMDP. The presented two-phase recipe, RFT followed by RL, is standard yet effective.
- I like the transparency of the engineering and the negative results about the decoding mismatch. The paper also cautions against discarding over-long trajectories (which could hide looping and make the model less generalizable). This provides valuable guidance to the community and rarely d
- The core ingredient, Rejection Sampling Finetuning (RFT) followed by on-policy RL, is a well-executed application to various tasks with verifiable rewards, such as math reasoning and single-turn/multi-turn code generation. The algorithm design choice (DAPO/GRPO) is known; the new bit is tailoring the reward shaping to turn count and getting a big model to a 131k context length stably. While the application and scaling are novel and successful, I’m missing a sharper “what’s truly new vs.
1. The use of RL to train agents for multi-turn, stateful interactions is highly relevant for advancing LLM-based applications in real-world domains such as SWE. The focus on a structured RL pipeline for this problem is valuable and timely.
2. The combination of RFT and DAPO appears to be effective, as evidenced by the substantial improvements in Pass@1 scores on SWE-bench Verified and SWE-rebench. These results demonstrate the practicality of the approach, particularly when using open-weight mo
1. While the results are promising, the novelty of DAPO as a contribution is not entirely clear. The paper would benefit from a clearer comparison with concurrent or prior approaches to clarify how DAPO differs from related methods. This is particularly important because RL for LLMs is a rapidly evolving field, and comparisons with recent work may be necessary to establish the significance of the contribution.
2. The paper's presentation could be improved. For example: The novelty of the work is
1. The research topic is clear, practical, and timely. The research gap is that most LLM-with-RL studies focus on single-turn tasks, while many real-world tasks like SWE require long-horizon, interactive reasoning. To address this gap, this paper provides a reasonable design for the practical SWE scenario.
2. Strong empirical results. The experiments show a large improvement on the selected base model (14.5% -> 36.5%), even comparable with a larger model (e.g., DeepSeek-V3-0324)
1. Limited novelty in algorithmic design. Neither RFT nor DAPO is new. Previous work, such as Yuan et al. (https://arxiv.org/abs/2308.01825), has leveraged this technique to improve reasoning ability, and DAPO is from Yu et al., as cited. Some modifications, such as the turn penalty, are more engineering than theoretical innovation.
2. Limited ablation of each component. Although Table 1 shows performance improving as more stages are added, the effect of each component in the trai
