TL;DR
This paper presents a verifiable, scalable reinforcement learning (RL) pipeline for code repair in real repositories. RL improves on a supervised fine-tuned baseline and dependency pinning improves reproducibility, but the trained models fail to generalize across environments.
Contribution
Introduces a verifiable, scalable RL pipeline for code fixing and shows that RL applied on top of a supervised fine-tuned (SFT) model improves performance on real-world issues.
Findings
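RL adds 7-20% absolute gains on top of the SFT model under matched train-test conditions.
Pinning dependencies and disabling automatic upgrades improves reproducibility across ~1K real issues.
Both SFT and RL models fail to generalize across environments.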
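To make the dependency-pinning finding concrete, here is a minimal sketch of a check that flags unpinned entries in a dependency list. It assumes a pip-style requirements format, which the summary does not specify; the function name `unpinned_requirements` and the regular expression are hypothetical and are not taken from the paper.

```python
import re

# Assumption: pip-style "name==version" lines, with optional extras such as
# "requests[socks]==2.31.0". Wildcards like "1.*" are treated as unpinned.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s*]+$")


def unpinned_requirements(lines):
    """Return the requirement lines that are not pinned to an exact version.

    Hypothetical helper for illustration only. Lines with environment
    markers or other extended syntax are conservatively reported as unpinned.
    """
    loose = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if not _PINNED.match(line):
            loose.append(line)
    return loose
```

An empty result means every listed dependency resolves to a single version, which is the precondition for repeatable builds under this assumed format.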
Abstract
We tackle the challenge of training reliable code-fixing agents in real repositories, where complex builds and shifting dependencies make evaluation unstable. We developed a verifiable pipeline with success defined as post-fix build validation and improved reproducibility across ~1K real issues by pinning dependencies and disabling automatic upgrades. Building on this, we introduced a scalable simplified pipeline for large-scale reinforcement learning (RL). Using this setup, we supervised fine-tuned Qwen3-32B in the full pipeline and applied RL on top of the SFT model in the simplified environment. The SFT model distilled from GPT-4.1 trajectories performs on par while being 56x smaller, and RL added 7-20% absolute gains under matched train-test conditions. "Thinking mode" was on par or worse in our experiments. Both SFT and RL models failed to generalize across environments,…
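The abstract defines success as post-fix build validation. A minimal sketch of what such a verifiable signal could look like follows, assuming the candidate fix has already been applied to the repository and that a binary reward is used. The function name, its parameters, the timeout, and the 0/1 reward values are illustrative assumptions, not the authors' implementation.

```python
import subprocess
from typing import Sequence


def build_validation_reward(
    repo_dir: str,
    build_cmd: Sequence[str],
    timeout_s: float = 600.0,
) -> float:
    """Return 1.0 if the build command exits with status 0, else 0.0.

    Hypothetical helper: the name, signature, and binary reward are
    assumptions for illustration, not the paper's pipeline.
    """
    try:
        result = subprocess.run(
            list(build_cmd),
            cwd=repo_dir,
            capture_output=True,
            timeout=timeout_s,
            check=False,  # inspect the return code instead of raising
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung build or a command that cannot be launched counts as a failed fix.
        return 0.0
    return 1.0 if result.returncode == 0 else 0.0
```

Because the signal depends only on the exit status of the build, it is only as stable as the environment it runs in, which is why pinned dependencies matter for this kind of evaluation.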