Pull Requests as a Training Signal for Repo-Level Code Editing
Qinglin Zhu, Tianyu Chen, Shuai Lu, Lei Ji, Runcong Zhao, Murong Ma, Xiangxiang Dai, Yulan He, Lin Gui, Peng Cheng, Yeyun Gong

TL;DR
This paper introduces Clean-PR, a novel training paradigm that leverages real-world GitHub pull requests as a high-quality signal for improving repository-level code editing models, achieving significant performance gains without complex scaffolding.
Contribution
The paper presents a scalable pipeline to convert pull requests into training data and demonstrates that models trained with this signal outperform baselines on SWE-bench tasks.
Findings
The model trained with Clean-PR surpasses baselines by 13.6% on SWE-bench Lite.
The resulting corpus of 2 million pull requests, spanning 12 programming languages, is the largest publicly available.
The model internalises repository-level editing capabilities without heavy inference scaffolding.
Abstract
Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with…
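The conversion step described in the abstract (diff to Search/Replace block, followed by validation) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the names `EditBlock`, `hunk_to_block`, `apply_block`, and `is_valid` are hypothetical, and the sketch assumes a single unified-diff hunk whose lines start with a space, `-`, or `+`. Validation here simply means that applying the block to the pre-change file reproduces the post-change file.

```python
from dataclasses import dataclass


@dataclass
class EditBlock:
    """Hypothetical container for one Search/Replace edit."""
    path: str
    search: str
    replace: str


def hunk_to_block(path: str, hunk_lines: list[str]) -> EditBlock:
    """Turn the body of one unified-diff hunk into a Search/Replace block."""
    search, replace = [], []
    for line in hunk_lines:
        tag, text = line[:1], line[1:]
        if tag == " ":        # context line: present before and after
            search.append(text)
            replace.append(text)
        elif tag == "-":      # removed line: only in the old file
            search.append(text)
        elif tag == "+":      # added line: only in the new file
            replace.append(text)
    return EditBlock(path, "\n".join(search), "\n".join(replace))


def apply_block(source: str, block: EditBlock) -> str:
    """Apply a block, requiring the search text to match exactly once."""
    if source.count(block.search) != 1:
        raise ValueError("search text must match exactly once")
    return source.replace(block.search, block.replace, 1)


def is_valid(old: str, new: str, block: EditBlock) -> bool:
    """Keep a block only if applying it to `old` reproduces `new`."""
    try:
        return apply_block(old, block) == new
    except ValueError:
        return False


# Toy example (assumed data, not taken from the corpus).
old_file = "def add(a, b):\n    return a - b\n"
new_file = "def add(a, b):\n    return a + b\n"
hunk = [" def add(a, b):", "-    return a - b", "+    return a + b"]
block = hunk_to_block("calc.py", hunk)
```

Requiring exactly one match for the search text is a design choice of this sketch: it rejects ambiguous blocks whose search text appears several times (or not at all) in the file, which is one plausible way to filter noisy diffs.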
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management
