SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding
Songcheng Cai, Zhiheng Lyu, Yuansheng Ni, Xiangchao Chen, Baichuan Zhou, Shenzhe Zhu, Yi Lu, Haozhe Wang, Chi Ruan, Benjamin Schneider, Weixu Zhang, Xiang Li, Andy Zheng, Yuyu Zhang, Ping Nie, Wenhu Chen

TL;DR
This paper introduces SWE-QA-Pro, a comprehensive benchmark for repository-level code understanding, and a scalable training pipeline that enhances small models' ability to perform complex agentic tasks, bridging the gap to larger models.
Contribution
We created a diverse, balanced benchmark for repository-level code understanding and developed a scalable training recipe combining supervised fine-tuning and reinforcement learning to improve model capabilities.
Findings
Agentic workflows outperform direct answer baselines by ~13 points.
Qwen3-8B trained with our method surpasses GPT-4o by 2.3 points on SWE-QA-Pro.
Our approach narrows the performance gap to state-of-the-art proprietary models.
Abstract
Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Topic Modeling · Software Engineering Techniques and Practices
