SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Songcheng Cai; Zhiheng Lyu; Yuansheng Ni; Xiangchao Chen; Baichuan Zhou; Shenzhe Zhu; Yi Lu; Haozhe Wang; Chi Ruan; Benjamin Schneider; Weixu Zhang; Xiang Li; Andy Zheng; Yuyu Zhang; Ping Nie; Wenhu Chen

arXiv:2603.16124·cs.SE·March 18, 2026

TL;DR

This paper introduces SWE-QA-Pro, a comprehensive benchmark for repository-level code understanding, and a scalable training pipeline that enhances small models' ability to perform complex agentic tasks, bridging the gap to larger models.

Contribution

We created a diverse, balanced benchmark for repository-level code understanding and developed a scalable training recipe combining supervised fine-tuning and reinforcement learning to improve model capabilities.

Findings

01

Agentic workflows outperform direct-answer baselines by about 13 points (reported for Claude Sonnet 4.5).

02

Qwen3-8B trained with our method surpasses GPT-4o by 2.3 points on SWE-QA-Pro.

03

Our approach narrows the performance gap to state-of-the-art proprietary models.

Abstract

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook long-tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types, and we apply a rigorous difficulty calibration process in which questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training…
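
The difficulty calibration step mentioned in the abstract (dropping questions that a direct-answer baseline can already solve) might look roughly like the sketch below. This is an illustration only, not the authors' pipeline: the `Question` record, its `direct_score` field, the function name, and the 0.8 threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    qid: str
    # Hypothetical field: graded score of an answer produced without any
    # repository exploration, normalized to [0, 1].
    direct_score: float


def filter_direct_solvable(questions, solved_threshold=0.8):
    """Keep only questions the direct-answer baseline fails to solve.

    `solved_threshold` is an illustrative assumption, not a value from the paper.
    """
    return [q for q in questions if q.direct_score < solved_threshold]


candidates = [
    Question("q1", 0.95),  # answered well without tools, so it is dropped
    Question("q2", 0.40),
    Question("q3", 0.10),
]
kept = filter_direct_solvable(candidates)
print([q.qid for q in kept])  # prints ['q2', 'q3']
```

Under these assumptions, only questions that plausibly require codebase exploration remain, which is consistent with the gap between agentic and direct answering that the abstract reports.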

Code & Models

Datasets

TIGER-Lab/SWE-QA-Pro-Bench
dataset · 52 downloads

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Software Engineering Techniques and Practices