A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task
Shuhan Liu, Zhiyi Zhao, Xing Hu, Kui Liu, Xiaohu Yang, Xin Xia

TL;DR
This paper introduces RACE-bench, a new benchmark with intermediate reasoning annotations for evaluating repository-level code agents on feature addition tasks, revealing insights into their reasoning capabilities beyond final correctness.
Contribution
The paper presents RACE-bench, a reasoning-augmented benchmark with a dual-track evaluation framework for assessing code agents' reasoning and patch correctness.
Findings
Agents perform well at understanding high-level intent but struggle with implementation steps.
Reasoning recall drops by 35.7% in cases where the patch applies successfully but the tests fail.
Test failures are associated with 94.1% higher over-prediction.
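The findings above refer to reasoning recall and over-prediction without defining them in this summary. The sketch below shows one plausible set-based reading, applied to file localization. It is an assumption, not the paper's definition: the function names `recall` and `over_prediction`, the formulas, and the example file paths are all hypothetical.

```python
from typing import Iterable


def recall(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of ground-truth items that the agent also predicted.

    Assumed definition: |predicted ∩ gold| / |gold|.
    """
    gold_set = set(gold)
    if not gold_set:
        # Nothing to recover, so treat recall as perfect by convention.
        return 1.0
    return len(set(predicted) & gold_set) / len(gold_set)


def over_prediction(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Fraction of predicted items that are not in the ground truth.

    Assumed definition: |predicted - gold| / |predicted|.
    """
    pred_set = set(predicted)
    if not pred_set:
        # No predictions means no spurious predictions.
        return 0.0
    return len(pred_set - set(gold)) / len(pred_set)


# Hypothetical instance: files the reference patch touches vs. files the agent named.
gold_files = ["src/api.py", "src/models.py"]
predicted_files = ["src/api.py", "src/utils.py", "docs/index.md"]

file_recall = recall(predicted_files, gold_files)
file_over = over_prediction(predicted_files, gold_files)
print(f"recall={file_recall:.2f} over_prediction={file_over:.2f}")
```

Under this reading, an agent that names many extra files can keep recall high while its over-prediction rises, which is why the two quantities are worth reporting separately.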
Abstract
Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures…
