From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level
Jia Li, Yuxin Su, Michael R. Lyu

TL;DR
This paper introduces RepoReason, a benchmark for evaluating how well large language models reason across complex, real-world code repositories, with emphasis on maintaining logical consistency across interdependent files.
Contribution
It presents a novel white-box diagnostic benchmark with an execution-driven mutation framework and a fine-grained reasoning metric system for repository-level code understanding.
Findings
Frontier models show significant deficits in integration width.
The benchmark reveals that integration width is a key cognitive bottleneck.
Granular insights can guide the development of more capable agentic models.
Abstract
As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: reading load, simulation depth, and integration width. Comprehensive evaluations of frontier models…
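The execution-driven mutation idea in the abstract can be illustrated with a toy sketch. This is not the authors' framework: the function `apply_discount`, the helpers `make_task` and `check`, and the `<MASK>` placeholder are all hypothetical names chosen for illustration. The sketch only shows the general principle that mutating a test's inputs and re-running the real code yields a fresh ground-truth value, so a memorized answer to the original assertion no longer helps.

```python
import random


def apply_discount(price, percent):
    """Toy stand-in for repository code under test (hypothetical)."""
    return round(price * (100 - percent) / 100, 2)


def make_task(func, original_args, rng):
    """Mutate the inputs, then execute the code to regenerate the expected value."""
    # Shift every integer input by 1..9 so the original answer no longer applies.
    mutated = tuple(a + rng.randint(1, 9) for a in original_args)
    # The execution environment acts as the oracle: run the real code.
    expected = func(*mutated)
    # The model sees the assertion with its right-hand side masked.
    prompt = f"assert {func.__name__}{mutated} == <MASK>"
    return {"prompt": prompt, "args": mutated, "answer": expected}


def check(task, predicted):
    """Score a model's prediction against the regenerated ground truth."""
    return predicted == task["answer"]


rng = random.Random(0)  # seeded so the mutation is reproducible
task = make_task(apply_discount, (200, 10), rng)
print(task["prompt"])
```

A model under evaluation would be asked to fill in `<MASK>`, and `check` would compare its prediction with the value obtained by execution. A real repository-level setup would mutate inputs inside existing test files and run the project's own test environment; that machinery is omitted here.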
