From Laboratory to Real-World Applications: Benchmarking Agentic Code Reasoning at the Repository Level

Jia Li; Yuxin Su; Michael R. Lyu

arXiv:2601.03731·cs.SE·May 5, 2026

TL;DR

This paper introduces RepoReason, a new benchmark for evaluating large language models' ability to reason across complex, real-world code repositories, emphasizing logical consistency and integration.

Contribution

It presents a novel white-box diagnostic benchmark with an execution-driven mutation framework and a fine-grained reasoning metric system for repository-level code understanding.

Findings

01. Frontier models show significant deficits in integration width.

02. The benchmark reveals that integration depth is a key cognitive bottleneck.

03. Granular insights can guide the development of more capable agentic models.

Abstract

As large language models (LLMs) evolve into autonomous agents, evaluating repository-level reasoning, the ability to maintain logical consistency across massive, real-world, interdependent file systems, has become critical. Current benchmarks typically fluctuate between isolated code snippets and black-box evaluations. We present RepoReason, a white-box diagnostic benchmark centered on abductive assertion verification. To eliminate memorization while preserving authentic logical depth, we implement an execution-driven mutation framework that utilizes the environment as a semantic oracle to regenerate ground-truth states. Furthermore, we establish a fine-grained diagnostic system using dynamic program slicing, quantifying reasoning via three orthogonal metrics: $ESV$ (reading load), $MCL$ (simulation depth), and $DFI$ (integration width). Comprehensive evaluations of frontier models…
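The idea of using the execution environment as a semantic oracle can be illustrated with a small sketch. This is not the authors' framework. The function `regenerate_assertion`, the toy `order_total` source, and the mutated inputs are all hypothetical, chosen only to show the general pattern: change the inputs of an existing assertion, execute the real code, and take whatever it returns as the new ground truth, so the expected value cannot have been memorized.

```python
# Hypothetical sketch of execution-driven mutation; not the RepoReason implementation.

# Toy stand-in for repository code under test (assumed for illustration).
SOURCE = '''
def order_total(prices, shipping):
    return sum(prices) + shipping
'''


def regenerate_assertion(source: str, func_name: str, new_args: tuple) -> str:
    """Execute the code on mutated inputs and emit an assertion with the observed result."""
    namespace: dict = {}
    exec(source, namespace)                    # load the code into a fresh environment
    result = namespace[func_name](*new_args)   # the environment acts as the oracle
    args = ", ".join(repr(a) for a in new_args)
    return f"assert {func_name}({args}) == {result!r}"


# An assertion as it might appear in an existing test suite (assumed).
original = "assert order_total([10, 20], 5) == 35"

# Mutate the inputs, then let execution supply the new expected value.
mutated = regenerate_assertion(SOURCE, "order_total", ([3, 4, 5], 2))
print(mutated)  # assert order_total([3, 4, 5], 2) == 14
```

In a benchmark setting, the expected value on the right-hand side would be hidden and the model asked to derive it by reasoning over the code, with the executed result serving as the reference answer.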
