Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits
Haichao Zhu, Qian Zhang, Jiyuan Wang, Zhaorui Yang, Yuxin Qiu

TL;DR
This paper introduces NITR, a benchmark for evaluating the maintainability of AI-generated repository edits, and shows that current AI coding systems struggle with architectural quality and structural constraints.
Contribution
The paper presents NITR, a diagnostic framework that assesses AI code edits for maintainability, highlighting significant gaps in current AI coding systems' ability to produce structurally sound code.
Findings
AI systems solve only 36.2% of maintainability probes on average.
Performance drops significantly on multi-step cases, down to 20.6%.
Edits can fail the structural oracles even when the functional tests pass, so passing tests alone does not show that an edit is maintainable.
Abstract
AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and…
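The probe-and-oracle idea in the abstract can be illustrated with a minimal Python sketch. It is not NITR's actual harness: the names `Verdict`, `forbidden_import_oracle`, and `evaluate`, as well as the "forbidden import" constraint itself, are hypothetical and chosen only to show how a functional test and a structural oracle might be combined so that an edit counts as solved only when both pass.

```python
import ast
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    functional_ok: bool   # did the required behavior work?
    structural_ok: bool   # did the edit respect the structural constraint?
    diagnosis: str        # human-readable explanation from the oracle

    @property
    def solved(self) -> bool:
        # An edit counts as solved only if both checks pass.
        return self.functional_ok and self.structural_ok


def forbidden_import_oracle(source: str, forbidden: str) -> tuple[bool, str]:
    """Hypothetical structural oracle: fail if `source` imports `forbidden`."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            if name == forbidden or name.startswith(forbidden + "."):
                return False, f"line {node.lineno}: imports '{name}'"
    return True, "no forbidden imports"


def evaluate(source: str,
             functional_test: Callable[[dict], bool],
             forbidden: str) -> Verdict:
    """Run the functional test and the structural oracle on one edit."""
    namespace: dict = {}
    try:
        exec(compile(source, "<edit>", "exec"), namespace)
        functional_ok = functional_test(namespace)
    except Exception:
        functional_ok = False
    structural_ok, diagnosis = forbidden_import_oracle(source, forbidden)
    return Verdict(functional_ok, structural_ok, diagnosis)


# An edit that behaves correctly but violates the (assumed) constraint.
EDIT = '''
import json

def total(prices):
    return sum(prices)
'''

verdict = evaluate(EDIT, lambda ns: ns["total"]([1, 2, 3]) == 6, forbidden="json")
print(verdict.functional_ok, verdict.structural_ok, verdict.diagnosis)
```

In this sketch the edit passes its functional test yet is not counted as solved, because the oracle reports the offending import along with its line number. This mirrors the kind of failure the findings describe, where structural problems remain even though behavior is correct.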
