Needle in the Repo: A Benchmark for Maintainability in AI-Generated Repository Edits

Haichao Zhu; Qian Zhang; Jiyuan Wang; Zhaorui Yang; Yuxin Qiu

arXiv:2603.27745·cs.SE·March 31, 2026

TL;DR

This paper introduces Needle in the Repo (NITR), a benchmark framework for evaluating the maintainability of AI-generated repository edits. Its results show that current AI coding systems struggle with architectural quality and structural constraints.

Contribution

The paper presents NITR, a diagnostic probe-and-oracle framework that assesses whether behaviorally correct AI code edits preserve maintainable structure. It exposes significant gaps in current AI coding systems' ability to produce structurally sound code.

Findings

01 AI systems solve only 36.2% of maintainability probes on average.

02 On multi-step cases, performance drops to 20.6%.

03 Structural failures occur even when functional tests pass, so behaviorally correct edits can still carry maintainability problems.

Abstract

AI coding agents can now complete complex programming tasks, but existing evaluations largely emphasize behavioral correctness and often overlook maintainability risks such as weak modularity or testability. We present Needle in the Repo (NITR), a diagnostic probe-and-oracle framework for evaluating whether behaviorally correct repository edits preserve maintainable structure. NITR distills recurring software engineering wisdom into controlled probes embedded in small, realistic multi-file codebases, each designed so that success depends primarily on one targeted maintainability dimension. Each probe is paired with a hidden evaluation harness that combines functional tests for required behavior with structural oracles that encode the targeted maintainability constraint and return interpretable diagnoses. Using NITR, we evaluate 23 coding configurations across GPT, Claude, Gemini, and…
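
To make the probe-and-oracle idea concrete, here is a minimal sketch of a harness that pairs a functional test with a structural oracle and returns a readable diagnosis. It is an illustration only, not the NITR implementation: the probe (`total_price`), the forbidden-import rule, and every name in the code (`Verdict`, `functional_test`, `structural_oracle`, `evaluate`) are hypothetical and chosen for this example.

```python
import ast
from dataclasses import dataclass

# Hypothetical probe: the edit must implement total_price() without the
# pricing module depending on a layer it should not reach into. The "os"
# module is only a stand-in for such a layer.
FORBIDDEN_IMPORTS = {"os"}

CLEAN_EDIT = """
def total_price(prices):
    return sum(prices)
"""

LAYER_VIOLATION = """
import os

def total_price(prices):
    return sum(prices)
"""


@dataclass
class Verdict:
    functional_ok: bool
    structural_ok: bool
    diagnosis: str

    @property
    def solved(self):
        # A probe counts as solved only when both checks pass.
        return self.functional_ok and self.structural_ok


def functional_test(source):
    """Run the edited module and check the required behavior."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["total_price"]([2, 3]) == 5
    except Exception:
        return False


def structural_oracle(source):
    """Return the forbidden top-level modules that the edit imports."""
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    return imported & FORBIDDEN_IMPORTS


def evaluate(source):
    """Combine the functional test and the structural oracle into one verdict."""
    violations = structural_oracle(source)
    if violations:
        diagnosis = "forbidden import(s): " + ", ".join(sorted(violations))
    else:
        diagnosis = "no structural violation found"
    return Verdict(functional_test(source), not violations, diagnosis)
```

In this sketch, `evaluate(LAYER_VIOLATION)` passes the functional test but fails the structural check, which is the kind of case the abstract highlights: behavior is correct while the targeted maintainability constraint is broken.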
