When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents

Lu Yan; Xuan Chen; Xiangyu Zhang

arXiv:2603.17104 · cs.SE · March 19, 2026

Open Access

TL;DR

This paper introduces SLUMP, a benchmark for evaluating how well long-horizon coding agents maintain faithfulness when the task specification emerges gradually through interaction, highlighting challenges and potential mitigation strategies.

Contribution

It presents a novel benchmark and evaluation methodology for faithfulness loss in emergent specification settings, along with a case study of a mitigation approach called ProjectGuard.

Findings

01. Structural integration degrades under emergent specification.

02. Semantic faithfulness loss varies between models, being substantial on Claude Code.

03. ProjectGuard significantly recovers faithfulness and reduces failures.

Abstract

Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulnesS Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure…
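
The SLUMP definition above (faithfulness under the single-shot control minus faithfulness under emergent specification) can be illustrated with a minimal Python sketch. Everything beyond that subtraction is an assumption made for illustration: the abstract does not say how the five rubric levels are encoded or aggregated, so the sketch encodes them as integers 0 to 4 and averages them, and the function names and sample scores are hypothetical rather than taken from the benchmark.

```python
from statistics import mean

MAX_LEVEL = 4  # assumption: the five rubric levels are encoded as 0..4


def faithfulness(levels):
    """Mean rubric level over components, normalized to [0, 1] (assumed aggregation)."""
    if not levels:
        raise ValueError("need at least one component score")
    if any(not 0 <= lv <= MAX_LEVEL for lv in levels):
        raise ValueError("rubric level out of range")
    return mean(levels) / MAX_LEVEL


def slump(single_shot_levels, emergent_levels):
    """Faithfulness lost under emergent specification relative to the single-shot control."""
    return faithfulness(single_shot_levels) - faithfulness(emergent_levels)


# Hypothetical scores for four components of one paper (not benchmark data).
control = [4, 4, 3, 4]
emergent = [4, 2, 3, 1]
loss = slump(control, emergent)
print(f"SLUMP = {loss:.4f}")
```

Under these assumptions a positive value means the agent built a less faithful repository when the design was disclosed gradually, and a value near zero means the interaction format cost little.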


Taxonomy

Topics: Scientific Computing and Data Management · Software Engineering Research · Machine Learning in Materials Science