When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents
Lu Yan, Xuan Chen, Xiangyu Zhang

TL;DR
This paper introduces a benchmark for SLUMP, the loss of implementation faithfulness that long-horizon coding agents suffer when the task specification emerges gradually through interaction rather than being given upfront, and examines a mitigation approach.
Contribution
It presents a novel benchmark and evaluation methodology for faithfulness loss in emergent specification settings, along with a case study of a mitigation approach called ProjectGuard.
Findings
Structural integration degrades under emergent specification.
Semantic faithfulness loss varies across models and is substantial on Claude Code.
ProjectGuard significantly recovers faithfulness and reduces failures.
Abstract
Current coding-agent benchmarks usually provide the full task specification upfront. Real research coding often does not: the intended system is progressively disclosed through interaction, requiring the agent to track durable design commitments across a long session. We introduce a benchmark for this setting and study faithfulneSs Loss Under eMergent sPecification (SLUMP), defined as the reduction in final implementation faithfulness under emergent specification relative to a single-shot specification control. The benchmark contains 20 recent ML papers (10 ICML 2025, 10 NeurIPS 2025), 371 atomic verifiable components, and interaction scripts of approximately 60 coding requests that progressively disclose the target design without revealing the paper itself. Final repositories are scored with a five-level component-faithfulness rubric and accompanied by an exposure…
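The SLUMP definition above (faithfulness under a single-shot control minus faithfulness under emergent specification) can be sketched as a simple difference of aggregate scores. This is a minimal illustration, not the paper's scoring code: the 0..4 encoding of the five rubric levels, the mean-based aggregation, the function names `faithfulness` and `slump`, and the example scores are all assumptions made here for clarity.

```python
from statistics import mean

# Assumption: the five rubric levels are encoded as integers 0..4.
# The paper's actual level definitions and weighting are not given here.
MAX_LEVEL = 4


def faithfulness(levels):
    """Mean component score for one repository, normalized to [0, 1]."""
    if not levels:
        raise ValueError("need at least one scored component")
    if any(not 0 <= lv <= MAX_LEVEL for lv in levels):
        raise ValueError("rubric level out of range")
    return mean(levels) / MAX_LEVEL


def slump(single_shot_levels, emergent_levels):
    """Faithfulness lost under emergent specification vs. the single-shot control.

    Positive values mean the emergent-specification run was less faithful.
    """
    return faithfulness(single_shot_levels) - faithfulness(emergent_levels)


# Hypothetical scores for the same four components under both conditions.
single_shot = [4, 4, 3, 4]
emergent = [4, 2, 3, 1]
print(slump(single_shot, emergent))
```

A per-paper value computed this way could then be averaged across the benchmark's papers, though how the authors aggregate across papers and components is not stated in this excerpt.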
Taxonomy
Topics: Scientific Computing and Data Management · Software Engineering Research · Machine Learning in Materials Science
