Coding Agents Don't Know When to Act
Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev

TL;DR
This paper evaluates whether current coding agents recognize when not to act on stale bug reports. It finds that they often propose unnecessary changes, which it attributes to an action bias, and it points to a need for better training strategies.
Contribution
The paper introduces FixedBench, a benchmark of 200 human-verified coding tasks in which no code changes are required, to systematically assess whether coding agents know when to abstain from unnecessary modifications.
Findings
State-of-the-art models propose unnecessary code changes in 35-65% of cases.
Explicit instructions to verify issues before patching partially reduce errors but cause new failures.
LLMs tend to act even when inaction would be more appropriate, indicating an action bias.
Abstract
Coding agents are increasingly deployed to autonomously maintain software, including to resolve user-reported issues: a bug report comes in and the agent creates a patch to address it. However, in any real-world deployment, they will encounter stale bug reports about issues that have already been resolved. Agents should recognize this and abstain from modifying the code to avoid accumulating technical debt. To systematically evaluate whether current agents do so, we introduce FixedBench, a code benchmark with 200 human-verified coding tasks in which no code changes are required, testing five recent models across four agent harnesses. We find that even state-of-the-art models fail, proposing undesirable changes (excluding tests and documentation) in 35% to 65% of cases. Explicit instructions to reproduce the issue before patching partially address this issue but introduce a new…
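The metric described in the abstract, the share of tasks where an agent proposes changes other than tests and documentation, can be illustrated with a minimal sketch. The path heuristics in `is_test_or_doc`, the function names, and the use of git-style unified diffs as input are all assumptions made for illustration; the benchmark's actual filtering rules are not given in this summary.

```python
import re
from pathlib import PurePosixPath

# Hypothetical heuristics for "tests and documentation"; the benchmark's
# real rules are not specified in this summary.
TEST_DIR_NAMES = {"test", "tests", "testing"}
DOC_DIR_NAMES = {"doc", "docs"}
DOC_SUFFIXES = {".md", ".rst", ".txt"}

# Matches the per-file header that git emits in a unified diff.
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def is_test_or_doc(path: str) -> bool:
    """Return True if the path looks like a test or documentation file."""
    p = PurePosixPath(path)
    directories = {part.lower() for part in p.parts[:-1]}
    name = p.name.lower()
    if directories & (TEST_DIR_NAMES | DOC_DIR_NAMES):
        return True
    if name.startswith("test_") or name.endswith("_test.py"):
        return True
    return p.suffix.lower() in DOC_SUFFIXES


def changed_files(diff_text: str) -> list[str]:
    """List the files touched by a git-style unified diff."""
    files = []
    for line in diff_text.splitlines():
        match = DIFF_HEADER.match(line)
        if match:
            files.append(match.group(2))
    return files


def abstained(diff_text: str) -> bool:
    """An agent abstained if its patch touches only tests or docs (or nothing)."""
    return all(is_test_or_doc(f) for f in changed_files(diff_text))


def undesirable_change_rate(patches: list[str]) -> float:
    """Fraction of tasks in which the patch modifies non-test, non-doc files."""
    if not patches:
        raise ValueError("need at least one patch")
    failures = sum(not abstained(p) for p in patches)
    return failures / len(patches)
```

Under these assumptions, an empty patch or one that only adds a regression test counts as correct abstention, while any edit to source files counts as an undesirable change.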
