TL;DR
SlopCodeBench is a new benchmark that measures how coding agents' solutions degrade over long iterative tasks, revealing persistent issues like increased complexity and redundancy.
Contribution
The paper introduces SlopCodeBench, a benchmark whose evolving specifications make it possible to measure structural erosion and verbosity in agent-generated code as agents repeatedly extend their own solutions.
Findings
No agent fully solves any problem end-to-end.
The best agent passes only 14.8% of checkpoints.
Code quality degrades across checkpoints, showing up as structural erosion (concentrated complexity) and verbosity (redundant code).
Abstract
Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with…
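To make the two degradation measures more concrete, here is a minimal Python sketch of how "concentrated complexity" and "redundant code" could be approximated for a single source file. These are illustrative proxies only, not the metrics defined in the paper: the function names (`branch_counts`, `complexity_concentration`, `duplicate_line_ratio`), the choice of branching nodes, and the verbatim-line notion of redundancy are all assumptions made for this example.

```python
import ast
from collections import Counter

# Assumption: these node types stand in for "decision points" in a function.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)


def branch_counts(source: str) -> dict:
    """Map each function name to 1 + the number of branching nodes inside it.

    Nested functions are counted toward their parent as well, and functions
    sharing a name overwrite each other; acceptable for a rough sketch.
    """
    counts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            counts[node.name] = 1 + branches
    return counts


def complexity_concentration(source: str) -> float:
    """Share of total complexity held by the single most complex function."""
    counts = branch_counts(source)
    total = sum(counts.values())
    return max(counts.values()) / total if total else 0.0


def duplicate_line_ratio(source: str) -> float:
    """Fraction of non-blank lines that repeat an earlier line verbatim."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    repeats = sum(count - 1 for count in Counter(lines).values())
    return repeats / len(lines)


# Hypothetical input standing in for an agent's solution at one checkpoint.
SAMPLE = """
def a(x):
    if x:
        return 1
    return 0

def b(x):
    return x
"""

if __name__ == "__main__":
    print(branch_counts(SAMPLE))
    print(complexity_concentration(SAMPLE))
    print(duplicate_line_ratio(SAMPLE))
```

Computing such proxies on the solution after each checkpoint and comparing the values over time would give a rough picture of the trend the abstract describes; the benchmark's own definitions should be taken from the paper itself.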
