SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Gabriel Orlanski; Devjeet Roy; Alexander Yun; Changho Shin; Alex Gu; Albert Ge; Dyah Adila; Nicholas Roberts; Frederic Sala; Aws Albarghouthi

arXiv:2603.24755·cs.SE·May 11, 2026

TL;DR

SlopCodeBench is a new benchmark that measures how coding agents' solutions degrade as the agents repeatedly extend their own code over long iterative tasks. It finds that quality erodes across iterations through growing complexity and redundancy.

Contribution

The paper introduces a benchmark whose evolving specifications make it possible to measure structural erosion and verbosity in agent-generated code across multiple iterations.

Findings

01. No agent fully solves any problem end-to-end.

02. The best agent passes only 14.8% of checkpoints.

03. Code quality degrades with increased complexity and redundancy.
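
Findings 01 and 02 describe two different measures: the share of individual checkpoints passed, and whether any problem is solved at every one of its checkpoints. The sketch below illustrates that distinction on made-up data. The problem names, the pass/fail values, and the helper names `checkpoint_pass_rate` and `fully_solved` are all hypothetical, and this is not the benchmark's own scoring code.

```python
from typing import Dict, List

def checkpoint_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all checkpoints passed, pooled across problems."""
    total = sum(len(checks) for checks in results.values())
    passed = sum(sum(checks) for checks in results.values())
    return passed / total if total else 0.0

def fully_solved(results: Dict[str, List[bool]]) -> List[str]:
    """Names of problems whose every checkpoint passed (end-to-end solves)."""
    return [name for name, checks in results.items() if checks and all(checks)]

# Hypothetical results: one list of per-checkpoint outcomes per problem.
results = {
    "problem_a": [True, True, False, False],
    "problem_b": [True, False, False],
    "problem_c": [False, False, False],
}

rate = checkpoint_pass_rate(results)   # 3 of 10 checkpoints passed
solved = fully_solved(results)         # no problem passes all its checkpoints
print(f"pass rate: {rate:.1%}, fully solved: {solved}")
```

In this toy data an agent passes some early checkpoints on two problems yet solves none end-to-end, which is the same pattern the two findings report together.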

Abstract

Software development is iterative, yet agentic coding benchmarks hide design issues through their single-shot setup. Recent iterative benchmarks attempt to remedy this but heavily constrain an agent's design decision space, making it impossible to faithfully measure how their decisions shape future extensions. We introduce SlopCodeBench, a benchmark of 36 problems and 196 checkpoints where agents repeatedly extend their own solutions. Unlike prior iterative benchmarks, our evolving specifications demand architectural decisions but leave internal structure to the agent. We measure two forms of degradation: structural erosion (concentrated complexity) and verbosity (redundant code). Evaluating 15 coding agents across open and closed models, we find that no agent fully solves any problem end-to-end, and the best agent passes 14.8% of checkpoints. Quality degrades across checkpoints, with…
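
The abstract characterizes structural erosion as concentrated complexity. As a rough illustration of what "concentrated" could mean, the sketch below counts branch points per function with Python's standard `ast` module and reports the share of the total held by the single most complex function. This proxy, the `BRANCH_NODES` selection, and the helper names are assumptions made for illustration. The listing does not say how the paper actually computes its metrics.

```python
import ast
from typing import Dict

# Node types treated as decision points in this illustrative proxy (an assumption).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp, ast.IfExp)

def function_complexity(source: str) -> Dict[str, int]:
    """Map each function name to 1 + the number of branch nodes inside it.

    Nested functions are also counted toward their enclosing function.
    """
    scores: Dict[str, int] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            scores[node.name] = 1 + branches
    return scores

def top_share(scores: Dict[str, int]) -> float:
    """Share of total complexity held by the single most complex function."""
    total = sum(scores.values())
    return max(scores.values()) / total if total else 0.0

SAMPLE = '''
def small(x):
    return x + 1

def big(x):
    if x > 0:
        for i in range(x):
            if i % 2:
                x += i
    return x
'''

scores = function_complexity(SAMPLE)
share = top_share(scores)
print(scores, f"top share: {share:.0%}")
```

Tracking a number like `share` from one checkpoint to the next would show whether new logic keeps piling into one function rather than being spread across the codebase.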

Code & Models

Repositories

sprocketlab/slop-code-bench (GitHub)
