Breaking, Stale, or Missing? Benchmarking Coding Agents on Project-Level Test Evolution

Ye Shang; Quanjun Zhang; Haichuan Hu; Chunrong Fang; Liang Xiao; Zhenyu Chen

arXiv:2605.06125·cs.SE·May 8, 2026

TL;DR

TEBench is a project-level benchmark for automated test evolution: given a repository and a code-changing commit, a system must identify the tests that need updating, decide where new tests are needed, and produce the test patch. The evaluation reveals performance limits shared across models.

Contribution

Introduces TEBench, the first project-level benchmark for test evolution, built with a four-stage construction pipeline over Defects4J projects and annotated with developer-written ground truth.

Findings

01. All configurations achieve similar identification F1 scores (~46-49%).

02. Test-Stale detection remains the most challenging task.

03. Generated test modifications are highly executable but often diverge from ground truth.
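As a rough illustration of what an identification F1 score measures, the sketch below compares the set of tests a system flags against the set the developers actually changed. It assumes the standard set-based definitions of precision, recall, and F1; this summary does not state the paper's exact matching granularity, and the function name and test identifiers here are hypothetical.

```python
def identification_scores(predicted, expected):
    """Set-based precision, recall, and F1 over test identifiers.

    Assumption: a prediction counts as correct only on an exact
    identifier match. TEBench's actual matching rule may differ.
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical identifiers, not taken from TEBench.
predicted = ["FooTest.testA", "FooTest.testB", "BarTest.testC"]
expected = ["FooTest.testA", "BarTest.testD"]

precision, recall, f1 = identification_scores(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With one correct identification out of three predictions and two expected tests, precision is 1/3, recall is 1/2, and F1 is 0.4, which shows how over-flagging and missed tests both pull the score down.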

Abstract

As production code evolves, the test suite must co-evolve to remain effective. Existing benchmarks for test evolution operate at method-level granularity with pre-paired inputs, bypassing the task of locating affected tests from the full project and excluding the need for new tests entirely. We present TEBench, the first project-level benchmark for test evolution. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests requiring modification, determine where new tests are needed, and produce the corresponding test patch. We construct TEBench through a four-stage pipeline over Defects4J projects, curating 314 task instances from 10 projects with developer-written ground truth. Each instance is annotated with one or more of three evolution types: Test-Breaking (tests that fail), Test-Stale (tests that pass but no longer meaningfully…

Code & Models

Repositories

iSEngLab/TEBench (GitHub)