TL;DR
TEBench is a project-level benchmark for automated test evolution. Given a repository and a code-changing commit, a system must identify the tests that need updating, decide where new tests are needed, and produce the test patch. All evaluated configurations hit similar performance limits.
Contribution
Introduces TEBench, the first project-level benchmark for test evolution, built with a four-stage pipeline over Defects4J projects and comprising 314 task instances from 10 projects with developer-written ground truth.
Findings
All configurations achieve similar identification F1 scores (~46-49%)
Test-Stale detection remains the most challenging task
Generated test modifications are highly executable but often diverge from ground truth
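The identification F1 reported above can be illustrated with a minimal sketch. This page does not say how TEBench aggregates the score, so the set-based, per-instance computation below is an assumption, and the function name and test identifiers are hypothetical.

```python
from typing import Iterable


def identification_f1(predicted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """F1 between the tests a system flags and the tests developers actually changed.

    Assumption: tests are compared as exact-match identifiers within one task
    instance; the benchmark's real matching and aggregation rules may differ.
    """
    predicted_set, truth_set = set(predicted), set(ground_truth)
    true_positives = len(predicted_set & truth_set)
    if true_positives == 0:
        # Covers empty predictions, empty ground truth, and zero overlap.
        return 0.0
    precision = true_positives / len(predicted_set)
    recall = true_positives / len(truth_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of 3 flagged tests are correct, 2 of 4 true tests are found.
flagged = ["FooTest.testA", "FooTest.testB", "BarTest.testC"]
changed = ["FooTest.testA", "BarTest.testC", "BazTest.testD", "BazTest.testE"]
score = identification_f1(flagged, changed)  # precision 2/3, recall 1/2
```

Under this reading, a score in the 46-49% range means a system both misses a large share of the affected tests and flags tests that developers left untouched.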
Abstract
As production code evolves, the test suite must co-evolve to remain effective. Existing benchmarks for test evolution operate at method-level granularity with pre-paired inputs, bypassing the task of locating affected tests from the full project and excluding the need for new tests entirely. We present TEBench, the first project-level benchmark for test evolution. Given a project repository and a code-changing commit, TEBench requires systems to autonomously identify tests requiring modification, determine where new tests are needed, and produce the corresponding test patch. We construct TEBench through a four-stage pipeline over Defects4J projects, curating 314 task instances from 10 projects with developer-written ground truth. Each instance is annotated with one or more of three evolution types: Test-Breaking (tests that fail), Test-Stale (tests that pass but no longer meaningfully…
