Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites
Chenglin Li, Yisen Xu, Zehao Wang, Shin Hwei Tan, Tse-Hsun (Peter) Chen

TL;DR
This paper introduces STING, a framework that strengthens regression test suites using semantically altered program variants so that automated patching tools are evaluated more rigorously. It reveals that many previously passing patches exploit weak tests.
Contribution
STING systematically uncovers and repairs weaknesses in benchmark test suites by generating targeted, behavior-preserving tests that reveal under-constrained patches.
Findings
77% of benchmark instances had surviving variants
STING generated 1,014 validated tests
Strengthened suites lowered patch success rates by 4.2%-9.0%
Abstract
Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as diagnostic stressors to uncover and repair weaknesses in benchmark regression suites. Variants of the ground-truth patch that still pass the existing tests reveal under-constrained behaviors; these gaps then guide the generation of focused regression tests. A generated test is retained only if it (i) passes on the ground-truth patch, (ii) fails on at least one variant that survived the original…
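The retention rule described in the abstract can be illustrated with a minimal sketch. This is not STING's implementation: the names `surviving_variants` and `retain_test`, and the toy `abs`-style programs, are hypothetical and chosen only for illustration. Because the abstract is truncated here, only conditions (i) and (ii) are modeled; any further condition is left out.

```python
from typing import Callable, List

# Assumption: a "program" is a plain function and a "test" is a predicate
# that returns True when the program passes it. Real suites run in a sandbox.
Program = Callable[[int], int]
Test = Callable[[Program], bool]


def surviving_variants(variants: List[Program], original_tests: List[Test]) -> List[Program]:
    """Variants that still pass every original test expose under-constrained behavior."""
    return [v for v in variants if all(t(v) for t in original_tests)]


def retain_test(candidate: Test, ground_truth: Program, survivors: List[Program]) -> bool:
    """Keep a generated test only if it (i) passes on the ground-truth patch
    and (ii) fails on at least one variant that survived the original suite."""
    passes_ground_truth = candidate(ground_truth)
    kills_a_survivor = any(not candidate(v) for v in survivors)
    return passes_ground_truth and kills_a_survivor


# Toy ground-truth patch and two semantically altered variants (hypothetical).
def ground_truth(x: int) -> int:
    return abs(x)

def variant_identity(x: int) -> int:
    return x  # wrong for negative inputs

def variant_zero(x: int) -> int:
    return 0  # wrong for any nonzero input


# A weak original suite that never exercises negative inputs.
original_tests: List[Test] = [lambda p: p(0) == 0, lambda p: p(3) == 3]
survivors = surviving_variants([variant_identity, variant_zero], original_tests)

candidate_good: Test = lambda p: p(-2) == 2    # passes on truth, kills variant_identity
candidate_weak: Test = lambda p: p(5) == 5     # passes on truth, kills nothing
candidate_wrong: Test = lambda p: p(-2) == -2  # fails on the ground truth

print([v.__name__ for v in survivors])                          # ['variant_identity']
print(retain_test(candidate_good, ground_truth, survivors))     # True
print(retain_test(candidate_weak, ground_truth, survivors))     # False
print(retain_test(candidate_wrong, ground_truth, survivors))    # False
```

In this toy setting the weak suite lets `variant_identity` survive, and only the candidate that probes a negative input is retained, since it is the only one that both agrees with the ground truth and distinguishes it from a surviving variant.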
