Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites

Chenglin Li; Yisen Xu; Zehao Wang; Shin Hwei Tan; Tse-Hsun (Peter) Chen

arXiv:2604.01518·cs.SE·April 3, 2026

TL;DR

This paper introduces STING, a framework that strengthens regression test suites with semantically altered program variants so that automated patching tools are evaluated more reliably. It reveals that many patches that previously passed did so by exploiting weak tests.

Contribution

STING systematically uncovers and repairs weaknesses in benchmark test suites by generating targeted, behavior-preserving tests that reveal under-constrained patches.

Findings

01. 77% of benchmark instances had surviving variants.

02. STING generated 1,014 validated tests.

03. Strengthened suites lowered patch success rates by 4.2%-9.0%.

Abstract

Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests. In practice, however, insufficiently strong test suites can admit plausible yet semantically incorrect patches, inflating reported success rates. We introduce STING, a framework for targeted test augmentation that uses semantically altered program variants as diagnostic stressors to uncover and repair weaknesses in benchmark regression suites. Variants of the ground-truth patch that still pass the existing tests reveal under-constrained behaviors; these gaps then guide the generation of focused regression tests. A generated test is retained only if it (i) passes on the ground-truth patch, (ii) fails on at least one variant that survived the original…
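The retention rule in the abstract can be illustrated with a toy sketch. This is not STING's implementation: programs are modeled as plain Python functions, tests as predicates over a program, and every name here (`surviving_variants`, `keep_test`, the `abs`-based example) is hypothetical. The abstract's list of conditions is truncated in this listing, so only conditions (i) and (ii) are modeled.

```python
from typing import Callable, Dict, List

# Toy model (assumption): a "program" is a function, a "test" is a predicate on it.
Program = Callable[[int], int]
Test = Callable[[Program], bool]


def surviving_variants(variants: Dict[str, Program], suite: List[Test]) -> List[str]:
    """Return names of variants that pass every test in the existing suite."""
    return [name for name, prog in variants.items() if all(t(prog) for t in suite)]


def keep_test(candidate: Test, ground_truth: Program,
              variants: Dict[str, Program], survivors: List[str]) -> bool:
    """Apply only the two conditions visible in the abstract."""
    passes_truth = candidate(ground_truth)  # condition (i)
    kills_survivor = any(not candidate(variants[n]) for n in survivors)  # condition (ii)
    return passes_truth and kills_survivor


# Hypothetical ground-truth patch: absolute value.
def ground_truth(x: int) -> int:
    return abs(x)


# Hypothetical semantically altered variants of the ground truth.
variants: Dict[str, Program] = {
    "identity": lambda x: x,   # wrong for negative inputs
    "negate": lambda x: -x,    # wrong for positive inputs
}

# A weak original suite that never exercises negative inputs.
original_suite: List[Test] = [lambda p: p(3) == 3, lambda p: p(0) == 0]

survivors = surviving_variants(variants, original_suite)

# Candidate generated tests.
strong: Test = lambda p: p(-2) == 2     # passes ground truth, fails "identity"
redundant: Test = lambda p: p(5) == 5   # passes ground truth, kills no survivor
invalid: Test = lambda p: p(-2) == -2   # fails on the ground truth

kept = [name for name, t in [("strong", strong), ("redundant", redundant), ("invalid", invalid)]
        if keep_test(t, ground_truth, variants, survivors)]
```

In this toy setup the `identity` variant survives the original suite because no test uses a negative input, and only the candidate that probes that gap is retained.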
