SWE-ABS: Adversarial Benchmark Strengthening Exposes Inflated Success Rates on Test-based Benchmark

Boxi Yu, Yang Cao, Yuzhong Zhang, Liting Lin, Junjielong Xu, Zhiqing Zhong, Qinghua Xu, Guancheng Wang, Jialun Cao, Shing-Chi Cheung, Pinjia He, and Lionel Briand

arXiv:2603.00520 · cs.SE · March 3, 2026

TL;DR

This paper introduces SWE-ABS, an adversarial framework that strengthens the test suites of code-patching benchmarks such as SWE-Bench Verified. The stronger tests reveal inflated success rates (the top agent's score falls from 78.80% to 62.20%) and reshuffle leaderboard standings.

Contribution

SWE-ABS is a novel adversarial testing framework that strengthens test suites by targeting untested code regions and synthesizing incorrect patches, exposing semantic blind spots.
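To make "targeting untested code regions" concrete, here is a minimal sketch that finds the lines of a patched function that the existing tests never execute. It is an illustration only: it uses plain line coverage via `sys.settrace` rather than the program slicing the paper describes, and `patched_clamp`, `executed_lines`, and `uncovered_lines` are hypothetical names, not part of SWE-ABS.

```python
import dis
import sys


def patched_clamp(value, low, high):
    # Hypothetical patched function whose tests we want to strengthen.
    if value < low:
        return low
    if value > high:
        return high
    return value


def executed_lines(func, *args):
    # Run func(*args) and collect the line numbers executed inside func.
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other frames
        if event == "line":
            seen.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)
    return seen


def uncovered_lines(func, test_inputs):
    # Lines of func that carry bytecode but are hit by none of the inputs.
    code = func.__code__
    all_lines = {line for _, line in dis.findlinestarts(code) if line is not None}
    all_lines.discard(code.co_firstlineno)  # the `def` line itself is not a body line
    covered = set()
    for args in test_inputs:
        covered |= executed_lines(func, *args)
    return all_lines - covered


existing_inputs = [(5, 0, 10)]  # assumed original test: in-range value only
gaps_before = uncovered_lines(patched_clamp, existing_inputs)

augmented_inputs = existing_inputs + [(-1, 0, 10), (11, 0, 10)]
gaps_after = uncovered_lines(patched_clamp, augmented_inputs)
```

With the single in-range input, both early `return` branches are uncovered; adding one input below and one above the range covers them. A real pipeline would generate the new tests automatically and would need assertions on the results, not just execution.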

Findings

1. SWE-ABS strengthened the test suites of 50.2% of the 500 SWE-Bench Verified instances.

2. The strengthened tests rejected 19.71% of previously passing patches.

3. The top agent's score dropped from 78.80% to 62.20%.
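The headline figures can be turned into a few derived quantities. The short calculation below uses only numbers quoted on this page. The last value is an inference, not a reported result: it assumes the "25.1x improvement" in the abstract is a ratio of strengthened-instance percentages.

```python
# Figures quoted in the Findings and Abstract.
TOTAL_INSTANCES = 500
STRENGTHENED_PCT = 50.2
SCORE_BEFORE_PCT = 78.80
SCORE_AFTER_PCT = 62.20
IMPROVEMENT_FACTOR = 25.1

# Number of instances whose test suite was strengthened.
strengthened_instances = round(TOTAL_INSTANCES * STRENGTHENED_PCT / 100)

# Drop in the top agent's score, in percentage points and relative terms.
absolute_drop_points = round(SCORE_BEFORE_PCT - SCORE_AFTER_PCT, 2)
relative_drop_pct = round(100 * absolute_drop_points / SCORE_BEFORE_PCT, 2)

# ASSUMPTION: treats 25.1x as a ratio of strengthened-instance percentages.
implied_prior_pct = round(STRENGTHENED_PCT / IMPROVEMENT_FACTOR, 2)

print(strengthened_instances, absolute_drop_points, relative_drop_pct, implied_prior_pct)
```

This gives 251 strengthened instances and a drop of 16.6 percentage points, which is about 21.07% of the top agent's original score. Note that the 19.71% rejection rate is a separate figure: per the abstract it is measured over previously passing patches from the top-30 agents, not over the top agent alone.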

Abstract

The SWE-Bench Verified leaderboard is approaching saturation, with the top system achieving 78.80%. However, we show that this performance is inflated. Our re-evaluation reveals that one in five "solved" patches from the top-30 agents are semantically incorrect, passing only because weak test suites fail to expose their errors. We present SWE-ABS, an adversarial framework that strengthens test suites through a two-stage pipeline: (1) coverage-driven augmentation using program slicing to target untested code regions, and (2) mutation-driven adversarial testing that synthesizes plausible but incorrect patches to expose semantic blind spots. On SWE-Bench Verified (500 instances), SWE-ABS strengthens 50.2% of instances, a 25.1x improvement over prior work, and rejects 19.71% of previously passing patches. As a result, the top agent's score decreases from 78.80% to 62.20%, leading to…
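The second stage, mutation-driven adversarial testing, can be illustrated with a toy example: a plausible but incorrect patch passes a weak test suite, and the suite is then extended with an input on which that patch disagrees with a reference fix. This is a sketch under assumptions, not the paper's implementation. The mutant is hand-written here (the abstract does not say how SWE-ABS synthesizes its patches), and `reference_patch`, `plausible_wrong_patch`, `passes`, and `strengthen` are hypothetical names.

```python
def reference_patch(value, low, high):
    # Stand-in for the ground-truth fix: clamp value into [low, high].
    return max(low, min(value, high))


def plausible_wrong_patch(value, low, high):
    # Hand-written mutant: enforces the lower bound but forgets the upper one.
    return max(low, value)


def passes(patch, tests):
    # A patch passes if it returns the expected result on every test case.
    return all(patch(*args) == expected for args, expected in tests)


def strengthen(tests, reference, mutants, candidate_inputs):
    # For each mutant that survives the current suite, add the first candidate
    # input on which it disagrees with the reference, using the reference
    # output as the expected value.
    strengthened = list(tests)
    for mutant in mutants:
        if not passes(mutant, strengthened):
            continue  # already rejected by the current suite
        for args in candidate_inputs:
            expected = reference(*args)
            if mutant(*args) != expected:
                strengthened.append((args, expected))
                break
    return strengthened


# Weak original suite: never probes the upper bound.
original_tests = [((5, 0, 10), 5), ((-3, 0, 10), 0)]
candidate_inputs = [(5, 0, 10), (0, 0, 10), (42, 0, 10)]

strengthened_tests = strengthen(
    original_tests, reference_patch, [plausible_wrong_patch], candidate_inputs
)

print(passes(plausible_wrong_patch, original_tests))      # True: the weak suite accepts the mutant
print(passes(plausible_wrong_patch, strengthened_tests))  # False: the strengthened suite rejects it
print(passes(reference_patch, strengthened_tests))        # True: the correct fix still passes
```

The key property is that the added test separates the incorrect patch from the correct one without rejecting the reference fix. In this toy setting the reference output serves as the test oracle.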

Taxonomy

Topics: Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning · Security and Verification in Computing