Does SWE-Bench-Verified Test Agent Ability or Model Memory?

Thanosan Prathifkumar; Noble Saji Mathews; Meiyappan Nagappan

arXiv:2512.10218 · cs.SE · December 23, 2025

Open Access

TL;DR

This paper investigates whether SWE-Bench-Verified scores measure a model's issue-solving ability or are inflated by overlap with training data, highlighting the need for contamination-aware benchmarks.

Contribution

It shows that two Claude models find relevant files far more accurately on SWE-Bench-Verified than on BeetleBox and SWE-rebench. The gap suggests training data overlap and calls into question whether the benchmark assesses genuine problem-solving skill.

Findings

01

The two Claude models tested performed 3 times better on SWE-Bench-Verified than on BeetleBox and SWE-rebench.

02

The models were 6 times better at finding edited files on SWE-Bench-Verified, without any additional context about the projects.

03

SWE-Bench-Verified scores may reflect training recall rather than issue-solving skill.

Abstract

SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If that is true, scores may reflect training recall, not issue-solving skill. To study this, we test two Claude models that frequently appear in top-performing agents submitted to the benchmark. We ask them to find relevant files using only issue text, and then issue text plus file paths. We then run the same setup on BeetleBox and SWE-rebench. Despite both benchmarks involving popular open-source Python projects, models performed 3 times better on SWE-Bench-Verified. They were also 6 times better at finding edited files, without any additional context about the projects themselves. This gap suggests the models may have seen many SWE-Bench-Verified tasks…
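The comparison described in the abstract can be pictured with a minimal sketch. It is not the authors' code: the prompt wording, the `Task` structure, the "at least one edited file named" hit rule, and the toy file paths are all assumptions made for illustration, and the numbers it prints are toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


def build_prompt(issue_text: str, file_paths: Optional[Sequence[str]] = None) -> str:
    """Build one of the two prompt conditions (the wording is hypothetical)."""
    prompt = "Which files must be edited to resolve this issue?\n\n" + issue_text
    if file_paths is not None:
        # Second condition: issue text plus the repository's file paths.
        prompt += "\n\nRepository files:\n" + "\n".join(file_paths)
    return prompt


@dataclass(frozen=True)
class Task:
    """One benchmark issue (hypothetical structure, not the paper's schema)."""
    issue_id: str
    gold_files: FrozenSet[str]       # files edited by the reference patch
    predicted_files: FrozenSet[str]  # files the model named in its answer


def is_hit(task: Task) -> bool:
    """Assumed hit rule: the model names at least one edited file."""
    return bool(task.gold_files & task.predicted_files)


def localization_accuracy(tasks: List[Task]) -> float:
    """Fraction of tasks on which the model found an edited file."""
    if not tasks:
        raise ValueError("need at least one task")
    return sum(is_hit(t) for t in tasks) / len(tasks)


def accuracy_ratio(bench_a: List[Task], bench_b: List[Task]) -> float:
    """How many times more accurate the model is on bench_a than on bench_b."""
    acc_b = localization_accuracy(bench_b)
    if acc_b == 0:
        raise ZeroDivisionError("accuracy on bench_b is zero; ratio undefined")
    return localization_accuracy(bench_a) / acc_b


# Toy data with made-up paths: 1 of 2 hits versus 1 of 4 hits.
toy_a = [
    Task("a1", frozenset({"pkg/core.py"}), frozenset({"pkg/core.py"})),
    Task("a2", frozenset({"pkg/io.py"}), frozenset({"pkg/util.py"})),
]
toy_b = [
    Task("b1", frozenset({"lib/x.py"}), frozenset({"lib/x.py", "lib/y.py"})),
    Task("b2", frozenset({"lib/y.py"}), frozenset({"lib/z.py"})),
    Task("b3", frozenset({"lib/z.py"}), frozenset()),
    Task("b4", frozenset({"lib/w.py"}), frozenset({"lib/x.py"})),
]

print(accuracy_ratio(toy_a, toy_b))  # 2.0
```

A large ratio between benchmarks in the issue-text-only condition is the kind of signal the abstract points to: without file paths in the prompt, correct file names are difficult to produce unless the model already knows the repository.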

Taxonomy

Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques