Does SWE-Bench-Verified Test Agent Ability or Model Memory?
Thanosan Prathifkumar, Noble Saji Mathews, Meiyappan Nagappan

TL;DR
This paper investigates whether SWE-Bench-Verified measures genuine issue-solving ability or whether scores are inflated by overlap with model training data, and it highlights the need for contamination-aware benchmarks.
Contribution
It shows that two Claude models perform markedly better on SWE-Bench-Verified than on BeetleBox and SWE-rebench, a gap that likely stems from training data overlap and calls into question the benchmark's validity for assessing genuine problem-solving skill.
Findings
The two Claude models tested performed 3 times better on SWE-Bench-Verified than on BeetleBox and SWE-rebench.
They were also 6 times better at finding the edited files, without any additional context about the projects.
Scores may reflect training recall rather than actual problem-solving ability.
Abstract
SWE-Bench-Verified, a dataset comprising 500 issues, serves as a de facto benchmark for evaluating various large language models (LLMs) on their ability to resolve GitHub issues. But this benchmark may overlap with model training data. If that is true, scores may reflect training recall, not issue-solving skill. To study this, we test two Claude models that frequently appear in top-performing agents submitted to the benchmark. We ask them to find relevant files using only issue text, and then issue text plus file paths. We then run the same setup on BeetleBox and SWE-rebench. Despite both benchmarks involving popular open-source Python projects, models performed 3 times better on SWE-Bench-Verified. They were also 6 times better at finding edited files, without any additional context about the projects themselves. This gap suggests the models may have seen many SWE-Bench-Verified tasks…
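The comparison described in the abstract can be pictured as a per-benchmark file-localization accuracy followed by a ratio. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the `Prediction` record, the "hit if any predicted path matches an edited file" rule, and the benchmark labels are all hypothetical, and the toy records are made up to exercise the functions rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
import posixpath


@dataclass
class Prediction:
    """One model answer for one issue (hypothetical record layout)."""
    benchmark: str
    predicted_files: list[str]  # paths the model named from the issue text
    edited_files: list[str]     # paths changed by the reference fix


def normalize(path: str) -> str:
    # Collapse "./" prefixes and duplicate slashes so equal paths compare equal.
    return posixpath.normpath(path.strip())


def is_hit(pred: Prediction) -> bool:
    # Assumed scoring rule: a hit if at least one predicted path was edited.
    predicted = {normalize(p) for p in pred.predicted_files}
    edited = {normalize(p) for p in pred.edited_files}
    return bool(predicted & edited)


def accuracy_by_benchmark(preds: list[Prediction]) -> dict[str, float]:
    # Fraction of issues per benchmark with at least one correct file.
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for pred in preds:
        totals[pred.benchmark] += 1
        hits[pred.benchmark] += int(is_hit(pred))
    return {name: hits[name] / totals[name] for name in totals}


# Toy records, invented for illustration only (not results from the paper).
toy = [
    Prediction("benchmark_a", ["./pkg/core.py"], ["pkg/core.py"]),
    Prediction("benchmark_a", ["pkg/io.py", "pkg/util.py"], ["pkg/util.py"]),
    Prediction("benchmark_b", ["lib/main.py"], ["lib/main.py"]),
    Prediction("benchmark_b", ["lib/cli.py"], ["lib/parser.py"]),
]

scores = accuracy_by_benchmark(toy)
ratio = scores["benchmark_a"] / scores["benchmark_b"]
print(scores, ratio)
```

A large ratio between benchmarks built from comparable projects is the kind of gap the abstract treats as a sign of possible training data overlap. A stricter rule, such as requiring every edited file to be predicted, would change the numbers but not the structure of the comparison.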
Taxonomy
Topics: Topic Modeling · Software Engineering Research · Natural Language Processing Techniques
