The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

TL;DR
This paper critically examines whether large language models' high performance on SWE-Bench is due to genuine problem-solving or memorization, revealing potential overestimations of their capabilities and the need for more robust evaluation benchmarks.
Contribution
It introduces diagnostic tasks to distinguish memorization from reasoning in LLMs and provides empirical evidence questioning the validity of current benchmark results.
Findings
Models' strong SWE-Bench results appear to stem at least partly from memorization rather than genuine problem-solving.
Performance drops significantly on out-of-distribution tasks.
Current benchmarks may overstate LLMs' problem-solving abilities.
Abstract
As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs' software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models' true capabilities. It is crucial to distinguish LLMs' generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks to probe models' underlying knowledge: file path identification from issue descriptions alone, and ground truth function reproduction with only the current file context and issue description. We present…
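To make the first diagnostic task concrete, here is a minimal sketch of how file path identification could be scored: the model sees only an issue description, predicts the file that needs editing, and the prediction is compared against the file touched by the reference fix. This is an illustration under stated assumptions, not the authors' evaluation code. The function names (`normalize_path`, `path_accuracy`), the instance IDs, the file paths, and the exact-match criterion are all hypothetical.

```python
def normalize_path(path: str) -> str:
    """Normalize a repository-relative path so trivial differences
    (leading './', duplicate slashes, backslashes) do not count as misses."""
    parts = path.strip().replace("\\", "/").split("/")
    return "/".join(p for p in parts if p not in ("", "."))


def path_accuracy(predictions: dict[str, str], ground_truth: dict[str, str]) -> float:
    """Fraction of instances whose predicted file path exactly matches the
    path edited by the reference fix. Missing predictions count as misses.
    Exact match is an assumption made for this sketch."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for instance_id, gold_path in ground_truth.items()
        if normalize_path(predictions.get(instance_id, "")) == normalize_path(gold_path)
    )
    return hits / len(ground_truth)


# Hypothetical data: instance IDs and paths are invented for illustration.
ground_truth = {
    "example-001": "src/app/models.py",
    "example-002": "lib/utils.py",
}
predictions = {
    "example-001": "./src/app/models.py",  # same file, different spelling
    "example-002": "lib/helpers.py",       # wrong file
}

accuracy = path_accuracy(predictions, ground_truth)
print(f"file path accuracy: {accuracy:.2f}")
```

Because the model is given no repository contents in this setting, a high score on such a metric would suggest that the paths were recalled rather than derived from the code.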
Taxonomy
Topics: Corporate Insolvency and Governance · Corporate Governance and Law · European and International Contract Law
