SWE-Bench+: Enhanced Coding Benchmark for LLMs

Reem Aleithan; Haoran Xue; Mohammad Mahdi Mohajer; Elijah Nnorom; Gias Uddin; Song Wang

arXiv:2410.06992·cs.SE·October 11, 2024·3 cites

TL;DR

This paper provides an empirical analysis of the SWE-bench dataset for evaluating LLMs in software engineering, revealing significant data quality issues such as solution leakage and weak test cases that impact the assessment of model performance.

Contribution

It systematically evaluates the quality of the SWE-bench dataset, identifying critical issues and their effects on LLM performance evaluation in software engineering tasks.

Findings

01. 32.67% of successful patches involve solution leakage.

02. 31.08% of successful patches are suspicious due to weak test cases.

03. Filtering problematic issues reduces the resolution rate from 12.47% to 3.97%.
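To put the third finding in absolute terms, the sketch below converts the two reported resolution rates into approximate issue counts, assuming both rates use the full 2,294-issue SWE-bench set mentioned in the abstract as the denominator. The function names are illustrative, and the counts are back-of-the-envelope estimates derived from the rounded percentages, not figures quoted from the paper.

```python
# Assumption: both resolution rates are measured against all 2,294 issues.
TOTAL_ISSUES = 2294


def implied_count(total: int, rate_percent: float) -> int:
    """Approximate number of resolved issues implied by a percentage."""
    return round(total * rate_percent / 100)


def relative_drop(before_percent: float, after_percent: float) -> float:
    """Relative decrease between two rates, expressed in percent."""
    return (before_percent - after_percent) / before_percent * 100


resolved_before = implied_count(TOTAL_ISSUES, 12.47)
resolved_after = implied_count(TOTAL_ISSUES, 3.97)

print(resolved_before)                        # 286
print(resolved_after)                         # 91
print(round(relative_drop(12.47, 3.97), 1))   # 68.2
```

Under that assumption, roughly two thirds of the originally reported resolutions do not survive filtering. The first two findings use a different denominator (the successful patches that were manually screened), so they cannot be combined with these counts without the paper's own totals.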

Abstract

Large Language Models (LLMs) in Software Engineering (SE) can offer assistance for coding. To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories. Several impressive LLM-based toolkits have recently been developed and evaluated on this dataset. However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we address this gap by presenting an empirical analysis of the SWE-bench dataset. We conducted a manual screening of instances where SWE-Agent + GPT-4 successfully resolved issues by comparing the model-generated patches with the actual pull requests. SWE-Agent + GPT-4 was at the top of the SWE-bench leaderboard at the time of our study. Our analysis reveals some…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The authors provide a complete quality analysis based on manual checks and an empirical study. Moreover, the authors provide high-level insights into the solution leakage and weak test issues, as well as fine-grained, example-based analysis.
2. The authors provide quantitative identification of critical flaws in the existing benchmarks with sufficient statistics.
3. The authors design two automatic LLM tools, SoluLeakDetector and TestEnhancer, which are well-motivated and have potential use in o

Weaknesses

1. Although the authors present an in-depth manual analysis on SWE-bench Verified and SWE-bench Lite, it remains unclear how the issue of solution leakage is fairly quantified. In real-world problem-solving scenarios, some issues inherently include partial hints as part of clarifying the task instructions, which may complicate defining and measuring leakage consistently.
2. Given the growing concerns about data leakage in LLMs, many benchmarks evolve and update frequently. However, the authors’

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This paper offers a highly significant empirical analysis of the SWE-bench benchmark, a critical standard for evaluating the software engineering capabilities of Large Language Models. The study's key contributions include:
1. The paper systematically identifies two major flaws undermining the benchmark's reliability: Solution Leakage and Weak Unit Tests.
2. Solution Leakage is further categorized into direct leakage and indirect leakage, providing a nuanced understanding of how bug-fixing inf

Weaknesses

1. The overall writing quality of the paper is somewhat coarse. Notably, there is even a reference in the abstract that cannot be properly redirected. The description of the experimental setup is also vague. For instance, Figure 4 is likely to confuse readers regarding the meaning of “overall,” and its caption appears overly simplistic and underdeveloped.
2. The overall contribution of the paper is rather limited. It primarily focuses on identifying and correcting issues within SWE-bench without

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* The study considers 651 submitted patches where tests passed and conducts a manual audit of these cases. This was conducted by all three authors and disagreements were resolved via discussion.
* The subsequent mitigation strategies are validated appropriately.

Weaknesses

* In my view, it is not very clear why some amount of “solution leakage” is a fundamental problem in the benchmark. While it does make the benchmark easier, it models realistic cases of users approaching models with suggestions.
* Furthermore, the paper’s analysis suffers from a positive evidence bias by considering only cases where the AI-generated patches passed.
* While these defects impact the absolute performance numbers, I am not fully convinced that they make SWE-bench less usefu

Reviewer 04 · Rating 2 · Confidence 3

Strengths

1. Empirical observation of benchmark flaws – The authors perform a manual analysis of SWE-Bench instances and identify two concrete issues: “solution leakage” and weak test cases.
2. Develop two tools: SoluLeakDetector and TestEnhancer.
3. Quantitative analysis – Provides clear before/after comparisons showing performance drops.

Weaknesses

1. Trivial or incremental contribution – The main idea (“filter leaks and add tests with an LLM”) is conceptually straightforward and mostly engineering.
2. Unclear definition – “Solution leak” is defined informally (direct vs. hint leaks) but lacks precise operational criteria or reproducible quantitative thresholding.
3. Limited distinctiveness of SWE-Bench+ vs. SWE-Bench. Although the paper filters leakage cases and strengthens tests, it doesn’t convincingly show that SWE-Bench+ measures new
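The request for operational criteria for a "direct" solution leak can be illustrated with a simple verbatim-overlap check. The sketch below is a hypothetical heuristic written for this page, not the paper's SoluLeakDetector (which the reviews describe as an LLM tool): it measures what fraction of the lines added by a unified-diff patch already appear word for word in the issue text. The function names and the 8-character minimum line length are assumptions chosen for illustration.

```python
def added_lines(patch: str, min_length: int = 8) -> list[str]:
    """Return the non-trivial lines a unified diff adds ('+' lines, not '+++')."""
    lines = []
    for raw in patch.splitlines():
        if raw.startswith("+") and not raw.startswith("+++"):
            text = " ".join(raw[1:].split())  # drop the '+' and normalize spaces
            if len(text) >= min_length:       # skip braces, blank lines, etc.
                lines.append(text)
    return lines


def direct_leak_ratio(issue_text: str, patch: str) -> float:
    """Fraction of added patch lines found verbatim in the issue text."""
    added = added_lines(patch)
    if not added:
        return 0.0
    normalized_issue = " ".join(issue_text.split())
    hits = sum(1 for line in added if line in normalized_issue)
    return hits / len(added)


# Hypothetical example data, not taken from SWE-bench.
patch = (
    "--- a/cache.py\n"
    "+++ b/cache.py\n"
    "@@ -10,1 +10,1 @@\n"
    "-    return value\n"
    "+    return value or default_value\n"
)
leaky_issue = "get() ignores defaults. Fix: use `return value or default_value`."
clean_issue = "get() ignores the configured default when the key is missing."

print(direct_leak_ratio(leaky_issue, patch))  # 1.0
print(direct_leak_ratio(clean_issue, patch))  # 0.0
```

A threshold on this ratio would give a reproducible criterion for direct leaks only. Hint-style (indirect) leaks, where the issue describes the fix in prose, would not be caught by string matching and would still need manual or model-based judgment.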

Code & Models

Datasets

anthonypjshaw/SWE-bench_Complex
dataset· 33 dl

Taxonomy

Topics · Advanced Data Storage Technologies