What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair
Matias Martinez, Xavier Franch

TL;DR
This paper analyzes the SWE-Bench Lite and Verified leaderboards for Automated Program Repair, documenting industry dominance and heavy use of proprietary LLMs, and offers insights intended to improve transparency and diversity in future research.
Contribution
It offers the first comprehensive analysis of SWE-Bench leaderboards, highlighting submission sources, LLM usage, and ecosystem dynamics in automated program repair.
Findings
Most submissions on both leaderboards come from industry, particularly small companies and large publicly traded companies.
Proprietary LLMs, notably Claude 4 Sonnet, dominate top results.
Academic open-source contributions remain competitive.
Abstract
The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems using real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 79 entries submitted to the Lite leaderboard and 133 to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results,…
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Adversarial Robustness in Machine Learning
