Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
Matias Martinez, Xavier Franch

TL;DR
This paper provides a comprehensive analysis of submissions to the SWE-Bench leaderboards, revealing insights into the architectures, submitter types, and LLM usage in automated program repair systems.
Contribution
It is the first detailed study of SWE-Bench submissions, profiling 178 leaderboard entries (79 on SWE-Bench Lite and 99 on SWE-Bench Verified) to understand design choices and contributor diversity.
Findings
Proprietary LLMs, especially Claude 3.5, dominate the solutions.
Both agentic and non-agentic system architectures are present.
The contributor base ranges from individuals to large tech companies.
Abstract
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards -- SWE-Bench Lite and SWE-Bench Verified -- have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (79 entries) and Verified (99 entries) leaderboards, analyzing 80 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system…
Taxonomy
Topics: Software Engineering Research · Software Testing and Debugging Techniques · Scientific Computing and Data Management
Methods: Balanced Selection
