SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

TL;DR
SWE-bench is a new evaluation framework using real GitHub issues to test language models' ability to understand and resolve complex software engineering problems, revealing current models' limited capabilities.
Contribution
Introduction of SWE-bench, a comprehensive benchmark with real-world problems to evaluate language models' effectiveness in software engineering tasks.
Findings
State-of-the-art models resolve only the simplest issues.
The best-performing model, Claude 2, resolves only 1.96% of the issues.
Models struggle with complex, multi-faceted problems.
Abstract
Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and challenging testbed for evaluating the next generation of language models. To this end, we introduce SWE-bench, an evaluation framework consisting of software engineering problems drawn from real GitHub issues and corresponding pull requests across popular Python repositories. Given a codebase along with a description of an issue to be resolved, a language model is tasked with editing the codebase to address the issue. Resolving issues in SWE-bench frequently requires understanding and coordinating changes across multiple functions, classes, and even files simultaneously, calling for models to interact with execution environments, process…
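The task setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `TaskInstance` fields, the helper names, and the assumption that a candidate edit is judged by whether a set of required repository tests passes are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    """One benchmark problem: an issue plus the tests a fix must satisfy.

    All field names are hypothetical and chosen for illustration only.
    """
    repo: str
    issue_text: str
    required_tests: list[str] = field(default_factory=list)


def is_resolved(instance: TaskInstance, test_outcomes: dict[str, bool]) -> bool:
    """Count an instance as resolved only if every required test passed.

    A test missing from test_outcomes counts as a failure, and an
    instance with no required tests is never counted as resolved.
    """
    if not instance.required_tests:
        return False
    return all(test_outcomes.get(name, False) for name in instance.required_tests)


def percent_resolved(results: list[bool]) -> float:
    """Share of instances resolved, expressed as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)
```

Under this sketch, a headline figure such as the 1.96% reported for Claude 2 would be the value of `percent_resolved` over all task instances.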
Peer Reviews
Decision: ICLR 2024 oral
The paper addresses a practically relevant issue, that of a benchmark for evaluating language models. The paper is clearly written, and quite a lot of work seems to have been done to support the material in the paper.
It seems that none of the models does well on the benchmark. It would be nice if the benchmark could be used to indicate more clearly where the problem in the language model lies. The results of the model evaluation (e.g., that difficulty correlates with context length or with output length) are expected and thus do not seem very interesting.
The primary contribution of this paper is the creation of a new dataset and methodology for evaluating the performance of LLMs on real-world software engineering tasks. The benchmark is well-designed, and can be continually updated and expanded moving forward. The experiments with existing models are interesting, but they mainly serve to illustrate that this is a difficult and unsolved problem. I fully expect this to be a high-impact paper, because other practitioners working in this area c
Generating a patch file, and generating code, are two very different tasks. Existing models are pretrained on code, not patch files, so at least some of the poor performance could simply be due to the fact that the models are operating out of distribution on this data set. (The authors mention this issue in the paper.)
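To make the reviewer's distinction concrete, the sketch below expresses one small fix in both forms: as the full edited source and as a unified-diff patch. The `mean` function, the file name `stats.py`, and the helper `make_patch` are hypothetical examples; only `difflib.unified_diff` from the Python standard library is a real API.

```python
import difflib

# The code before and after a hypothetical fix (full-file form).
original = """def mean(values):
    return sum(values) / len(values)
"""

fixed = """def mean(values):
    if not values:
        return 0.0
    return sum(values) / len(values)
"""


def make_patch(before: str, after: str, path: str) -> str:
    """Render the change from `before` to `after` as a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# The same fix in patch form: file headers, a hunk header, and +/- lines.
patch = make_patch(original, fixed, "stats.py")
print(patch)
```

A model asked for the patch form has to produce file headers, hunk line ranges, and unchanged context lines exactly, in addition to the new code itself, which is one plausible reason the two tasks differ in difficulty.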
- Authors present a good real-world benchmark based on real, product-sized GitHub repositories and real issues fixed in them.
- Fine-tune CodeLlama 7B and 13B models to get at least somewhat positive performance on repository-wide code edits.
- Propose retrieval methods to compose input for LLMs that fits into the LLM context size.
- Evaluate LLMs on the benchmark and present general lessons from the results.
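The idea of retrieving files so that the model input fits a context limit can be sketched as below. This is not the paper's retrieval method: the term-overlap score is a deliberately simple stand-in for a real ranking function, word counts stand in for model tokens, and the names `tokenize`, `rank_files`, and `pack_context` are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word-like tokens (a crude proxy for model tokens)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def rank_files(issue: str, files: dict[str, str]) -> list[str]:
    """Order file paths by how many distinct issue terms each file contains.

    Ties are broken alphabetically by path so the order is deterministic.
    """
    issue_terms = set(tokenize(issue))

    def score(path: str) -> int:
        return len(issue_terms & set(tokenize(files[path])))

    return sorted(files, key=lambda path: (-score(path), path))


def pack_context(issue: str, files: dict[str, str], budget: int) -> list[str]:
    """Greedily add the highest-ranked files that still fit in the budget."""
    chosen = []
    used = len(tokenize(issue))  # the issue text itself consumes budget
    for path in rank_files(issue, files):
        cost = len(tokenize(files[path]))
        if used + cost > budget:
            continue  # too large; try the next, lower-ranked file
        chosen.append(path)
        used += cost
    return chosen
```

The greedy loop skips a file that does not fit rather than stopping, so smaller lower-ranked files can still fill the remaining budget. Whether that is desirable depends on how much irrelevant context the model tolerates.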
- Although the benchmark and the LLM evaluation on it are valuable, the paper does not present any novel solutions to the task in the benchmark. This limits the contribution.
- Please reorganize the paper so that tables and figures are collocated with the text. Currently, it is hard to read because tables are referenced out of order and explained very far from their location in the paper.
Videos
AI Agents Take the Wheel: Devin, SIMA, Figure 01 and The Future of Jobs (YouTube)
Taxonomy
Topics: Software System Performance and Reliability · Software Engineering Research · Topic Modeling
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection
