SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?
Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, Zejun Ma

TL;DR
This paper introduces SWE-Perf, a benchmark for evaluating large language models' ability to optimize code performance in real-world repositories, revealing significant gaps compared to expert solutions.
Contribution
SWE-Perf is the first benchmark specifically designed to assess LLMs on code performance optimization in authentic repositories, providing a systematic evaluation framework.
Findings
Existing LLMs show a large performance gap compared to experts.
SWE-Perf includes 140 curated instances from real GitHub repositories.
Evaluation reveals critical research opportunities in code optimization with LLMs.
Abstract
Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and executable environments. Through a comprehensive evaluation of representative methods that span…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Realistic benchmark: The authors adopt well-established pipelines in SWE-Bench and SWE-Gym to filter PRs among 12 real-world Python code repositories. Furthermore, by distinguishing between file-level and repo-level agentic settings, the benchmark captures both targeted (potentially algorithmic) and system-wide optimizations. 2. Empirical evaluation includes both pipeline-based and agent-based paradigms, further decoupling correctness from performance.
1. Test coverage identification: In this work, the authors select only unit tests directly tied to performance optimization, which may not represent the full set of tests relevant to a target function or patch. This, in turn, could skew the empirical findings altogether. For instance, a loose set of related tests can be estimated from the static call graph by looking at the test coverage, and mapping it to all the functions covered in a patch. For more precision, the dynamic call graphs can be s
- The data collection and curation process is exceptionally thorough. The multi-phase pipeline, which includes executing tests, ensuring reproducibility in a containerized environment, and using statistical tests (Mann-Whitney U test) to confirm performance improvements, lends high credibility and quality to the resulting dataset. - The authors provide deeper analysis into how performance varies with the complexity of the task (e.g., number of target functions, original runtime) and offer qualitat
- Limited Dataset Size and Generalizability: The final dataset contains 140 instances from 9 Python repositories. While the curation process justifies the small size, it may limit the statistical power and generalizability of the findings. Performance on these popular repositories might not be representative of performance on other languages or less common software projects. The authors acknowledge this limitation. - Reliance on Unit Tests: This is a key methodological limitation. While using un
1. The paper extends code performance optimization from the function level to realistic repository-level settings. This design assesses LLMs’ ability not only to optimize a single code function or algorithm but also to retrieve relevant snippets and locate performance bottlenecks within large, complex codebases. 2. The authors design a two-stage evaluation setup consisting of the Oracle and Realistic settings, at the file level and repository level, respectively. This hierarchical design provides a
1. SWE-Perf does not account for differences in repository application domains, structural characteristics, or development maturity. In real-world software systems, these factors substantially influence performance requirements, optimization strategy and optimization potential. However, SWE-Perf fails to consider such contextual performance ceilings and optimization structure / strategy differences, while simply mixing the instances from different domains / repositories with different features t
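One of the reviews above notes that the curation pipeline uses a Mann-Whitney U test to confirm that a patch really improves runtime. As a rough illustration of that kind of check, and not the authors' actual implementation, the sketch below compares two sets of repeated runtime measurements with a one-sided Mann-Whitney U test using a normal approximation. The function name, the sample runtimes, and the 0.05 threshold are all assumptions made for this example.

```python
import math
from typing import Sequence, Tuple


def mann_whitney_u_less(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """One-sided Mann-Whitney U test (hypothetical helper, illustration only).

    Tests H1: values in x tend to be smaller than values in y.
    Returns (U statistic for x, approximate p-value), using a normal
    approximation with tie and continuity corrections.
    """
    n1, n2 = len(x), len(y)
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda t: t[0])

    # Assign average ranks to tied values and accumulate the tie correction.
    ranks = [0.0] * len(pooled)
    tie_term = 0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    # U for x counts the pairs (x_i, y_j) with x_i > y_j (ties count one half).
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2

    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0  # all values identical: no evidence either way

    # Lower-tail probability P(U <= u_x) with continuity correction.
    z = (u_x + 0.5 - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(-z / math.sqrt(2))
    return u_x, p_value


# Made-up runtimes in seconds: repeated runs after and before a patch.
patched_runtimes = [1.00, 1.10, 0.90, 1.05, 0.95]
original_runtimes = [2.00, 2.10, 1.90, 2.05, 1.95]

u_stat, p = mann_whitney_u_less(patched_runtimes, original_runtimes)
is_significant = p < 0.05  # assumed significance threshold
```

A rank-based test like this does not assume runtimes are normally distributed, which is useful because timing measurements are often skewed by noise. With very small samples an exact test would be more appropriate than the normal approximation used here.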
Taxonomy
Topics: Scientific Computing and Data Management · Semantic Web and Ontologies · Software Engineering Research
