SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

Jeffrey Jian Ma; Milad Hashemi; Amir Yazdanbakhsh; Kevin Swersky; Ofir Press; Enhui Li; Vijay Janapa Reddi; Parthasarathy Ranganathan

arXiv:2511.06090·cs.SE·November 12, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces SWE-fficiency, a benchmark for evaluating how well language models can optimize large-scale software repositories on real workloads, and shows that current agents fall well short of expert-level speedups.

Contribution

The paper presents SWE-fficiency, a new benchmark with an automated pipeline for evaluating repository-level performance optimization, and provides empirical analysis of state-of-the-art agents' limitations.

Findings

01

Agents achieve less than 0.15x the expert speedup

02

Agents struggle with localizing optimization opportunities

03

Agents have difficulty maintaining correctness in edits

Abstract

Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static…
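The task statement above (match or exceed the expert speedup while passing the same unit tests) can be made concrete with a small scoring sketch. The abstract is cut off before it defines a scoring rule, so this is an illustration under stated assumptions, not the benchmark's actual harness: the function names, the rule that a test-breaking patch is scored as no speedup, the harmonic-mean aggregation, and the runtimes are all hypothetical.

```python
from statistics import harmonic_mean


def speedup(baseline_seconds: float, patched_seconds: float) -> float:
    """Return how many times faster the patched workload runs than the baseline."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def speedup_ratio(
    baseline_seconds: float,
    agent_seconds: float,
    expert_seconds: float,
    tests_pass: bool,
) -> float:
    """Agent speedup divided by expert speedup for one task.

    Assumption (not taken from the source): a patch that breaks the unit
    tests is scored as if it left the runtime unchanged (speedup of 1.0).
    """
    agent = speedup(baseline_seconds, agent_seconds) if tests_pass else 1.0
    expert = speedup(baseline_seconds, expert_seconds)
    return agent / expert


# Hypothetical measurements: (baseline, agent, expert, tests_pass).
tasks = [
    (10.0, 5.0, 2.0, True),   # agent 2x, expert 5x
    (8.0, 8.0, 4.0, True),    # agent found no speedup, expert 2x
    (6.0, 1.0, 3.0, False),   # fast but incorrect patch, scored as 1x
]

ratios = [speedup_ratio(*task) for task in tasks]

# Assumption: aggregate with a harmonic mean so that a few large wins
# cannot hide many tasks where the agent lags the expert.
overall = harmonic_mean(ratios)
print(ratios, overall)
```

A ratio of 1.0 means the agent matched the expert on that task, and values below 1.0 mean it fell short. The third task shows why correctness is checked first: a large raw speedup earns nothing if the unit tests fail.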

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 0 · Confidence 4

Strengths

- The paper was a joy to read and I commend the authors for going the extra mile in making informative figures and eliciting insightful research questions!

Weaknesses

**Desiderata of an ML Benchmark:** The introduction of `[7]` lists several desiderata for a good ML benchmark. To summarize, they claim that a useful ML benchmark is both difficult and realistic, i.e., the tasks should be challenging for frontier models and agents while remaining realistic. Without both features, the usefulness of a benchmark is severely hampered. For performance benchmarking, one such desideratum elucidated by one of the

Reviewer 02 · Rating 8 · Confidence 5

Strengths

- Thorough analysis of the performance of current state-of-the-art models, revealing key weaknesses.
- Insightful analysis of agent's solutions
- Well-designed evaluation framework, authors make their best effort to make a reproducible and reliable framework
- Manual work to ensure benchmark data verifiable and achievable

Weaknesses

- Weakness 1: Potential for Workload Oversimplification and Overfitting: A significant concern is that the benchmark relies on singular, self-contained workload scripts to represent performance issues. Real-world software performance is often highly sensitive to the context, scale, and statistical distribution of the input data. The benchmark's scripts, while curated, may not fully capture this complexity. This creates a critical risk: an agent could generate a patch that is perfectly "optimized"

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper attempts to evaluate the real-world efficiency of LLMs within a complete software engineering workflow. The dataset construction pipeline is clearly designed, and the evaluation metrics are appropriately expanded beyond correctness. The experimental results, which analyze factors such as task complexity and contextual understanding, provide valuable insights for understanding the practical capabilities of LLMs in software development.

Weaknesses

1. The experimental evaluation is limited to four LLMs, which restricts the generality of the findings and makes it difficult to assess how well the benchmark scales across different model families.
2. The paper lacks deeper discussion of the experimental results, for example, why certain models perform better on specific tasks or how task characteristics affect efficiency

Taxonomy

Topics: Software System Performance and Reliability · Scientific Computing and Data Management · Software Engineering Research