GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Manish Shetty; Naman Jain; Jinjian Liu; Vijay Kethanaboyina; Koushik Sen; Ion Stoica

arXiv:2505.23671·cs.SE·October 28, 2025


TL;DR

GSO is a benchmark designed to evaluate language models' ability to develop high-performance software, revealing significant challenges faced by current SWE-Agents in optimization tasks across diverse codebases.

Contribution

The paper introduces GSO, a novel benchmark with an automated pipeline for generating and testing optimization tasks, highlighting the limitations of existing SWE-Agents.

Findings

01. Leading SWE-Agents achieve a success rate below 5%.

02. Inference-time scaling yields only limited improvements for current agents.

03. Key failure modes include difficulties with low-level languages and bottleneck localization.

Abstract

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests, using them to analyze repository commit histories and identify 102 challenging optimization tasks across 10 codebases spanning diverse domains and programming languages. An agent is given a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvements even under inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties…
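The abstract describes scoring an agent's patch by its runtime on a performance test, relative to the expert developer's optimization. The sketch below illustrates one way such a comparison could be set up. It is not the benchmark's actual harness: the function names, the use of a median over repeated runs, and the `fraction` threshold of 0.95 are all assumptions made for illustration and are not taken from the text above.

```python
import time
from statistics import median
from typing import Callable


def measure_runtime(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) over several runs.

    `workload` stands in for a performance test; it is a hypothetical
    zero-argument callable, not part of any real GSO API.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return median(samples)


def speedup(baseline_s: float, candidate_s: float) -> float:
    """Speedup of a candidate over the unoptimized baseline (higher is better)."""
    return baseline_s / candidate_s


def matches_expert(
    baseline_s: float,
    agent_s: float,
    expert_s: float,
    fraction: float = 0.95,  # assumed threshold, chosen only for illustration
) -> bool:
    """True if the agent's speedup reaches `fraction` of the expert's speedup."""
    return speedup(baseline_s, agent_s) >= fraction * speedup(baseline_s, expert_s)


if __name__ == "__main__":
    # Toy runtimes in seconds: baseline 10 s, agent patch 5 s, expert commit 2 s.
    print(speedup(10.0, 5.0))
    print(matches_expert(10.0, 5.0, 2.0))
```

Comparing speedups rather than raw runtimes keeps the score relative to the same baseline for both the agent and the expert, so a task with a long-running test does not weigh more than a short one.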


Code & Models

Repositories

gso-bench/gso
Official

Datasets

gso-bench/gso
dataset · 2.7k downloads

Videos

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents · SlidesLive