GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

TL;DR
GSO is a benchmark designed to evaluate language models' ability to develop high-performance software, revealing significant challenges faced by current SWE-Agents in optimization tasks across diverse codebases.
Contribution
The paper introduces GSO, a novel benchmark with an automated pipeline for generating and testing optimization tasks, highlighting the limitations of existing SWE-Agents.
Findings
Leading SWE-Agents achieve a success rate below 5%.
Current agents show limited improvement even with inference-time scaling.
Key failure modes include difficulties with low-level languages and bottleneck localization.
Abstract
Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvement even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties…
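The evaluation setup described in the abstract (a performance test as the specification, with the agent's result compared against the expert developer's optimization) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function names `measure_runtime`, `speedup`, and `matches_expert`, the use of a median over repeated runs, and the 0.95 threshold are all assumptions chosen for illustration.

```python
import time
from statistics import median
from typing import Callable


def measure_runtime(workload: Callable[[], object], repeats: int = 5) -> float:
    """Return the median wall-clock time (seconds) of `workload` over `repeats` runs.

    Hypothetical helper: the abstract does not say how timings are aggregated.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        timings.append(time.perf_counter() - start)
    return median(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Speedup of a candidate over the unoptimized baseline (higher is better)."""
    if candidate_seconds <= 0:
        raise ValueError("candidate runtime must be positive")
    return baseline_seconds / candidate_seconds


def matches_expert(
    baseline_seconds: float,
    agent_seconds: float,
    expert_seconds: float,
    threshold: float = 0.95,  # assumed fraction of the expert speedup, not from the source
) -> bool:
    """True if the agent's speedup reaches `threshold` times the expert's speedup."""
    agent_gain = speedup(baseline_seconds, agent_seconds)
    expert_gain = speedup(baseline_seconds, expert_seconds)
    return agent_gain >= threshold * expert_gain
```

In this sketch, a task would count as solved only when the agent's patch recovers most of the speedup achieved by the expert commit on the same performance test; a real harness would also need to verify that the patched code still produces correct results.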
