SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution

Mohit Raghavendra; Soham Dan; Miguel Romero Calvo; Yannis Yiming He; Johannes Baptist Mols; Gautam Anand; Cole McCollum; Edgar Arakelyan; Vijay Bharadwaj; Andrew Park; Jeff Da; MohammadHossein Rezaei; Bing Liu; Brad Kenstler; Yunzhong He

arXiv:2605.08366 · cs.LG · May 12, 2026

TL;DR

SWE Atlas is a benchmark suite that evaluates coding agents on three professional software engineering workflows (Codebase Q&A, Test Writing, and Refactoring), scoring software engineering quality as well as functional correctness.

Contribution

The paper introduces a benchmark that targets underrepresented but practically important task categories, defines category-specific evaluation protocols, and analyzes how frontier and open-weight models perform on them.

Findings

01. GPT-5.4 and Opus 4.7 outperform other models

02. Open-weight models perform poorly overall

03. Top models rely on codebase exploration and runtime reasoning

Abstract

We introduce SWE Atlas, a benchmark suite for coding agents spanning three professional software engineering workflows: Codebase Q&A (124 tasks), Test Writing (90 tasks), and Refactoring (70 tasks). SWE Atlas differs from prior SWE benchmarks in three key ways: it targets underrepresented but practically important task categories, uses comprehensive category-specific evaluation protocols, and adopts under-specified, agentic task formulations that better reflect real-world usage. Its evaluation framework combines programmatic checks with rubric-based assessment. This goes beyond functional correctness, evaluating software engineering quality, including test and refactor completeness, maintainability, reusable abstractions, and codebase hygiene. We evaluate a range of frontier and open-weight models on SWE Atlas and find that GPT-5.4 and Opus 4.7 achieve the strongest overall performance,…
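As a rough illustration of how programmatic checks and rubric-based assessment could be combined into a per-category score, consider the minimal sketch below. Only the task counts come from the abstract. The `TaskResult` fields, the rule that a failed programmatic check zeroes the task score, and the per-category averaging are all assumptions made for illustration, not the scoring protocol defined in the paper.

```python
from dataclasses import dataclass

# Task counts as stated in the abstract (124 + 90 + 70 = 284 tasks).
TASK_COUNTS = {"Codebase Q&A": 124, "Test Writing": 90, "Refactoring": 70}


@dataclass
class TaskResult:
    """Hypothetical record of one agent attempt; field names are illustrative."""
    category: str
    checks_passed: list[bool]  # outcomes of programmatic checks
    rubric_met: list[bool]     # outcomes of rubric items


def task_score(result: TaskResult) -> float:
    """Assumed rule: any failed programmatic check zeroes the score;
    otherwise the score is the fraction of rubric items met."""
    if not all(result.checks_passed):
        return 0.0
    if not result.rubric_met:
        return 1.0  # no rubric items: the checks alone decide
    return sum(result.rubric_met) / len(result.rubric_met)


def category_means(results: list[TaskResult]) -> dict[str, float]:
    """Average the task scores within each category."""
    buckets: dict[str, list[float]] = {}
    for r in results:
        buckets.setdefault(r.category, []).append(task_score(r))
    return {cat: sum(scores) / len(scores) for cat, scores in buckets.items()}
```

Under this assumed rule, a refactoring attempt that passes every programmatic check but meets only half of its rubric items scores 0.5, while one that fails any check scores 0 regardless of its rubric outcome.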
