SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models

Jingxuan Xu; Ken Deng; Weihao Li; Songwei Yu; Huaixi Tang; Haoyang Huang; Zhiyi Lai; Zizheng Zhan; Yanan Wu; Chenchen Zhang; Kepeng Lei; Yifan Yao; Xinping Lei; Wenqiang Zhu; Zongxian Feng; Han Li; Junqi Xiong; Dailin Li; Zuchen Gao; Kun Wu; Wen Xiang; Ziqi Zhan; Yuanxing Zhang; Wuxuan Gong; Ziyuan Gao; Guanxiang Wang; Yirong Xue; Mengtong Li; Mengfei Xie; Xiaojiang Zhang; Jinghui Wang; Wenhao Zhuang; Zheng Lin; Huiming Wang; Zhaoxiang Zhang; Yuqun Zhang; Haotian Zhang; Bin Chen; Jiaheng Liu

arXiv:2511.05459 · cs.SE · November 12, 2025

Open Access · 1 Model · 1 Dataset

TL;DR

SWE-Compass is a comprehensive, multi-dimensional benchmark designed to evaluate large language models' coding abilities across diverse tasks, languages, and real-world developer workflows, addressing limitations of previous narrow and biased assessments.

Contribution

The paper introduces SWE-Compass, a unified, structured benchmark covering multiple coding tasks, scenarios, and languages, aligned with real-world software engineering practices.

Findings

01. The evaluation reveals a clear hierarchy of difficulty across task types, languages, and scenarios.

02. Ten state-of-the-art LLMs are evaluated under two agentic frameworks, SWE-Agent and Claude Code, revealing performance gaps.

03. The benchmark supports the diagnosis and advancement of agentic coding abilities.

Abstract

Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types,…
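
The per-dimension breakdown described in the abstract (task type, scenario, language) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Result` record, its field names, the `resolve_rate_by` helper, and the toy category labels are hypothetical and do not reflect the actual SWE-Compass schema, metric definitions, or reported results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """One evaluated benchmark instance (all field names are hypothetical)."""
    task_type: str
    scenario: str
    language: str
    resolved: bool


def resolve_rate_by(results, dimension):
    """Return {value: fraction of instances resolved} for one dimension."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        key = getattr(r, dimension)
        totals[key] += 1
        solved[key] += int(r.resolved)
    return {key: solved[key] / totals[key] for key in totals}


# Toy records made up for illustration; these are not results from the paper.
toy_results = [
    Result("bug_fixing", "web", "python", True),
    Result("bug_fixing", "web", "rust", False),
    Result("feature_implementation", "cli", "python", True),
    Result("feature_implementation", "cli", "rust", True),
]

by_language = resolve_rate_by(toy_results, "language")
by_task = resolve_rate_by(toy_results, "task_type")
```

Grouping the same set of records along different dimensions is one simple way a difficulty hierarchy could be surfaced; the paper's own aggregation may differ.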

Code & Models

Models

shunxing1234/test_2 (model)

Datasets

Kwaipilot/SWE-Compass (dataset, 224 downloads)

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Artificial Intelligence in Healthcare and Education