SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
Jingxuan Xu, Ken Deng, Weihao Li, Songwei Yu, Huaixi Tang, Haoyang Huang, Zhiyi Lai, Zizheng Zhan, Yanan Wu, Chenchen Zhang, Kepeng Lei, Yifan Yao, Xinping Lei, Wenqiang Zhu, Zongxian Feng, Han Li, Junqi Xiong, Dailin Li, Zuchen Gao, Kun Wu, Wen Xiang, Ziqi Zhan, Yuanxing Zhang

TL;DR
SWE-Compass is a comprehensive, multi-dimensional benchmark designed to evaluate large language models' coding abilities across diverse tasks, languages, and real-world developer workflows, addressing limitations of previous narrow and biased assessments.
Contribution
The paper introduces SWE-Compass, a unified, structured benchmark covering multiple coding tasks, scenarios, and languages, aligned with real-world software engineering practices.
Findings
Hierarchical difficulty across tasks, languages, and scenarios identified.
State-of-the-art LLMs evaluated, revealing performance gaps.
Benchmark facilitates diagnosis and advancement of agentic coding abilities.
Abstract
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. Existing benchmarks often focus on algorithmic problems or Python-centric bug fixing, leaving critical dimensions of software engineering underexplored. To address these gaps, we introduce SWE-Compass1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests and refined through systematic filtering and validation. We benchmark ten state-of-the-art LLMs under two agentic frameworks, SWE-Agent and Claude Code, revealing a clear hierarchy of difficulty across task types,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSoftware Engineering Research · Topic Modeling · Artificial Intelligence in Healthcare and Education
