OpenCompass: A Universal Evaluation Platform for Large Language Models

Maosong Cao; Kai Chen; Haodong Duan; Yixiao Fang; Tong Gao; Ge Jiaye; Mo Li; Hongwei Liu; Junnan Liu; Yuan Liu; Chengqi Lyu; Han Lyu; Ningsheng Ma; Zerun Ma; Yu Sun; Zhiyong Wu; Linchen Xiao; Jun Xu; Haochen Ye; Zhaohui Yu; Yike Yuan; Songyang Zhang; Yufeng Zhao; Fengzhe Zhou; Peiheng Zhou; Dongsheng Zhu; Lin Zhu; Jingming Zhuo

arXiv:2605.19276·cs.CL·May 20, 2026

TL;DR

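OpenCompass is a scalable, modular platform for comprehensive, cross-domain evaluation of large language models. It targets the limitations of static benchmark evaluation: diverse task types, inconsistent evaluation criteria, and fragmented data and processing workflows.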

Contribution

The paper introduces and open-sources a flexible, high-concurrency evaluation platform whose modular architecture supports diverse datasets and evaluation methods.
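
To make the idea of a modular, concurrent evaluation pipeline concrete, here is a minimal Python sketch. It is not OpenCompass's actual API: the registries, the `evaluate` function, the `exact_match` metric, and the toy dataset and model are all hypothetical names chosen for illustration. It assumes only that datasets, metrics, and models are decoupled components and that model calls can run in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; not OpenCompass's real interfaces.
Example = Tuple[str, str]              # (prompt, reference answer)
Model = Callable[[str], str]           # maps a prompt to a response
Metric = Callable[[str, str], float]   # scores (prediction, reference)

# Registries keep datasets and metrics independent of the evaluation loop.
DATASETS: Dict[str, List[Example]] = {}
METRICS: Dict[str, Metric] = {}


def register_dataset(name: str, examples: List[Example]) -> None:
    DATASETS[name] = examples


def register_metric(name: str, fn: Metric) -> None:
    METRICS[name] = fn


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the stripped strings are identical, else 0.0."""
    return float(prediction.strip() == reference.strip())


def evaluate(model: Model, dataset: str, metric: str, max_workers: int = 4) -> float:
    """Run the model over a registered dataset concurrently and average the metric."""
    examples = DATASETS[dataset]
    if not examples:
        raise ValueError(f"dataset {dataset!r} is empty")
    score = METRICS[metric]
    prompts = [prompt for prompt, _ in examples]
    # pool.map preserves input order, so predictions line up with examples.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        predictions = list(pool.map(model, prompts))
    scores = [score(pred, ref) for pred, (_, ref) in zip(predictions, examples)]
    return sum(scores) / len(scores)


# Toy components: a two-item dataset and a lookup-table "model".
register_metric("exact_match", exact_match)
register_dataset("toy_arithmetic", [("2+2", "4"), ("3+3", "7")])
toy_answers = {"2+2": "4", "3+3": "6"}


def toy_model(prompt: str) -> str:
    return toy_answers.get(prompt, "")


accuracy = evaluate(toy_model, "toy_arithmetic", "exact_match")
print(accuracy)
```

In this sketch, adding a benchmark or a scoring rule means registering one more entry, and the evaluation loop stays unchanged. A thread pool suits I/O-bound model calls such as remote API requests; a real platform would likely also need batching, retries, and distributed scheduling, none of which is shown here.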

Findings

01 Supports multiple benchmark datasets across domains

02 Provides high compatibility and flexibility

03 Enables efficient large-scale LLM evaluation

Abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from small, task-specific models to general-purpose large language models (LLMs). As LLMs iterate rapidly, objective, quantitative, and comprehensive evaluation of their capabilities has become critical to advancing the technology. Mainstream evaluation methods based on static benchmark datasets face challenges such as diverse task types, inconsistent evaluation criteria, and fragmented data and processing workflows, which make efficient cross-domain, large-scale model evaluation difficult. To address these issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, general-purpose LLM evaluation platform that supports high concurrency. Adhering to the design philosophy of modularization and…
