OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models

Yuhe Liu; Changhua Pei; Longlong Xu; Bohan Chen; Mingze Sun; Zhirui Zhang; Yongqian Sun; Shenglin Zhang; Kun Wang; Haiming Zhang; Jianhui Li; Gaogang Xie; Xidao Wen; Xiaohui Nie; Minghua Ma; Dan Pei

arXiv:2310.07637 · cs.AI · June 18, 2025 · 1 citation

TL;DR

OpsEval is a benchmark suite for evaluating large language models on IT operations tasks. It covers multiple-choice and question-answering formats in English and Chinese, is openly accessible, and is updated on an ongoing basis.

Contribution

This paper introduces OpsEval, presented as the first extensive benchmark for assessing LLMs in IT operations scenarios. It comprises a large dataset, expert review, and an online leaderboard.

Findings

01 Current LLMs show varied performance across Ops tasks.

02 Model techniques significantly influence Ops performance.

03 Hallucination issues affect LLM reliability in Ops.

Abstract

Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is essential to keeping existing information systems running in an orderly and stable manner. According to Gartner's prediction, the use of AI technology for automated IT operations has become a new trend. Large language models (LLMs), which have exhibited remarkable capabilities in NLP-related tasks, show great potential in AIOps, for example in root cause analysis of failures, generation of operations and maintenance scripts, and summarization of alert information. Nevertheless, the performance of current LLMs in Ops tasks is yet to be determined. In this paper, we present OpsEval, a comprehensive task-oriented Ops benchmark designed for LLMs. For the first time, OpsEval assesses LLMs' proficiency in various crucial scenarios at different ability…

Code & Models

Repositories

netmanaiops/opseval-datasets (official)

Datasets

Junetheriver/OpsEval
dataset · 143 downloads

Taxonomy

Topics: Topic Modeling · Data Quality and Management · Software Engineering Research

Methods: Focus