OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education

Min Zhang; Hao Chen; Hao Chen; Wenqi Zhang; Didi Zhu; Xin Lin; Bo Jiang; Aimin Zhou; Fei Wu; Kun Kuang

arXiv:2510.26422·cs.CL·October 31, 2025

TL;DR

OmniEduBench is a Chinese educational benchmark that evaluates large language models along two dimensions, knowledge and cultivation, across diverse subjects and question types. Its results reveal significant performance gaps in current models.

Contribution

This paper introduces OmniEduBench, a comprehensive Chinese educational benchmark with diverse question types and dimensions, addressing limitations of prior benchmarks focused mainly on knowledge evaluation.

Findings

01

Gemini-2.5 Pro achieved over 60% accuracy in knowledge tasks.

02

The QWQ model trailed human performance by nearly 30% on cultivation tasks.

03

Significant performance gaps highlight challenges in applying LLMs to education.

Abstract

With the rapid development of large language models (LLMs), various LLM-based works have been widely applied in educational fields. However, most existing LLMs and their benchmarks focus primarily on the knowledge dimension, largely neglecting the evaluation of cultivation capabilities that are essential for real-world educational scenarios. Additionally, current benchmarks are often limited to a single subject or question type, lacking sufficient diversity. This issue is particularly prominent within the Chinese context. To address this gap, we introduce OmniEduBench, a comprehensive Chinese educational benchmark. OmniEduBench consists of 24.602K high-quality question-answer pairs. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension, which contain 18.121K and 6.481K entries, respectively. Each dimension is further subdivided…
