EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Guoqing Ma; Jia Zhu; Hanghui Guo; Weijie Shi; Yue Cui; Jiawei Shen; Zilong Li; Yidan Liang

arXiv:2512.00290 · cs.CL · December 2, 2025




TL;DR

EduEval is a comprehensive hierarchical benchmark for evaluating large language models in Chinese K-12 education. It combines a cognitive skill taxonomy, authentic educational tasks, and large scale, with the goal of supporting safe and effective deployment.

Contribution

The paper introduces the EduAbility Taxonomy, integrates authentic educational data, and provides a large-scale evaluation framework for LLMs in Chinese education.

Findings

1. Models excel at factual tasks but struggle with dialogue and creativity.

2. Open-source models outperform proprietary ones on complex reasoning.

3. Few-shot prompting effectiveness varies across cognitive dimensions.

Abstract

Large language models (LLMs) demonstrate significant potential for educational applications. However, deploying them without scrutiny poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions: Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversations, student essays, and expert-designed prompts to reflect genuine educational challenges. (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning…
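As a rough illustration of how results might be organized along the six cognitive dimensions named in the abstract, the following minimal sketch groups graded items by dimension and reports a per-dimension score. The `Dimension` enum, the `accuracy_by_dimension` helper, and the use of plain accuracy as the metric are all assumptions made for illustration; the benchmark's actual data format and scoring (especially for generative tasks) may differ.

```python
from collections import defaultdict
from enum import Enum


class Dimension(Enum):
    """The six cognitive dimensions listed in the abstract."""
    MEMORIZATION = "Memorization"
    UNDERSTANDING = "Understanding"
    APPLICATION = "Application"
    REASONING = "Reasoning"
    CREATIVITY = "Creativity"
    ETHICS = "Ethics"


def accuracy_by_dimension(results):
    """Aggregate graded items into per-dimension accuracy.

    `results` is an iterable of (Dimension, bool) pairs, where the bool
    says whether the model's answer was judged correct. This layout is a
    hypothetical one chosen for the sketch, not the benchmark's format.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, is_correct in results:
        total[dimension] += 1
        correct[dimension] += int(is_correct)
    # Only dimensions that appear in `results` are reported.
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    # Made-up graded items, purely to show the call shape.
    sample = [
        (Dimension.MEMORIZATION, True),
        (Dimension.MEMORIZATION, False),
        (Dimension.ETHICS, True),
    ]
    for dim, score in accuracy_by_dimension(sample).items():
        print(f"{dim.value}: {score:.2f}")
```

Reporting scores per dimension rather than as one aggregate makes it easier to see patterns such as strong factual recall alongside weaker creative or dialogue performance.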


Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Text Readability and Simplification · Topic Modeling