EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Bin Xu; Yu Bai; Huashan Sun; Yiguan Lin; Siming Liu; Xinyue Liang; Yaolin Li; Zhuangzhi Dong; Jingren Zhang; Yufan Deng; Xinyu Zou; Yang Gao; Heyan Huang

arXiv:2505.16160 · cs.CL · January 7, 2026

TL;DR

EduBench is a diverse, education-specific benchmark dataset with multi-dimensional evaluation metrics, designed to support the assessment and development of language models in educational contexts. A relatively small model trained on the dataset shows promising results.

Contribution

This paper presents what the authors describe as the first diverse benchmark dataset and evaluation framework tailored for educational scenarios, along with a small-scale model whose performance is competitive with that of larger models.

Findings

01 The dataset covers 9 major educational scenarios and over 4,000 distinct educational contexts.

02 The proposed metrics evaluate 12 critical aspects relevant to both teachers and students.

03 A relatively small model trained on EduBench performs comparably to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set.

Abstract

As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data that covers 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical…

Code & Models

Repositories

ybai-nlp/edubench
Official

Models

DirectionAI/EDU-Qwen2.5-7B
model · 5 downloads · 2 likes

Datasets

DirectionAI/EduBench
dataset · 169 downloads

Taxonomy

Methods: Sparse Evolutionary Training