ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios
Shou'ang Wei, Xinyun Wang, Shuzhen Bi, Jian Chen, Ruijia Li, Bo Jiang, Xin Lin, Min Zhang, Yu Song, BingDong Li, Aimin Zhou, Hao Hao

TL;DR
ELMES is an open-source framework that automates the evaluation of large language models in educational scenarios, addressing the lack of specialized assessment metrics and enabling flexible, pedagogically relevant benchmarking.
Contribution
We introduce ELMES, a modular, hybrid evaluation framework tailored for assessing LLMs in education, with a novel LLM-as-a-Judge methodology and scenario-specific metrics.
Findings
Different models show distinct capability profiles across educational scenarios.
ELMES quantifies pedagogical metrics objectively.
The framework lowers the barrier to evaluating LLMs in education.
Abstract
The emergence of Large Language Models (LLMs) presents transformative opportunities for education, generating numerous novel application scenarios. However, significant challenges remain: evaluation metrics vary substantially across different educational scenarios, while many emerging scenarios lack appropriate assessment metrics. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. To address this gap, we introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings. ELMES features a modular architecture that enables researchers to create dynamic, multi-agent dialogues through simple configuration files, facilitating flexible scenario design without requiring extensive programming expertise. The framework incorporates a hybrid evaluation engine that objectively quantifies…
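To make the configuration-driven design described in the abstract concrete, here is a minimal sketch of how a scenario might be declared as data, played out as a multi-agent dialogue, and then scored against a rubric. Everything in it is an assumption for illustration: the `SCENARIO` schema, the `run_dialogue` and `judge_stub` functions, the rubric names, and the canned agent replies are hypothetical and are not ELMES's actual API or configuration format. The judge is a string-matching stand-in for an LLM-as-a-Judge call so that the example runs without any model.

```python
from typing import Callable, Dict, List

# Hypothetical scenario configuration (not the real ELMES schema).
SCENARIO = {
    "name": "guided_problem_solving",
    "max_turns": 4,
    "agents": ["teacher", "student"],  # speaking order, repeated round-robin
    "rubric": {"asks_guiding_question": 1.0, "avoids_giving_answer": 1.0},
}

Message = Dict[str, str]
Responder = Callable[[List[Message]], str]


def teacher_stub(history: List[Message]) -> str:
    # Canned reply standing in for the model under evaluation.
    return "What do you think the first step is?"


def student_stub(history: List[Message]) -> str:
    # Canned reply standing in for a simulated student agent.
    return "Maybe I should isolate x?"


def run_dialogue(config: dict, responders: Dict[str, Responder]) -> List[Message]:
    """Alternate between the configured agents for max_turns turns."""
    roles = config["agents"]
    history: List[Message] = []
    for turn in range(config["max_turns"]):
        role = roles[turn % len(roles)]
        history.append({"role": role, "content": responders[role](history)})
    return history


def judge_stub(history: List[Message], rubric: Dict[str, float]) -> Dict[str, float]:
    """Stand-in for an LLM judge: scores teacher turns with simple string checks."""
    teacher_turns = [m["content"] for m in history if m["role"] == "teacher"]
    asks = all(t.strip().endswith("?") for t in teacher_turns)
    avoids = not any("the answer is" in t.lower() for t in teacher_turns)
    return {
        "asks_guiding_question": rubric["asks_guiding_question"] * float(asks),
        "avoids_giving_answer": rubric["avoids_giving_answer"] * float(avoids),
    }


responders = {"teacher": teacher_stub, "student": student_stub}
dialogue = run_dialogue(SCENARIO, responders)
scores = judge_stub(dialogue, SCENARIO["rubric"])
print(scores)
```

In this sketch, changing the scenario means editing only the `SCENARIO` dictionary (agents, turn limit, rubric weights), which is the kind of low-code flexibility the abstract attributes to configuration files. A real setup would replace the stubs with model calls.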
Code & Models
- 🤗 sii-research/InnoSpark-72B-0710 · model · 17 downloads · ♡ 6
- 🤗 sii-research/InnoSpark-7B-0715 · model · 6 downloads · ♡ 2
- 🤗 sii-research/InnoSpark-0.5B-0717 · model · 6 downloads · ♡ 1
- 🤗 sii-research/InnoSpark-HPC-RM-32B · model · 8 downloads · ♡ 2
- 🤗 sii-research/InnoSpark-R-72B-0701 · model · 7 downloads · ♡ 3
- 🤗 sii-research/InnoSpark-72B-1124 · model · 5 downloads
- 🤗 sii-research/InnoSpark-72B-1224 · model · 3 downloads · ♡ 2
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
