Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors
Kaushal Kumar Maurya, KV Aditya Srivatsa, Kseniia Petukhova, and Ekaterina Kochmar

TL;DR
This paper introduces a comprehensive evaluation taxonomy and benchmark for assessing the pedagogical abilities of LLM-powered AI tutors, focusing on mathematical education and grounded in learning sciences principles.
Contribution
It proposes a unified pedagogical evaluation framework, releases the MRBench benchmark with annotated responses, and analyzes LLMs' effectiveness as AI tutors versus question-answering systems.
Findings
The paper assesses the reliability of the Prometheus2 and Llama-3.1-8B LLMs.
The taxonomy enables reliable assessment of AI tutors' pedagogical skills.
MRBench provides a standardized dataset for future evaluation.
Abstract
In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusions in the mathematical domain. We release MRBench - a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess reliability of the popular Prometheus2 and Llama-3.1-8B LLMs…
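To make the benchmark description concrete, here is a minimal sketch of how one annotated response could be represented and summarized per tutor. It is not MRBench's actual schema: the class `AnnotatedResponse`, its field names, the helper `label_share`, and the label values are all hypothetical, and the dimensions are indexed 0 to 7 because their names are not given here. Only the counts (192 conversations, 1,596 responses, eight dimensions) come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass

# Counts taken from the abstract.
NUM_CONVERSATIONS = 192
NUM_RESPONSES = 1596
NUM_DIMENSIONS = 8  # dimension names are not listed here, so they are indexed 0..7


@dataclass
class AnnotatedResponse:
    """Hypothetical record for one tutor response; not the real MRBench format."""
    conversation_id: str
    tutor: str      # hypothetical tutor identifier (an LLM or a human tutor)
    text: str
    labels: tuple   # one hypothetical gold label per pedagogical dimension

    def __post_init__(self):
        # Every response must carry exactly one label per dimension.
        if len(self.labels) != NUM_DIMENSIONS:
            raise ValueError(f"expected {NUM_DIMENSIONS} labels, got {len(self.labels)}")


def label_share(responses, dimension, label):
    """Return, per tutor, the fraction of responses with `label` on `dimension`."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for response in responses:
        totals[response.tutor] += 1
        if response.labels[dimension] == label:
            hits[response.tutor] += 1
    return {tutor: hits[tutor] / totals[tutor] for tutor in totals}


# Average number of responses per conversation implied by the abstract's counts.
avg_responses_per_conversation = NUM_RESPONSES / NUM_CONVERSATIONS  # 8.3125
```

With records of this shape, calling `label_share(responses, 0, "yes")` would give each tutor's share of responses labeled "yes" on the first dimension. That is one simple way to compare tutors dimension by dimension, assuming categorical labels.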
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
