Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors

Kaushal Kumar Maurya; KV Aditya Srivatsa; Kseniia Petukhova; Ekaterina Kochmar

arXiv:2412.09416·cs.CL·February 11, 2025

TL;DR

This paper introduces a comprehensive evaluation taxonomy and benchmark for assessing the pedagogical abilities of LLM-powered AI tutors, focusing on mathematical education and grounded in learning sciences principles.

Contribution

It proposes a unified pedagogical evaluation framework, releases the MRBench benchmark with annotated responses, and analyzes how effective LLMs are as AI tutors rather than as question-answering systems.

Findings

01

The paper assesses the reliability of the popular Prometheus2 and Llama-3.1-8B LLMs.

02

The taxonomy enables reliable assessment of AI tutors' pedagogical skills.

03

MRBench provides a standardized dataset for future evaluation.

Abstract

In this paper, we investigate whether current state-of-the-art large language models (LLMs) are effective as AI tutors and whether they demonstrate the pedagogical abilities necessary for good AI tutoring in educational dialogues. Previous efforts towards evaluation have been limited to subjective protocols and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy with eight pedagogical dimensions based on key learning sciences principles, which is designed to assess the pedagogical value of LLM-powered AI tutor responses grounded in student mistakes or confusions in the mathematical domain. We release MRBench, a new evaluation benchmark containing 192 conversations and 1,596 responses from seven state-of-the-art LLM-based and human tutors, providing gold annotations for eight pedagogical dimensions. We assess the reliability of the popular Prometheus2 and Llama-3.1-8B LLMs…

Code & Models

Repositories

kaushal0494/UnifyingAITutorEvaluation
Official

Videos

Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical Ability Assessment of LLM-Powered AI Tutors · underline

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning