Pedagogy-driven Evaluation of Generative AI-powered Intelligent Tutoring Systems
Kaushal Kumar Maurya, Ekaterina Kochmar

TL;DR
This paper reviews current evaluation practices for AI-powered Intelligent Tutoring Systems, highlighting challenges and proposing research directions for standardized, pedagogically grounded assessment frameworks.
Contribution
It provides a comprehensive review of evaluation methods, identifies key challenges, and proposes new research directions based on learning science principles for ITS assessment.
Findings
Current evaluations rely on subjective and non-standardized benchmarks.
Challenges include the lack of universally accepted, pedagogy-driven evaluation frameworks.
The paper proposes research directions for fair, scalable, and pedagogically grounded evaluation methods.
Abstract
The interdisciplinary research domain of Artificial Intelligence in Education (AIED) has a long history of developing Intelligent Tutoring Systems (ITSs) by integrating insights from technological advancements, educational theories, and cognitive psychology. The remarkable success of generative AI (GenAI) models has accelerated the development of large language model (LLM)-powered ITSs, which have the potential to imitate human-like, pedagogically rich, and cognitively demanding tutoring. However, the progress and impact of these systems remain largely untraceable due to the absence of reliable, universally accepted, and pedagogy-driven evaluation frameworks and benchmarks. Most existing educational dialogue-based ITS evaluations rely on subjective protocols and non-standardized benchmarks, leading to inconsistencies and limited generalizability. In this work, we take a step back from…
