Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli

TL;DR
This paper reports on the BEA 2025 shared task evaluating AI tutors' pedagogical abilities, highlighting current performance levels, approaches used, and the need for further improvement in AI-based educational dialogue systems.
Contribution
It introduces a shared task with five tracks for assessing AI tutor responses, providing a benchmark dataset and an analysis of current model performance.
Findings
Best F1 score for mistake identification: 71.81
Best F1 score for providing guidance: 58.34
Best F1 score for tutor identification: 96.98
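As a rough illustration of how F1 scores like these can be computed from system predictions and gold-standard annotations, here is a minimal sketch. It assumes macro-averaged F1 over a small label set; the averaging scheme, the function name `macro_f1`, and the label strings in the example are assumptions made for illustration and are not taken from this summary.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1 over all labels seen in gold or pred (hypothetical helper)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the gold label but was missed

    labels = sorted(set(gold) | set(pred))
    per_label = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never matches
        per_label.append(2 * tp[label] / denom if denom else 0.0)

    # Macro average: every label counts equally, regardless of frequency
    return sum(per_label) / len(per_label)


# Illustrative labels only (assumed, not from the summary)
gold = ["Yes", "Yes", "No", "To some extent"]
pred = ["Yes", "No", "No", "To some extent"]
score = macro_f1(gold, pred)
```

Macro averaging weights rare labels as heavily as frequent ones, which matters when one class dominates the annotations; a micro-averaged or weighted variant would give different numbers on the same predictions.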
Abstract
This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on the quality of tutor responses aimed at remediating student mistakes within educational dialogues. The task consisted of five tracks. Four were designed to automatically evaluate the AI tutor's performance across key dimensions grounded in learning science principles that define good and effective tutor responses: mistake identification, precise location of the mistake, providing guidance, and feedback actionability. The fifth focused on detecting the tutor's identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four…
Taxonomy
Topics: Engineering Education and Technology
