TutorGym: A Testbed for Evaluating AI Agents as Tutors and Students
Daniel Weitekamp, Momin N. Siddiqui, Christopher J. MacLellan

TL;DR
TutorGym provides a standardized platform to evaluate AI agents as tutors and learners within existing intelligent tutoring systems, enabling direct assessment of their interactive and adaptive capabilities in educational contexts.
Contribution
The paper introduces TutorGym, a testbed that situates AI agents within real intelligent tutoring system (ITS) interfaces to evaluate their tutoring and learning behaviors.
Findings
Current LLMs perform poorly as tutors, with low accuracy in labeling incorrect actions.
LLMs can generate human-like learning curves when trained as students.
TutorGym supports diverse AI agents across 223 tutor domains.
Abstract
Recent improvements in large language model (LLM) performance on academic benchmarks, such as MATH and GSM8K, have emboldened their use as standalone tutors and as simulations of human learning. However, these new applications require more than evaluations of final solution generation. We introduce TutorGym to evaluate these applications more directly. TutorGym is a standard interface for testing artificial intelligence (AI) agents within existing intelligent tutoring systems (ITS) that have been tested and refined in classroom studies, including Cognitive Tutors (CTAT), Apprentice Tutors, and OATutors. TutorGym is more than a simple problem-solution benchmark: it situates AI agents within the interactive interfaces of existing ITSs. At each step of problem-solving, AI agents are asked what they would do as a tutor or as a learner. As tutors, AI agents are prompted to provide tutoring…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
