The Missing Evaluation Axis: What 10,000 Student Submissions Reveal About AI Tutor Effectiveness
Rose Niousha, Samantha Boatright Smith, Bita Akram, Peter Brusilovsky, Arto Hellas, Juho Leinonen, John DeNero, Narges Norouzi

TL;DR
This paper introduces a behavioral evaluation framework for AI tutors that analyzes what students do with the feedback they receive, so that tutor effectiveness is assessed on more than pedagogical quality alone.
Contribution
The paper proposes an evaluation approach based on student behavior data and applies it to two deployed AI tutors, revealing differences that assessment of pedagogical quality alone does not capture.
Findings
Student engagement differs substantially between the two AI tutors compared.
Behavioral signals correlate more strongly with perceived helpfulness than pedagogical quality.
The framework uncovers differences in student actions not visible through feedback quality alone.
Abstract
Current Artificial Intelligence (AI)-based tutoring systems (AI tutors) are primarily evaluated based on the pedagogical quality of their feedback messages. While important, pedagogy alone is insufficient because it ignores a critical question: what do students actually do with the feedback they receive? We argue that AI tutor evaluation should be extended with a behavioral dimension grounded in student interaction data, which complements pedagogical assessment. We propose an evaluation framework and apply it to 10,235 code submissions with corresponding AI tutor feedback from an introductory undergraduate programming course to measure whether students act on tutor feedback and whether those actions are applied correctly. Using this framework to compare two deployed AI tutors across different semesters in a large-scale introductory computer science course reveals substantial differences…
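To make the two behavioral questions in the abstract concrete (do students act on feedback, and are those actions applied correctly), here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `FeedbackEvent` schema, the `behavioral_rates` helper, the tutor names, and the toy data are all hypothetical, and the paper may define and label these signals differently.

```python
from dataclasses import dataclass


@dataclass
class FeedbackEvent:
    """One tutor feedback message and the student's follow-up (hypothetical schema)."""
    tutor: str               # which AI tutor produced the feedback
    acted_on: bool           # did the next submission attempt the suggested change?
    applied_correctly: bool  # if acted on, was the change applied correctly?


def behavioral_rates(events):
    """Return {tutor: (act_rate, correct_rate)}.

    act_rate     = acted-on events / all feedback events
    correct_rate = correctly applied events / acted-on events (0.0 if none acted on)
    """
    counts = {}
    for e in events:
        total, acted, correct = counts.get(e.tutor, (0, 0, 0))
        counts[e.tutor] = (
            total + 1,
            acted + int(e.acted_on),
            correct + int(e.acted_on and e.applied_correctly),
        )
    rates = {}
    for tutor, (total, acted, correct) in counts.items():
        act_rate = acted / total
        correct_rate = correct / acted if acted else 0.0
        rates[tutor] = (act_rate, correct_rate)
    return rates


# Toy data, invented purely for illustration (not from the paper).
events = [
    FeedbackEvent("tutor_a", True, True),
    FeedbackEvent("tutor_a", True, False),
    FeedbackEvent("tutor_a", False, False),
    FeedbackEvent("tutor_a", True, True),
    FeedbackEvent("tutor_b", False, False),
    FeedbackEvent("tutor_b", True, True),
]
rates = behavioral_rates(events)
```

Keeping the two rates separate reflects the distinction the abstract draws: a tutor whose feedback is often acted on but rarely applied correctly would look different from one whose feedback is seldom acted on at all, even if both received similar pedagogical quality scores.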
