Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
Zixin Chen, Peng Liu, Rui Sheng, Haobo Li, Jianhong Tu, Xiaodong Deng, Kashun Shum, Dayiheng Liu, Huamin Qu

TL;DR
EduAgentBench is a comprehensive benchmark designed to evaluate language agents' ability to perform complex, real-world teaching tasks, highlighting current models' strengths and limitations in pedagogical judgment and workflow execution.
Contribution
The paper introduces EduAgentBench, the first holistic, theory-grounded benchmark for assessing the full scope of teaching capabilities in language agents.
Findings
Models perform well in bounded pedagogical judgment.
Models fall short in situated multi-turn tutoring.
Models are limited in autonomous teaching workflow execution.
Abstract
Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven…
