Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

Zixin Chen; Peng Liu; Rui Sheng; Haobo Li; Jianhong Tu; Xiaodong Deng; Kashun Shum; Dayiheng Liu; Huamin Qu

arXiv:2605.14322 · cs.AI · May 22, 2026

TL;DR

EduAgentBench is a comprehensive benchmark designed to evaluate language agents' ability to perform complex, real-world teaching tasks, highlighting current models' strengths and limitations in pedagogical judgment and workflow execution.

Contribution

The paper introduces EduAgentBench, the first holistic, theory-grounded benchmark for assessing the full scope of teaching capabilities in language agents.

Findings

01. Models perform well in bounded pedagogical judgment.

02. Models fall short in situated multi-turn tutoring.

03. Models are limited in autonomous teaching workflow execution.

Abstract

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven…
