Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

TL;DR
This study benchmarks LLM-based tutoring agents in propositional logic. The agents reliably recognize optimal solutions but misjudge valid-but-suboptimal and incorrect ones, which limits their diagnostic accuracy and pedagogical usefulness.
Contribution
The paper introduces a benchmark for LLM tutoring agents, exposes their diagnostic shortcomings, and proposes hybrid architectures to improve tutoring effectiveness.
Findings
LLMs perform near-ceiling on optimal solutions
LLMs over-reject valid but suboptimal solutions
LLMs over-validate incorrect solutions
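The three findings above can be read as per-category error rates. The following is a minimal sketch of how such rates might be computed, assuming each benchmark item carries a ground-truth category and a binary accept/reject verdict from the model. The label names, the `diagnostic_error_rates` function, and the sample data are hypothetical illustrations and are not taken from the paper.

```python
from collections import Counter

# Hypothetical category labels; the paper's actual schema is not given here.
OPTIMAL, SUBOPTIMAL, INCORRECT = "optimal", "valid_suboptimal", "incorrect"


def diagnostic_error_rates(pairs):
    """Return the error rate per ground-truth category.

    pairs: iterable of (ground_truth_label, model_accepted) tuples.
    Rejecting valid work (optimal or suboptimal) is an over-rejection;
    accepting incorrect work is an over-validation.
    """
    totals = Counter()
    errors = Counter()
    for truth, accepted in pairs:
        totals[truth] += 1
        is_valid = truth in (OPTIMAL, SUBOPTIMAL)
        if is_valid != accepted:
            errors[truth] += 1
    return {label: errors[label] / totals[label] for label in totals}


# Made-up toy data, for illustration only.
sample = [
    (OPTIMAL, True), (OPTIMAL, True),
    (SUBOPTIMAL, True), (SUBOPTIMAL, False),
    (INCORRECT, True), (INCORRECT, False),
    (INCORRECT, False), (INCORRECT, True),
]
rates = diagnostic_error_rates(sample)
```

Under this framing, a near-zero rate for `optimal` alongside high rates for the other two categories would match the pattern the findings describe.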
Abstract
Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution-feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably…
