Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Tahreem Yasir; Wenbo Li; Sam Gilson; Sutapa Dey Tithi; Xiaoyi Tian; Tiffany Barnes

arXiv:2605.16207·cs.AI·May 18, 2026

TL;DR

This study benchmarks LLM-based tutoring agents in propositional logic and finds that they reliably recognize optimal solutions but misjudge valid but suboptimal solutions and incorrect solutions, which limits their diagnostic accuracy and pedagogical usefulness.

Contribution

It introduces a benchmark for LLM tutoring agents, built on knowledge-graph-derived ground truth, that exposes their diagnostic shortcomings, and it proposes hybrid architectures to improve tutoring effectiveness.

Findings

01 LLMs perform near-ceiling on optimal solutions

02 LLMs over-reject valid but suboptimal solutions

03 LLMs over-validate incorrect solutions

Abstract

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution-feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably…
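To make the three-way distinction concrete, the following is a minimal sketch of how per-category accuracy, over-rejection, and over-validation could be tallied from labeled solution-feedback pairs. It is not the authors' evaluation code. The function name `diagnostic_rates`, the label strings, the toy data, and the exact definitions used (over-rejection as a valid but suboptimal solution judged incorrect; over-validation as an incorrect solution judged optimal or suboptimal) are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical label names for the three solution categories in the abstract.
LABELS = ("optimal", "suboptimal", "incorrect")


def diagnostic_rates(pairs):
    """Tally diagnostic outcomes from (ground_truth, predicted) label pairs.

    Assumed definitions (not taken from the paper):
      over-rejection  = valid but suboptimal solution judged incorrect
      over-validation = incorrect solution judged optimal or suboptimal
    """
    totals = Counter()
    correct = Counter()
    over_rejected = 0
    over_validated = 0

    for truth, predicted in pairs:
        totals[truth] += 1
        if truth == predicted:
            correct[truth] += 1
        elif truth == "suboptimal" and predicted == "incorrect":
            over_rejected += 1
        elif truth == "incorrect" and predicted in ("optimal", "suboptimal"):
            over_validated += 1

    def rate(count, total):
        # Avoid dividing by zero when a category is absent from the data.
        return count / total if total else None

    return {
        "accuracy": {label: rate(correct[label], totals[label]) for label in LABELS},
        "over_rejection_rate": rate(over_rejected, totals["suboptimal"]),
        "over_validation_rate": rate(over_validated, totals["incorrect"]),
    }


# Toy data for illustration only; these are not results from the benchmark.
toy_pairs = [
    ("optimal", "optimal"),
    ("optimal", "optimal"),
    ("suboptimal", "incorrect"),
    ("suboptimal", "suboptimal"),
    ("incorrect", "optimal"),
    ("incorrect", "incorrect"),
    ("incorrect", "suboptimal"),
    ("incorrect", "incorrect"),
]

rates = diagnostic_rates(toy_pairs)
print(rates)
```

Reporting accuracy per category rather than as a single aggregate matters here: a model that confirms every optimal solution can still score well overall while misjudging the suboptimal and incorrect cases.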
