When Verification Hurts: Asymmetric Effects of Multi-Agent Feedback in Logic Proof Tutoring
Tahreem Yasir, Sutapa Dey Tithi, Benyamin Tabarsi, Dmitri Droujkov, Sam Gilson, Yasitha Rajapaksha, Xiaoyi Tian, Arun Ramesh, DongKuan (DK) Xu, Tiffany Barnes

TL;DR
This paper investigates how adding a verification step to LLM feedback in logic proof tutoring can both help and hurt: its effect depends on the accuracy of the initial feedback and on proof complexity.
Contribution
It introduces a new benchmark and analysis framework for step-level feedback in propositional logic tutoring, highlighting asymmetric effects of verification.
Findings
Verification improves outcomes when initial feedback is error-prone (<70% accuracy).
Verification degrades performance when feedback is already reliable (>85%).
No model reliably solves proof states with complexity above 4-5.
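The asymmetry in the first two findings can be illustrated with a toy model. This is a minimal sketch and not the paper's analysis: it assumes a verifier that repairs incorrect upstream feedback at a rate `fix_rate` and wrongly overturns correct feedback at a rate `harm_rate`. Both parameter names and the values used below are hypothetical, chosen only for illustration.

```python
def accuracy_after_verification(p: float, fix_rate: float, harm_rate: float) -> float:
    """Expected feedback accuracy after a verification pass (toy model).

    p         -- accuracy of the upstream feedback before verification
    fix_rate  -- assumed probability the verifier repairs incorrect feedback
    harm_rate -- assumed probability the verifier overturns correct feedback
    """
    kept_correct = p * (1.0 - harm_rate)    # correct feedback that survives
    repaired = (1.0 - p) * fix_rate         # incorrect feedback that gets fixed
    return kept_correct + repaired


def break_even_accuracy(fix_rate: float, harm_rate: float) -> float:
    """Upstream accuracy at which verification neither helps nor hurts.

    Setting accuracy_after_verification(p) == p and solving for p gives
    p = fix_rate / (fix_rate + harm_rate).
    """
    return fix_rate / (fix_rate + harm_rate)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    fix_rate, harm_rate = 0.6, 0.2
    print("break-even:", break_even_accuracy(fix_rate, harm_rate))
    for p in (0.65, 0.90):
        after = accuracy_after_verification(p, fix_rate, harm_rate)
        print(f"upstream={p:.2f} after={after:.2f} change={after - p:+.2f}")
```

Under these assumptions, verification raises accuracy when upstream feedback falls below the break-even point and lowers it above that point. This matches the direction of the reported findings, although the paper's actual thresholds come from its experiments rather than from this model.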
Abstract
Large language models (LLMs) are increasingly used for automated tutoring, but their reliability in structured symbolic domains remains unclear. We study step-level feedback for propositional logic proofs, which require precise symbolic reasoning aligned with a learner's current proof state. We introduce a knowledge-graph-grounded benchmark of 516 unique proof states with step-level annotations and difficulty metrics. Unlike prior tutoring evaluations that rely on model self-assessment or binary correctness, our framework enables fine-grained analysis of feedback quality against verified solution paths. We evaluate three role-specialized pipelines with varying solution access: Tutor (partial solution access), Teacher (full derivation access), and Judge (verification of Tutor feedback). Our results reveal a striking asymmetry: verification improves outcomes when upstream feedback is…
