Discerning minds or generic tutors? Evaluating instructional guidance capabilities in Socratic LLMs
Ying Liu, Can Li, Ting Zhang, Mei Wang, Qiannan Zhu, Jian Li, Hua Huang

TL;DR
This paper introduces GuideEval, a benchmark for assessing LLMs' ability to adaptively guide learners in educational dialogues, highlighting current shortcomings and proposing finetuning strategies to improve pedagogical guidance.
Contribution
It presents a new benchmark and evaluation framework for LLMs' instructional guidance, along with a finetuning method to enhance their adaptive pedagogical capabilities.
Findings
Existing LLMs struggle with effective adaptive scaffolding.
Behavior-guided finetuning improves guidance performance.
The study emphasizes learner-centered, state-aware interaction in Socratic dialogues.
Abstract
The conversational capabilities of large language models hold significant promise for enabling scalable and interactive tutoring. While prior research has primarily examined their ability to generate Socratic questions, it often overlooks a critical aspect: adaptively guiding learners in accordance with their cognitive states. This study moves beyond question generation to emphasize instructional guidance capability. We ask: Can LLMs emulate expert tutors who dynamically adjust strategies in response to learners' states? To investigate this, we propose GuideEval, a benchmark grounded in authentic educational dialogues that evaluates pedagogical guidance through a three-phase behavioral framework: (1) Perception, inferring learner states; (2) Orchestration, adapting instructional strategies; and (3) Elicitation, stimulating proper reflections. Empirical results indicate that existing…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Clear behavioral decomposition with actionable metrics. The three-phase split translates “be a better tutor” into concrete, checkable behaviors, offering conceptual clarity and operational guidance that enable reproducible, phase-wise diagnosis across different models. 2. Useful failure taxonomy grounded in qualitative evidence. The paper goes beyond reporting average behaviors and highlights failure modes supported by dialogue snippets, providing interpretability and practical insight into
1. Human–LLM agreement is reported without sample size or reliability statistics. In Table 3, the claim that “LLMs can serve as reliable and scalable evaluators of instructional behaviors” rests on high agreement ratios and minimal score deviations, but the paper omits sample size per metric or level, sampling protocol, number of human raters, and inter-rater reliability. Without these, chance agreement and selection bias cannot be ruled out, especially with coarse labels (binary or 3-point) tha
1. The paper addresses a critical gap in LLM tutoring evaluation by focusing on adaptive guidance rather than static content quality. 2. The three-phase model is well-motivated by educational psychology literature and operationalized into measurable metrics. 3. The experiments cover 14 diverse models, revealing consistent failure patterns across architectures. 4. The paper contains a detailed failure case analysis, providing an intuitive understanding beyond quantitative metrics. 5. I really like the
1. The scope is limited: the dataset covers only middle school science problems in Chinese. The benchmark would be more comprehensive if it were expanded to other difficulty levels and languages. 2. The cognitive model with 4 states (Accurate, Erroneous, Comprehension, Confusion) may be too simplified to capture nuanced learning states. As the authors acknowledge, it doesn't capture individual learner profiles, misconception history, or engagement patterns
- The paper conducts an extensive analysis of various LLMs and provides several insights into their ability to recognize learner states, to guide / scaffold, and to elicit further follow-ups. - The collected dataset GuideEval can help advance the field further. - LLM-based scoring was validated against human annotations. - The failure analysis provides useful insights.
- The authors evaluated the consistency between LLM-based scoring and the human annotators using the proportion of matching labels. I am not sure this is the right approach, since the raw proportion of agreement can be misleading, especially when the label distribution is imbalanced. I believe there are more appropriate inter-rater agreement metrics that account for this. - The failure analysis categorizes the types of failures but the authors did not seem to pro
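The reviewers' concern about raw agreement can be illustrated with a small sketch. It compares the proportion of matching labels with Cohen's kappa, one common chance-corrected agreement statistic. The reviews do not name a specific metric, so the choice of kappa is an assumption here. The label lists are hypothetical and are not data from the paper.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two raters assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Expected agreement if each rater labelled independently
    # according to their own label frequencies.
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined in this case. Returning 1.0 is a choice of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary labels with a strongly imbalanced distribution:
# the human rater marks 18 of 20 items as 1, the LLM marks all 20 as 1.
human_labels = [1] * 18 + [0] * 2
llm_labels = [1] * 20

agreement = percent_agreement(human_labels, llm_labels)
kappa = cohens_kappa(human_labels, llm_labels)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

In this made-up case the two raters match on 90% of items, yet kappa is 0, because an evaluator that always outputs the majority label reaches the same agreement by chance. This is the kind of ambiguity that reporting sample sizes, label distributions, and a chance-corrected statistic would help resolve.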
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Innovative Teaching and Learning Methods · Topic Modeling
