SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset
Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, and Aimin Zhou

TL;DR
This paper introduces SID, a benchmark dataset and evaluation framework for assessing how well large language models provide guided, interdisciplinary Socratic instruction in STEM education. Baseline results show that even state-of-the-art models struggle with this kind of guidance.
Contribution
The paper presents the first benchmark for evaluating LLMs' higher-order guidance in interdisciplinary STEM dialogues, including a large dataset, annotation schema, and new metrics.
Findings
State-of-the-art LLMs struggle with effective guided instruction in STEM dialogues.
The SID benchmark reveals gaps in LLMs' pedagogical capabilities.
Baseline results highlight the need to develop more pedagogically aware models.
Abstract
Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education. Interdisciplinary STEM is a key pathway to this goal, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art…
