SID: Benchmarking Guided Instruction Capabilities in STEM Education with a Socratic Interdisciplinary Dialogues Dataset

Mei Jiang, Houping Yue, Bingdong Li, Hao Hao, Ying Qian, Bo Jiang, and Aimin Zhou

arXiv:2508.04563·cs.AI·August 7, 2025

TL;DR

This paper introduces SID, a benchmark dataset and evaluation framework for assessing how well large language models provide guided, interdisciplinary Socratic instruction in STEM education. Baseline experiments indicate that even state-of-the-art models struggle with this kind of guidance.

Contribution

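The paper presents the first benchmark for evaluating LLMs' higher-order guidance in multi-turn, interdisciplinary STEM dialogues. It comprises a dataset of 10,000 dialogue turns across 48 complex STEM projects, an annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG).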

Findings

01

State-of-the-art LLMs struggle with effective guided instruction in STEM dialogues.

02

The SID benchmark reveals gaps in LLMs' pedagogical capabilities.

03

Baseline results highlight the need to develop more pedagogically aware models.

Abstract

Fostering students' abilities for knowledge integration and transfer in complex problem-solving scenarios is a core objective of modern education, and interdisciplinary STEM is a key pathway to achieve this, yet it requires expert guidance that is difficult to scale. While LLMs offer potential in this regard, their true capability for guided instruction remains unclear due to the lack of an effective evaluation benchmark. To address this, we introduce SID, the first benchmark designed to systematically evaluate the higher-order guidance capabilities of LLMs in multi-turn, interdisciplinary Socratic dialogues. Our contributions include a large-scale dataset of 10,000 dialogue turns across 48 complex STEM projects, a novel annotation schema for capturing deep pedagogical features, and a new suite of evaluation metrics (e.g., X-SRG). Baseline experiments confirm that even state-of-the-art…
