Building Scaffolding Dialogue Data with LLM-Simulated Novices

Si Chen; Izzy Molnar; Ting Hua; Peiyu Li; Le Huy Khiem; G. Alex Ambrose; Jim Lang; Ronald Metoyer; Nitesh V. Chawla

arXiv:2508.04428·cs.AI·February 5, 2026

TL;DR

This paper introduces SimInstruct, a scalable tool in which LLMs simulate novices in dialogue with human experts, enabling the collection of high-quality scaffolding data without recruiting real novices.

Contribution

The paper presents an expert-in-the-loop framework in which LLMs play the novice and human experts supply the scaffolding, yielding pedagogically rich dialogues for data collection and model training in education.

Findings

01 SimInstruct dialogues are comparable to real mentoring in pedagogical relevance.

02 Persona traits influence expert engagement and dialogue quality.

03 Fine-tuned LLaMA outperforms GPT-4o in instructional quality.

Abstract

High-quality, multi-turn instructional dialogues between novices and experts are essential for developing AI systems that support teaching, learning, and decision-making. These dialogues often involve scaffolding -- the process by which an expert supports a novice's thinking through questions, feedback, and step-by-step guidance. However, such data are scarce due to privacy concerns in recording and the vulnerability inherent in help-seeking. We present SimInstruct, a scalable, expert-in-the-loop tool for collecting scaffolding dialogues. Using teaching development coaching as an example domain, SimInstruct simulates novice instructors via LLMs, varying their teaching challenges and LLM's persona traits, while human experts provide multi-turn feedback, reasoning, and instructional support. This design enables the creation of realistic, pedagogically rich dialogues without requiring real…
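
The simulation setup described in the abstract (novice profiles that vary by teaching challenge and persona trait) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the challenge texts, trait names, trait levels, the `NoviceProfile` class, and the prompt wording are all hypothetical, and no LLM is called. It shows only how the combinations could be enumerated and turned into system prompts for a simulated novice.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical example values; the challenges and traits actually used
# in SimInstruct are not listed in this summary.
CHALLENGES = [
    "low participation in class discussions",
    "students skipping assigned readings",
]
TRAITS = {
    "talkativeness": ["low", "high"],
    "openness_to_feedback": ["low", "high"],
}


@dataclass(frozen=True)
class NoviceProfile:
    """One simulated novice: a teaching challenge plus persona trait levels."""
    challenge: str
    traits: tuple  # (trait_name, level) pairs, ordered by trait name


def build_profiles(challenges, traits):
    """Enumerate every challenge x trait-level combination."""
    names = sorted(traits)
    profiles = []
    for challenge in challenges:
        for levels in product(*(traits[name] for name in names)):
            profiles.append(NoviceProfile(challenge, tuple(zip(names, levels))))
    return profiles


def system_prompt(profile):
    """Render a profile as a system prompt for the LLM playing the novice."""
    trait_text = ", ".join(f"{name}: {level}" for name, level in profile.traits)
    return (
        "You are role-playing a novice instructor seeking coaching. "
        f"Your teaching challenge: {profile.challenge}. "
        f"Persona traits: {trait_text}. "
        "Respond only as the novice; a human expert writes the coach turns."
    )


profiles = build_profiles(CHALLENGES, TRAITS)
for profile in profiles[:2]:
    print(system_prompt(profile))
```

Keeping each profile as an immutable record makes it straightforward to log which challenge and trait combination produced a given dialogue, which is the kind of bookkeeping an analysis of persona effects on expert engagement would need.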
