SCALEFeedback: A Large-Scale Dataset of Synthetic Computer Science Assignments for LLM-generated Educational Feedback Research
Keyang Qian, Kaixun Yang, Wei Dai, Flora Jin, Yixin Cheng, Rui Guan, Sadia Nawaz, Zachari Swiecki, Guanliang Chen, Lixiang Yan, Dragan Gašević

TL;DR
This paper introduces SCALEFeedback, a large-scale synthetic dataset of computer science assignments generated by LLMs, enabling research on automated educational feedback while protecting student privacy.
Contribution
It presents the Sophisticated Assignment Mimicry (SAM) framework for creating realistic synthetic datasets from real assignments, facilitating scalable research on AI-driven educational feedback.
Findings
Synthetic data closely matches real data in quality metrics
LLM-generated feedback is as effective as real feedback
The dataset protects student privacy and supports scalable research
Abstract
Using LLMs to give students educational feedback on their assignments has attracted much attention in the AI in Education field. Yet there is currently no large-scale open-source dataset of student assignments that includes detailed assignment descriptions, rubrics, and student submissions across various courses. As a result, research on generalisable methodologies for automatically generating effective and responsible educational feedback remains limited. In the current study, we constructed a large-scale dataset of Synthetic Computer science Assignments for LLM-generated Educational Feedback research (SCALEFeedback). We proposed a Sophisticated Assignment Mimicry (SAM) framework that generates the synthetic dataset through one-to-one LLM-based imitation of real assignment descriptions and student submissions, producing a synthetic version of each. Our open-source dataset contains 10,000…
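The one-to-one imitation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' SAM implementation: the `Assignment` record, the `build_prompt` wording, the `mimic` function, and the `echo_llm` stand-in are all hypothetical names chosen for illustration, and the LLM is modelled as any callable that maps a prompt string to a completion string.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Assignment:
    """Hypothetical record pairing an assignment description with one submission."""
    description: str
    submission: str


def build_prompt(kind: str, real_text: str) -> str:
    # Assumed prompt wording; the paper's actual prompts are not shown in this summary.
    return (
        f"Write a new {kind} that imitates the topic, structure, and quality "
        "of the one below, without copying names or identifying details.\n\n"
        + real_text
    )


def mimic(real: List[Assignment], llm: Callable[[str], str]) -> List[Assignment]:
    """Produce exactly one synthetic record per real record (one-to-one)."""
    synthetic = []
    for item in real:
        synthetic.append(
            Assignment(
                description=llm(build_prompt("assignment description", item.description)),
                submission=llm(build_prompt("student submission", item.submission)),
            )
        )
    return synthetic


def echo_llm(prompt: str) -> str:
    # Stand-in for a real model call: tags the last line of the prompt.
    return "SYNTHETIC: " + prompt.splitlines()[-1]


real_data = [
    Assignment("Implement a stack.", "class Stack: pass"),
    Assignment("Sort a list of integers.", "def sort(xs): return sorted(xs)"),
]
synthetic_data = mimic(real_data, echo_llm)
```

Passing the model in as a callable keeps the sketch independent of any particular LLM provider; swapping `echo_llm` for a real client would not change the one-to-one structure, in which the synthetic list always has the same length and order as the real one.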
Taxonomy
Topics: Online Learning and Analytics
