LLM-itation is the Sincerest Form of Data: Generating Synthetic Buggy Code Submissions for Computing Education

Juho Leinonen; Paul Denny; Olli Kiljunen; Stephen MacNeil; Sami Sarsa; Arto Hellas

arXiv:2411.10455·cs.CY·November 19, 2024

Open Access

TL;DR

This paper demonstrates that large language models can generate synthetic buggy code submissions that closely resemble real student data, addressing privacy concerns and aiding computing education research.

Contribution

It introduces a method using GPT-4o to create realistic synthetic buggy code data, enabling privacy-preserving research and development in computing education.

Findings

01

Synthetic data closely matches real student test failure distributions.

02

LLMs can generate diverse, realistic incorrect code submissions.

03

Synthetic datasets can support research without compromising student privacy.
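To make the first finding concrete, the following is a minimal sketch of one way to compare test failure distributions between a real and a synthetic dataset. The metric (total variation distance), the helper names, and the toy data are all assumptions made for illustration; the summary above does not state which comparison the paper uses, and the numbers below are invented, not results from the paper.

```python
from collections import Counter


def failure_distribution(failed_tests):
    """Convert a list of failed test names (one per submission) to relative frequencies."""
    counts = Counter(failed_tests)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented toy data: the first failing test case for each buggy submission.
# These names and counts are hypothetical and do not come from the paper.
real = ["test_empty", "test_empty", "test_negative", "test_large"]
synthetic = ["test_empty", "test_negative", "test_negative", "test_large"]

real_dist = failure_distribution(real)
synthetic_dist = failure_distribution(synthetic)
tvd = total_variation_distance(real_dist, synthetic_dist)
print(f"Total variation distance: {tvd:.2f}")
```

A value near 0 would indicate that the synthetic submissions fail the same tests at similar rates as the real ones; other measures, such as a chi-squared test on the raw counts, could be swapped in the same way.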

Abstract

There is a great need for data in computing education research. Data is needed to understand how students behave, to train models of student behavior to optimally support students, and to develop and validate new assessment tools and learning analytics techniques. However, relatively few computing education datasets are shared openly, often due to privacy regulations and issues in making sure the data is anonymous. Large language models (LLMs) offer a promising approach to create large-scale, privacy-preserving synthetic data, which can be used to explore various aspects of student learning, develop and test educational technologies, and support research in areas where collecting real student data may be challenging or impractical. This work explores generating synthetic buggy code submissions for introductory programming exercises using GPT-4o. We compare the distribution of test case…

Taxonomy

Topics: Open Education and E-Learning · Digital Rights Management and Security · Mathematics, Computing, and Information Processing