Synthetic Students: A Comparative Study of Bug Distribution Between Large Language Models and Computing Students
Stephen MacNeil, Magdalena Rogalska, Juho Leinonen, Paul Denny, Arto Hellas, Xandria Crosland

TL;DR
This study compares bug patterns in code generated by large language models and real students, finding that guided LLMs can produce realistic error distributions useful for educational data simulation.
Contribution
The paper demonstrates that, with guided prompts, LLMs can generate synthetic student bug data whose error patterns resemble those of real students, which can support the development of educational tools.
Findings
Unguided LLMs do not produce realistic bug distributions.
Guided prompts enable LLMs to generate plausible error patterns.
Realistic synthetic bug data can aid educational tool development.
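One way to make "realistic bug distribution" concrete is to compare how often each bug category appears in LLM-generated code versus student code. The sketch below is an illustration only: the summary here does not state which comparison measure the authors used, so total variation distance is an assumed stand-in, and the category names and labels are hypothetical, not data from the paper.

```python
from collections import Counter


def category_distribution(labels):
    """Turn a list of bug-category labels into relative frequencies."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two categorical distributions.

    0.0 means identical distributions; 1.0 means no overlap at all.
    This measure is an assumption for illustration, not the paper's metric.
    """
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical labels, invented for illustration (not data from the study).
student_bugs = ["off_by_one", "off_by_one", "wrong_operator", "missing_return"]
llm_bugs = ["syntax", "syntax", "syntax", "off_by_one"]

student_dist = category_distribution(student_bugs)
llm_dist = category_distribution(llm_bugs)
distance = total_variation_distance(student_dist, llm_dist)
print(f"Total variation distance: {distance:.2f}")
```

A smaller distance would indicate that the generated bugs fall into categories in proportions closer to those observed among students; any threshold for "realistic enough" would have to be chosen for the task at hand.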
Abstract
Large language models (LLMs) present an exciting opportunity for generating synthetic classroom data. Such data could include code containing a typical distribution of errors, simulated student behaviour to address the cold start problem when developing education tools, and synthetic user data when access to authentic data is restricted due to privacy reasons. In this research paper, we conduct a comparative study examining the distribution of bugs generated by LLMs in contrast to those produced by computing students. Leveraging data from two previous large-scale analyses of student-generated bugs, we investigate whether LLMs can be coaxed to exhibit bug patterns that are similar to authentic student bugs when prompted to inject errors into code. The results suggest that, unguided, LLMs do not generate plausible error distributions, and many of the generated errors are unlikely to be…
