Howzat? Appealing to Expert Judgement for Evaluating Human and AI Next-Step Hints for Novice Programmers
Neil C. C. Brown, Pierre Weill-Tessier, Juho Leinonen, Paul Denny, Michael Kölling

TL;DR
This study evaluates the quality of AI-generated next-step hints for novice programmers, finding that GPT-4 with optimized prompts can outperform human educators in producing pedagogically valuable hints.
Contribution
It demonstrates that well-designed prompts enable LLMs, especially GPT-4, to generate high-quality next-step programming hints that educators rank above those written by human experts.
Findings
GPT-4 outperforms other models and human experts in hint quality
Optimal hints are 80-160 words long with a US grade 9 reading level
Multi-stage prompts significantly improve hint quality
Abstract
Motivation: Students learning to program often reach states where they are stuck and can make no forward progress. An automatically generated next-step hint can help them move forward and support their learning. It is important to know what makes a good or bad hint, and how to generate good hints automatically in novice programming tools, for example using Large Language Models (LLMs).
Method and participants: We recruited 44 Java educators from around the world to participate in an online study. We used a set of real student code states as hint-generation scenarios. Participants used a technique known as comparative judgement to rank a set of candidate next-step Java hints, which were generated by LLMs and by five experienced human educators. Participants ranked the hints without being told how they were generated.
Findings: We found that…
