Howzat? Appealing to Expert Judgement for Evaluating Human and AI Next-Step Hints for Novice Programmers

Neil C. C. Brown; Pierre Weill-Tessier; Juho Leinonen; Paul Denny; Michael Kölling

arXiv:2411.18151·cs.CY·June 3, 2025

TL;DR

This study evaluates the quality of AI-generated next-step hints for novice programmers, finding that GPT-4 with optimized prompts can outperform human educators in producing pedagogically valuable hints.

Contribution

This work shows that well-designed prompts enable LLMs, especially GPT-4, to generate next-step programming hints that educators rank above those written by human experts.

Findings

01 GPT-4 outperforms other models and human experts in hint quality

02 Optimal hints are 80-160 words long and written at a US grade 9 reading level

03 Multi-stage prompts significantly improve hint quality

Abstract

Motivation: Students learning to program often reach states where they are stuck and cannot make forward progress. An automatically generated next-step hint can help them move forward and support their learning. It is important to know what makes a hint good or bad, and how to generate good hints automatically in novice programming tools, for example using Large Language Models (LLMs).

Method and participants: We recruited 44 Java educators from around the world to participate in an online study. We used a set of real student code states as hint-generation scenarios. Participants used a technique known as comparative judgement to rank a set of candidate next-step Java hints, which were generated by LLMs and by five experienced human educators. Participants ranked the hints without being told how they were generated.

Findings: We found that…
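To make the comparative judgement idea more concrete, the sketch below shows one common way that many pairwise "this hint is better than that one" decisions can be turned into a single ranking: fitting a Bradley-Terry model with a simple iterative update. This is an illustration only. The abstract does not say how the authors aggregated the judgements, so the choice of Bradley-Terry, the function name `bradley_terry`, and the hint labels `"A"`, `"B"`, `"C"` are all assumptions made for this example.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=200):
    """Estimate a strength score per item from (winner, loser) pairs.

    Hypothetical helper for illustration; not taken from the paper.
    Uses the classic iterative (minorize-maximize) update for the
    Bradley-Terry model and rescales scores so they average to 1.
    """
    items = sorted({item for pair in comparisons for item in pair})
    wins = defaultdict(int)         # total wins per item
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if i == j:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # An item that never wins drifts to strength 0 (a known
            # limitation of the plain model without smoothing).
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        strength = {k: v * len(items) / total for k, v in updated.items()}
    return strength


# Made-up judgements: each tuple is (preferred hint, other hint).
judgements = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy data every pair of hints is compared the same number of times, so the fitted ranking matches the raw win counts. The model becomes more useful when judges see different, partially overlapping subsets of hints, which is the typical situation in a comparative judgement study.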
