Evaluating LLMs for Answering Student Questions in Introductory Programming Courses
Thomas Van Mullem, Bart Mesuere, Peter Dawyndt

TL;DR
This paper evaluates how well Large Language Models (LLMs) can help answer student questions in introductory programming courses. It proposes a rigorous evaluation framework and highlights the potential for models such as Gemini 3 Flash to outperform typical educator responses.
Contribution
It introduces a reproducible evaluation process with a custom pedagogical metric and advocates for a pre-deployment validation framework for educational LLM tools.
Findings
Gemini 3 Flash surpasses typical educator response quality.
A custom LLM-as-a-Judge metric effectively assesses pedagogical accuracy.
A task-agnostic evaluation framework is proposed for educational LLM deployment.
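To make the LLM-as-a-Judge idea in the findings above concrete, the following is a minimal sketch of how such an evaluation loop might be wired up. It is not the paper's implementation: the names `BenchmarkItem`, `Judge`, `mean_judge_score`, and `toy_judge` are hypothetical, the 0-to-1 score range is an assumption, and `toy_judge` is a word-overlap placeholder that exists only so the sketch runs without a model. A real judge would call an LLM with a pedagogical rubric.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class BenchmarkItem:
    """One benchmark entry (hypothetical schema, not the paper's)."""
    question: str   # the student's question
    reference: str  # expert-authored ground-truth response


# Assumed judge signature: (question, reference, candidate) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def mean_judge_score(
    items: List[BenchmarkItem],
    responder: Callable[[str], str],
    judge: Judge,
) -> float:
    """Average judge score of a responder (LLM or human) over the benchmark."""
    scores = [
        judge(item.question, item.reference, responder(item.question))
        for item in items
    ]
    return mean(scores)


def toy_judge(question: str, reference: str, candidate: str) -> float:
    """Placeholder judge: fraction of reference words present in the candidate.

    This stands in for an LLM call. Word overlap is NOT an adequate
    pedagogical metric; it only keeps the sketch self-contained.
    """
    ref_words = set(reference.lower().split())
    cand_words = set(candidate.lower().split())
    if not ref_words:
        return 0.0
    return len(ref_words & cand_words) / len(ref_words)
```

Under these assumptions, comparing a model against educators amounts to calling `mean_judge_score` twice over the same items, once per responder, with the same judge, and comparing the two averages.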
Abstract
The rapid emergence of Large Language Models (LLMs) presents both opportunities and challenges for programming education. While students increasingly use generative AI tools, direct access often hinders the learning process by providing complete solutions rather than pedagogical hints. Concurrently, educators face significant workload and scalability challenges when providing timely, personalized feedback. This study investigates the capabilities of LLMs to safely and effectively assist educators in answering student questions within a CS1 programming course. To achieve this, we established a rigorous, reproducible evaluation process by curating a benchmark dataset of 170 authentic student questions from a learning management system, paired with ground-truth responses authored by subject matter experts. Because traditional text-matching metrics are insufficient for evaluating open-ended…
