When AI Evaluates Its Own Work: Validating Learner-Initiated, AI-Generated Physics Practice Problems
Tobias Geisler, Gerd Kortemeyer

TL;DR
This study explores how AI-generated physics problems can be efficiently validated using automated checks that align with student preferences, enabling scalable, real-time formative assessment.
Contribution
It identifies a minimal set of automated quality attributes that reliably predict student preferences, facilitating practical deployment of AI-generated practice problems.
Findings
A small subset of metrics suffices for reliable student preference prediction.
Automated checks can ensure technical soundness and user appeal without exhaustive scoring.
Scalable formative assessment is feasible with curated core checks.
Abstract
Large language models (LLMs) can now generate physics practice problems in real time, yet the educational value of these items hinges on rapid, reliable post-generation vetting. In this exploratory study, we investigated which automated checks are both technically feasible and pedagogically meaningful when exercises are produced on demand within a chatbot interface. A cohort of 34 introductory-physics students generated and attempted 543 practice problems during exam preparation. Each item was labeled by an expert on a wide range of quality attributes and presented to the learners in pairs to record their preference. We then (i) benchmarked three commodity LLMs as "judges" against the expert labels, (ii) quantified which attributes predict student choice via random-forest models, and (iii) triangulated these results with free-form exit surveys. Only a small subset of the original…
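As an illustration of step (i), benchmarking an LLM judge against expert labels can be reduced to a chance-corrected agreement score for each quality attribute. The sketch below uses Cohen's kappa; this metric is an assumption chosen for illustration, since the abstract does not say which agreement measure the authors used. The function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(expert, judge):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(expert)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    expert_counts = Counter(expert)
    judge_counts = Counter(judge)
    expected = sum(expert_counts[c] * judge_counts[c] for c in expert_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for one binary attribute (1 = attribute satisfied).
expert_labels = [1, 1, 0, 1, 0, 0, 1, 1]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1]
kappa = cohen_kappa(expert_labels, judge_labels)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would suggest the judge can stand in for the expert on that attribute, while a value near 0 means agreement is no better than chance.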
