Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats
Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra

TL;DR
This study evaluates the validity of large language models as automated judges across various physics assessment formats, revealing that their effectiveness depends on the task's criterion-referenceability and the assessment conditions.
Contribution
It demonstrates that LLM-based assessment validity varies significantly with task type and conditions, emphasizing the importance of criterion-referenceability for trustworthy AI grading.
Findings
LLMs perform well on physics questions with official solutions.
Essay marking by LLMs shows poor discriminative validity.
High validity is achieved in code-based plot assessments.
Abstract
As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For blind university exam questions, models achieve fractional mean absolute errors (fMAE) with robust discriminative validity under Spearman rank correlation. For secondary and university structured questions, providing official solutions reduces MAE and strengthens validity, including for the committee aggregation; false solutions degrade absolute accuracy but leave rank ordering…
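The two kinds of metric named in the abstract measure different things: fMAE captures absolute agreement with the human mark, while Spearman rank correlation captures whether scripts are ordered the same way. The sketch below is illustrative only and is not the authors' code. It assumes fMAE is the mean absolute error divided by the maximum available mark, and that a committee mark is the mean of the individual model marks; neither definition is given in the text above, and all function names and marks are hypothetical.

```python
from statistics import mean


def fractional_mae(llm_marks, human_marks, max_mark):
    """MAE as a fraction of the available marks (assumed definition of fMAE)."""
    errors = [abs(l - h) for l, h in zip(llm_marks, human_marks)]
    return mean(errors) / max_mark


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def committee_marks(marks_by_model):
    """Mean mark per script across models (assumed aggregation rule)."""
    return [mean(script) for script in zip(*marks_by_model)]


# Hypothetical marks out of 10 for four scripts.
human = [2, 5, 7, 9]
model_a = [3, 5, 8, 10]
model_b = [1, 5, 6, 10]

committee = committee_marks([model_a, model_b])
print(fractional_mae(committee, human, max_mark=10))
print(spearman_rho(committee, human))
```

A marker that is consistently two marks too generous would have a larger fMAE but an unchanged Spearman correlation, which is why absolute accuracy and rank ordering can move independently.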
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education
