Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Will Yeadon; Tom Hardy; Paul Mackay; Elise Agra

arXiv:2603.14732·physics.ed-ph·March 17, 2026

Open Access

TL;DR

This study evaluates the validity of large language models as automated judges across three physics assessment formats (structured questions, written essays, and scientific plots), finding that their effectiveness depends on the task's criterion-referenceability and on the marking conditions.

Contribution

The paper demonstrates that the validity of LLM-based assessment varies significantly with task type and marking conditions, and it identifies criterion-referenceability as the key requirement for trustworthy AI grading.

Findings

01 LLMs mark structured physics questions more accurately when given official solutions (committee Spearman $\rho = 0.88$).

02 Essay marking by LLMs shows poor discriminative validity.

03 Code-based plot assessments achieve high validity.

Abstract

As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking can be trusted is essential. We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and exemplar-anchored conditions. For $n = 771$ blind university exam questions, models achieve fractional mean absolute errors (fMAE) $\approx 0.22$ with robust discriminative validity (Spearman $\rho > 0.6$). For secondary and university structured questions ($n = 1151$), providing official solutions reduces MAE and strengthens validity (committee $\rho = 0.88$); false solutions degrade absolute accuracy but leave rank ordering…
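The two headline metrics in the abstract can be illustrated with a minimal sketch. The abstract does not define fMAE, so the version below assumes it is the mean of the absolute mark difference divided by the marks available for each question. The committee rule (median across models) is likewise an assumption, not the authors' stated method. All function names and the toy marks are hypothetical and are not data from the paper.

```python
from statistics import mean, median


def fractional_mae(llm_marks, human_marks, max_marks):
    """Assumed definition: mean of |llm - human| / max_marks over all questions."""
    return mean(abs(l - h) / m for l, h, m in zip(llm_marks, human_marks, max_marks))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def committee_mark(marks_by_model):
    """Assumed aggregation rule: median of the marks given by each model."""
    return median(marks_by_model)


# Toy marks for four questions (illustrative only, not from the paper).
human = [2, 5, 3, 8]
llm = [3, 5, 2, 6]
available = [4, 10, 4, 8]

print("fMAE:", fractional_mae(llm, human, available))
print("Spearman rho:", spearman_rho(llm, human))
print("Committee mark:", committee_mark([3, 4, 6]))
```

The two metrics answer different questions: fMAE measures how far LLM marks sit from human marks in absolute terms, while Spearman $\rho$ measures only whether the LLM ranks the answers in the same order as the human marker. This is why a condition such as false solutions can degrade one without necessarily degrading the other.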

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Artificial Intelligence in Healthcare and Education