Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams
Xiuxiu Tang, G. Alex Ambrose, Ying Cheng

TL;DR
This study evaluates GPT-4o's ability to reliably score handwritten physics responses using skill-based rubrics, highlighting the importance of rubric clarity and LLM configuration for consistent AI-assisted grading.
Contribution
It demonstrates that well-structured rubrics and controlled LLM settings can achieve scoring reliability comparable to human raters in STEM assessments.
Findings
AI scoring agreement matches human inter-rater reliability for high- and low-performing responses.
Checklist-based rubrics improve scoring consistency over holistic approaches.
Prompting format influences AI scoring slightly; temperature has limited effect.
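As a rough illustration of the ideas in these findings, the sketch below scores a response against a checklist-style rubric and then compares two sets of scores with exact agreement and Cohen's kappa. It is a minimal sketch under stated assumptions, not the study's procedure: this summary does not say which agreement statistic was used, and the rubric items, point values, and score lists are all hypothetical.

```python
from collections import Counter


def checklist_score(rubric, met_items):
    """Sum the points of every checklist item marked as met.

    `rubric` maps a hypothetical skill item to its point value;
    `met_items` is the set of items a rater judged as demonstrated.
    """
    return sum(points for item, points in rubric.items() if item in met_items)


def exact_agreement(scores_a, scores_b):
    """Fraction of responses on which two raters gave the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohen_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa: agreement corrected for chance."""
    n = len(scores_a)
    p_observed = exact_agreement(scores_a, scores_b)
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    # Chance agreement from each rater's marginal score distribution.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters gave one identical score throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical rubric and scores, for illustration only.
rubric = {
    "identifies_forces": 2,
    "applies_newton_second_law": 2,
    "correct_units": 1,
}
score = checklist_score(rubric, {"identifies_forces", "correct_units"})

human_scores = [3, 5, 0, 2]
ai_scores = [3, 5, 1, 2]
agreement = exact_agreement(human_scores, ai_scores)
kappa = cohen_kappa(human_scores, ai_scores)
```

Because each checklist item is judged separately, two raters can only disagree item by item, which is one plausible reason such rubrics would be scored more consistently than holistic ones. For ordinal partial-credit scores, a weighted kappa or an intraclass correlation may be more appropriate than the unweighted version shown here.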
Abstract
Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity.…
