Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness
Haotian Deng, Chris Farber, Jiyoon Lee, David Tang

TL;DR
This paper evaluates the reliability of large language models as automated judges for rubric-based short-answer grading, focusing on alignment with experts, uncertainty management, and robustness to input variations.
Contribution
It introduces a systematic assessment of LLM-based grading, analyzing alignment, uncertainty trade-offs, and robustness, revealing strengths and limitations for educational assessment.
Findings
Strong alignment for binary grading tasks
Filtering low-confidence predictions improves accuracy
The model is robust to prompt injection but sensitive to synonym substitution
Abstract
Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity.…
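The consensus-based deferral mechanism mentioned in the abstract can be illustrated with a minimal sketch. The excerpt here does not describe the paper's actual procedure, so everything below is an assumption: the judge is sampled several times per answer, the majority label is accepted only when the share of agreeing samples reaches a threshold, and otherwise the answer is deferred (for example, to a human grader). The names `consensus_grade`, `selective_accuracy`, and `min_agreement` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def consensus_grade(
    votes: Sequence[str], min_agreement: float = 0.8
) -> Tuple[Optional[str], float]:
    """Majority-vote over repeated judge outputs for one answer.

    Returns (label, agreement). The label is None when the agreement
    falls below `min_agreement`, meaning the answer is deferred.
    This is an illustrative rule, not the paper's confirmed method.
    """
    if not votes:
        raise ValueError("votes must be non-empty")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement  # low consensus: defer
    return label, agreement


def selective_accuracy(
    samples: Sequence[Sequence[str]],
    gold: Sequence[str],
    min_agreement: float,
) -> Tuple[float, float]:
    """Coverage and accuracy on the answers that were not deferred."""
    kept = 0
    correct = 0
    for votes, expected in zip(samples, gold):
        label, _ = consensus_grade(votes, min_agreement)
        if label is None:
            continue  # deferred answers do not count toward accuracy
        kept += 1
        correct += int(label == expected)
    coverage = kept / len(gold) if gold else 0.0
    accuracy = correct / kept if kept else float("nan")
    return coverage, accuracy


if __name__ == "__main__":
    # Toy data: three answers, three sampled grades each (made up).
    samples: List[List[str]] = [
        ["correct", "correct", "correct"],
        ["correct", "incorrect", "correct"],
        ["incorrect", "incorrect", "incorrect"],
    ]
    gold = ["correct", "incorrect", "incorrect"]
    for threshold in (0.5, 1.0):
        print(threshold, selective_accuracy(samples, gold, threshold))
```

Sweeping `min_agreement` from low to high traces the trade-off the abstract refers to: a stricter threshold defers more answers (lower coverage), and the accuracy measured on the remaining answers can be compared across thresholds.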
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Hate Speech and Cyberbullying Detection
