Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Haotian Deng; Chris Farber; Jiyoon Lee; David Tang

arXiv:2601.08843 · cs.CL · January 15, 2026

TL;DR

This paper evaluates the reliability of large language models as automated judges for rubric-based short-answer grading, focusing on alignment with experts, uncertainty management, and robustness to input variations.

Contribution

The paper presents a systematic assessment of LLM-based grading that analyzes alignment with expert judgment, the trade-off between uncertainty and accuracy, and robustness to input perturbations, revealing strengths and limitations for educational assessment.

Findings

01 Strong alignment for binary grading tasks

02 Filtering low-confidence predictions improves accuracy

03 Model is robust to prompt injection but sensitive to synonyms

Abstract

Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity.…
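
The abstract names a consensus-based deferral mechanism but the excerpt here does not describe how it works. As a rough illustration of the general idea only, the sketch below assumes that several independent grades are sampled per answer, that the majority label is taken as the prediction, and that an answer is deferred to a human when the agreement rate falls below a threshold. The function names, the `min_agreement` parameter, and the agreement rule are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def consensus_grade(
    votes: Sequence[str], min_agreement: float = 0.8
) -> Tuple[Optional[str], float]:
    """Majority-vote a list of sampled grades for one answer.

    Returns (label, agreement). The label is None when the answer is
    deferred, i.e. when the share of votes for the majority label is
    below min_agreement. This rule is an assumption, not the paper's.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement < min_agreement:
        return None, agreement  # low consensus: defer to a human grader
    return label, agreement


def accuracy_and_coverage(
    all_votes: Sequence[Sequence[str]],
    gold: Sequence[str],
    min_agreement: float,
) -> Tuple[float, float]:
    """Accuracy on the non-deferred answers and the fraction graded."""
    graded = 0
    correct = 0
    for votes, truth in zip(all_votes, gold):
        label, _ = consensus_grade(votes, min_agreement)
        if label is None:
            continue  # deferred answers do not count toward accuracy
        graded += 1
        correct += int(label == truth)
    coverage = graded / len(gold) if gold else 0.0
    accuracy = correct / graded if graded else float("nan")
    return accuracy, coverage


if __name__ == "__main__":
    # Toy data invented for illustration; not results from the paper.
    toy_votes: List[List[str]] = [
        ["correct"] * 5,
        ["correct", "correct", "correct", "incorrect", "incorrect"],
        ["incorrect"] * 4 + ["correct"],
    ]
    toy_gold = ["correct", "incorrect", "incorrect"]
    for threshold in (0.0, 0.8):
        acc, cov = accuracy_and_coverage(toy_votes, toy_gold, threshold)
        print(f"min_agreement={threshold}: accuracy={acc:.2f}, coverage={cov:.2f}")
```

Raising `min_agreement` trades coverage for accuracy: fewer answers are graded automatically, and those that remain are the ones on which the sampled grades agree most. Sweeping the threshold over a labeled set is one simple way to chart that trade-off.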

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Hate Speech and Cyberbullying Detection