How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

Hang Li; Kaiqi Yang; Xianxuan Long; Fedor Filippov; Yucheng Chu; Yasemin Copur-Gencturk; Peng He; Cory Miller; Namsoo Shin; Joseph Krajcik; Hui Liu; Jiliang Tang

arXiv:2602.16039 · cs.AI · February 19, 2026



TL;DR

This paper benchmarks various uncertainty metrics for large language models in automatic educational assessment, highlighting their behaviors, strengths, and limitations to improve grading reliability.

Contribution

It systematically evaluates uncertainty quantification methods in LLM-based grading, providing insights into their applicability and guiding future development of reliable assessment systems.

Findings

01. Uncertainty behaviors vary across datasets and models.

02. Certain metrics show better calibration for grading tasks.

03. Model choice and decoding strategies significantly impact uncertainty estimates.
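
To make "calibration" in the findings above concrete: one widely used measure is expected calibration error (ECE), which bins graded answers by the grader's stated confidence and compares the average confidence in each bin with the fraction of grades that were actually correct. The sketch below is illustrative only. The summary does not say which calibration measures the paper uses, so the function name, the equal-width binning, and the input format (one confidence in [0, 1] and one correctness flag per graded answer) are all assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE (hypothetical helper, not taken from the paper).

    confidences: the grader's confidence in each assigned grade, in [0, 1].
    correct: whether each assigned grade matched the reference grade.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one graded answer is required")

    # Group (confidence, correctness) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Made-up toy values for illustration, not results from the paper.
    toy_conf = [0.9, 0.9, 0.6, 0.6]
    toy_correct = [True, True, True, False]
    print(expected_calibration_error(toy_conf, toy_correct))
```

A lower value means the stated confidence tracks grading accuracy more closely. The result depends on the number of bins, so comparisons across models or decoding settings are only meaningful when the same `n_bins` is used throughout.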

Abstract

The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students' learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark…


Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Student Assessment and Feedback