When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Robinson Ferrer, Damla Turgut, Zhongzhou Chen, and Shashank Sonkar

arXiv:2603.29559 · cs.CL · April 1, 2026

TL;DR

This paper compares three methods for estimating how confident an LLM grader should be in its own output. Self-reported confidence is the best calibrated, and larger models are more reliable, which supports selective automation: high-confidence grades are accepted automatically and uncertain ones go to human review.

Contribution

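The paper systematically compares three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs and three educational datasets, and shows that self-reported confidence is the most effective basis for reliable automated assessment.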

Findings

01

Self-reported confidence achieves the best calibration (average expected calibration error, ECE, of 0.166).

02

Larger models show substantially better calibration and discrimination.

03

Confidence estimates are strongly top-skewed (concentrated at high values), which affects how thresholds should be set.
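The calibration numbers above are expected calibration error (ECE) values. As a rough illustration of what that metric measures, here is a minimal sketch that bins predictions by confidence and takes the weighted average gap between mean confidence and accuracy in each bin. The function name, the equal-width binning, and the choice of 10 bins are assumptions made for this example; the page does not state which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |confidence - accuracy| over equal-width bins.

    confidences: floats in [0, 1], one per graded response.
    correct: booleans, True where the LLM grade matched the reference grade.
    n_bins: number of equal-width bins (an assumption of this sketch).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical data: four grades, all reported at 0.9 confidence.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

A lower value means reported confidence tracks observed accuracy more closely. When most confidences fall in the top bins, a few bins dominate the weighted sum, so the bin count can noticeably change the result.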

Abstract

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains…
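The selective automation idea in the abstract can be sketched as a simple routing step: grades whose confidence meets a threshold are accepted automatically, and the rest are queued for a human. The names `GradedItem`, `route_by_confidence`, and `coverage`, and the default threshold of 0.9, are hypothetical choices for this illustration and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass
class GradedItem:
    item_id: str
    predicted_label: str
    confidence: float  # in [0, 1], from any confidence estimation method


def route_by_confidence(items, threshold=0.9):
    """Split items into (auto_accepted, needs_review) by confidence.

    threshold is an illustrative default, not a value from the paper.
    """
    auto_accepted = [it for it in items if it.confidence >= threshold]
    needs_review = [it for it in items if it.confidence < threshold]
    return auto_accepted, needs_review


def coverage(auto_accepted, needs_review):
    """Fraction of items handled without human review."""
    total = len(auto_accepted) + len(needs_review)
    return len(auto_accepted) / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical graded responses.
    items = [
        GradedItem("r1", "correct", 0.95),
        GradedItem("r2", "incorrect", 0.60),
        GradedItem("r3", "correct", 0.90),
    ]
    auto, review = route_by_confidence(items)
    print([it.item_id for it in auto], [it.item_id for it in review])
    print(coverage(auto, review))
```

In practice the threshold would be tuned on held-out graded data, trading coverage against the accuracy of the automatically accepted grades.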

Code & Models

Repositories

sonkar-lab/llm_grading_calibration (GitHub)
