TL;DR
This paper evaluates three methods for estimating how confident an LLM is in its automated grading decisions. Self-reported confidence gives the best calibration, and larger models are better calibrated, which supports selective automation: grade high-confidence cases automatically and send the rest to a human.
Contribution
It systematically compares three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs and three educational datasets, and shows that self-reported confidence is the most effective for reliable automated assessment.
Findings
Self-reported confidence achieves the best calibration (avg ECE 0.166).
Larger models show substantially better calibration and discrimination.
Confidence estimates are strongly skewed toward high values, which affects how thresholds should be set.
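The ECE figure above refers to expected calibration error. The summary does not say how the paper bins its confidence scores, so the following is only a minimal sketch of the common equal-width-bin formulation; the function name, the `n_bins=10` default, and the input layout are assumptions made for illustration, not the authors' implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE (illustrative; binning scheme is an assumption).

    confidences: floats in [0, 1], one per graded answer.
    correct:     booleans, True where the LLM grade matched the reference.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A lower value means stated confidence tracks observed accuracy more closely. For example, four predictions at confidence 0.9 with three correct give a gap of |0.75 - 0.9| = 0.15.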
Abstract
Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be correct. This enables selective automation where high-confidence predictions are processed automatically while uncertain cases are flagged for human review. We compare three confidence estimation methods (self-reported confidence, self-consistency voting, and token probability) across seven LLMs of varying scale (4B to 120B parameters) on three educational datasets: RiceChem (long-answer chemistry), SciEntsBank, and Beetle (short-answer science). Our experiments reveal that self-reported confidence consistently achieves the best calibration across all conditions (avg ECE 0.166 vs 0.229 for self-consistency). Surprisingly, self-consistency remains…
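The selective automation idea described in the abstract can be sketched as a simple confidence threshold. This is an illustration under stated assumptions rather than the paper's pipeline: the record layout (`confidence`, `correct`), the function names, and the 0.9 threshold are all hypothetical. Because confidence estimates are reported to be skewed toward high values, any real threshold would likely need tuning on held-out data.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split graded items into auto-accepted and human-review queues.

    predictions: list of dicts with a 'confidence' float in [0, 1]
                 (hypothetical record layout).
    threshold:   hypothetical cut-off; not a value taken from the paper.
    """
    auto, review = [], []
    for item in predictions:
        if item["confidence"] >= threshold:
            auto.append(item)
        else:
            review.append(item)
    return auto, review


def coverage_and_accuracy(auto, total_count):
    """Fraction of items automated, and accuracy within that automated subset.

    Each item in `auto` needs a boolean 'correct' field (available only on
    labelled evaluation data).
    """
    coverage = len(auto) / total_count if total_count else 0.0
    accuracy = sum(1 for item in auto if item["correct"]) / len(auto) if auto else None
    return coverage, accuracy


if __name__ == "__main__":
    graded = [
        {"id": 1, "confidence": 0.95, "correct": True},
        {"id": 2, "confidence": 0.92, "correct": False},
        {"id": 3, "confidence": 0.60, "correct": False},
        {"id": 4, "confidence": 0.99, "correct": True},
    ]
    auto, review = route_by_confidence(graded, threshold=0.9)
    print(coverage_and_accuracy(auto, len(graded)))
```

Sweeping the threshold and plotting coverage against accuracy is one way to see the trade-off between how much grading is automated and how reliable the automated portion is.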
