Confidence Estimation in Automatic Short Answer Grading with LLMs
Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne

TL;DR
This paper explores confidence estimation methods for LLM-based automatic short answer grading, proposing a hybrid approach that combines model signals with dataset-derived uncertainty to improve reliability.
Contribution
It introduces a hybrid confidence framework that integrates model-based signals with dataset-derived uncertainty, after showing that model-based confidence alone does not reliably capture uncertainty in LLM-based grading.
Findings
The hybrid confidence measure is more reliable than single-source methods.
Clustering semantically embedded responses quantifies response heterogeneity.
The proposed approach enhances trustworthiness in AI-assisted educational assessment.
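To make the clustering finding concrete, here is a minimal sketch of one way to turn cluster assignments into a heterogeneity score. It assumes the responses have already been embedded and clustered by some method; the function name `cluster_heterogeneity` and the choice of normalized Shannon entropy are illustrative assumptions, not the measure defined in the paper.

```python
import math
from collections import Counter


def cluster_heterogeneity(cluster_labels):
    """Normalized Shannon entropy of cluster sizes, in [0, 1].

    Hypothetical measure: 0 means every response fell into one cluster,
    1 means responses are spread evenly across the observed clusters.
    """
    n = len(cluster_labels)
    if n == 0:
        raise ValueError("need at least one cluster label")
    counts = Counter(cluster_labels)
    if len(counts) == 1:
        return 0.0  # a single cluster carries no heterogeneity
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # Divide by the maximum possible entropy for this number of clusters.
    return entropy / math.log(len(counts))


# Example: labels produced by any clustering of embedded student responses.
print(cluster_heterogeneity([0, 0, 0, 1]))
```

Under this assumption, a higher score would flag items whose responses are more varied, where a grader's confidence might reasonably be discounted.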
Abstract
Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid…
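As a rough illustration of two of the model-based strategies named in the abstract, the sketch below computes a consistency-based confidence (agreement among repeatedly sampled grades) and a latent confidence (normalized probability of the chosen label). The function names, the label strings, and the assumption that per-label log-probabilities are available from the LLM are all hypothetical; the paper's exact formulations are not given in this summary.

```python
import math
from collections import Counter


def consistency_confidence(sampled_grades):
    """Return (majority_grade, share of samples that agree with it)."""
    if not sampled_grades:
        raise ValueError("need at least one sampled grade")
    counts = Counter(sampled_grades)
    grade, votes = counts.most_common(1)[0]
    return grade, votes / len(sampled_grades)


def latent_confidence(label_logprobs):
    """Softmax over candidate-label log-probabilities.

    Returns (top_label, normalized probability of that label).
    """
    if not label_logprobs:
        raise ValueError("need at least one label")
    m = max(label_logprobs.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - m) for k, v in label_logprobs.items()}
    total = sum(weights.values())
    top = max(weights, key=weights.get)
    return top, weights[top] / total


# Hypothetical inputs: four sampled grades and log-probs for two labels.
print(consistency_confidence(["correct", "correct", "incorrect", "correct"]))
print(latent_confidence({"correct": math.log(0.6), "incorrect": math.log(0.2)}))
```

Verbalized confidence, the third strategy, would instead ask the model to state its own confidence in the prompt, so it needs no post-processing beyond parsing the reply.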
