Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment
Tyler Burleigh

TL;DR
This paper investigates whether a small language model's verbalized confidence can decide which scoring tasks to escalate to a larger model, balancing accuracy against cost and latency in automated educational scoring.
Contribution
It demonstrates that confidence discrimination (how well stated confidence separates correct from incorrect predictions) varies across small LMs, and that an effective confidence-based cascade can nearly match large-LM accuracy at reduced cost and latency.
Findings
Confidence discrimination varies widely among small LMs.
Lower LM confidence tracks human scoring difficulty: confidence is lower where annotators disagreed and took longer to score.
Effective cascades nearly match large LM accuracy with significant cost and latency savings.
Abstract
Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade…
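As an illustration of the routing idea described in the abstract, here is a minimal Python sketch, not the paper's implementation. It assumes each LM is wrapped as a callable returning a `(score, confidence)` pair with confidence in [0, 1]; the names `cascade_score`, `auroc`, `small_lm`, `large_lm`, and the 0.8 threshold are all hypothetical. The `auroc` helper computes the standard pairwise definition of AUROC, treating "the small LM's score was correct" as the positive class.

```python
from typing import Any, Callable, Sequence, Tuple

# Hypothetical interface: an LM wrapper returns (score, verbalized confidence in [0, 1]).
Scorer = Callable[[Any], Tuple[int, float]]


def cascade_score(item: Any, small_lm: Scorer, large_lm: Scorer,
                  threshold: float = 0.8) -> Tuple[int, str]:
    """Score with the small LM; escalate to the large LM when confidence is low.

    The threshold value is an assumption for illustration, not a value from the paper.
    """
    score, confidence = small_lm(item)
    if confidence >= threshold:
        return score, "small"
    # Low confidence: pay the extra cost and latency of the large LM.
    score, _ = large_lm(item)
    return score, "large"


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct prediction gets higher confidence than an
    incorrect one, counting ties as one half (pairwise AUROC definition)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect prediction")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A confidence signal that is nearly constant across items gives an AUROC near 0.5 under this definition, so no threshold can separate the cases worth escalating. In practice the threshold would be tuned on held-out data to trade accuracy against the fraction of items escalated.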
