Do Small Language Models Know When They're Wrong? Confidence-Based Cascade Scoring for Educational Assessment

Tyler Burleigh

arXiv:2604.19781·cs.CY·April 23, 2026

TL;DR

This paper investigates whether small language models can use verbalized confidence to decide when to escalate hard scoring cases to larger models, trading off accuracy against cost and latency in automated educational scoring.

Contribution

It demonstrates that confidence discrimination varies widely across small LMs, and that a cascade routed on a well-discriminating confidence signal can nearly match large-LM accuracy at reduced cost and latency.

Findings

01 Confidence discrimination varies widely among small LMs.

02 Lower LM confidence correlates with higher scoring difficulty.

03 Effective cascades nearly match large LM accuracy with significant cost and latency savings.

Abstract

Automated scoring of student work at scale requires balancing accuracy against cost and latency. In "cascade" systems, small language models (LMs) handle easier scoring tasks while escalating harder ones to larger LMs -- but the challenge is determining which cases to escalate. We explore verbalized confidence -- asking the LM to state a numerical confidence alongside its prediction -- as a routing signal. Using 2,100 expert-scored decisions from student-AI math conversations, we evaluate cascade systems built from GPT-5.4, Claude 4.5+, and Gemini 3.1 model pairs. We find that: (1) confidence discrimination varies widely across small LMs, with the best achieving AUROC 0.857 and the worst producing a near-degenerate confidence distribution; (2) confidence tracks human scoring difficulty, with lower LM confidence where annotators disagreed and took longer to score; (3) the best cascade…
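
The routing idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `(score, confidence)` return shape, the confidence scale of 0 to 1, the threshold of 0.8, and the toy stub models are all assumptions made for illustration. The first function escalates to the large LM when the small LM's verbalized confidence falls below a threshold; the second computes AUROC for how well confidence separates correct from incorrect small-LM predictions, using the standard pairwise (Mann-Whitney) formulation.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interface: a scorer takes an item and returns
# (predicted score, verbalized confidence assumed to lie in [0, 1]).
Scorer = Callable[[str], Tuple[int, float]]


def cascade_score(item: str, small_lm: Scorer, large_lm: Scorer,
                  threshold: float = 0.8) -> Tuple[int, bool]:
    """Return (score, escalated). The threshold value is an assumption."""
    score, confidence = small_lm(item)
    if confidence >= threshold:
        return score, False          # small LM is confident: keep its score
    large_score, _ = large_lm(item)  # otherwise escalate to the large LM
    return large_score, True


def confidence_auroc(confidences: Sequence[float],
                     correct: Sequence[bool]) -> float:
    """AUROC of confidence as a predictor of correctness.

    Equals the probability that a randomly chosen correct prediction has
    higher confidence than a randomly chosen incorrect one (ties count 0.5).
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-ins for real model calls (purely illustrative).
def toy_small_lm(item: str) -> Tuple[int, float]:
    return (1, 0.9) if item == "easy" else (0, 0.4)


def toy_large_lm(item: str) -> Tuple[int, float]:
    return (1, 1.0)
```

Under these assumptions, an AUROC near 0.5 means confidence carries little routing information, and a near-degenerate confidence distribution (almost every case given the same value) leaves a threshold with nothing to separate. Raising the threshold escalates more cases, trading cost and latency for accuracy.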
