Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

Jon-Paul Cacioli

arXiv:2604.22215 · cs.CL · April 27, 2026

TL;DR

This study tests whether open-weight instruction-tuned large language models (3-9B parameters) produce valid verbal confidence estimates. It finds that, under minimal elicitation, verbal confidence does not reliably reflect internal uncertainty signals.

Contribution

It provides a pre-registered psychometric validation showing that current instruction-tuned models do not produce valid verbal confidence estimates under minimal elicitation.

Findings

01

All seven instruction-tuned models were classified Invalid on numeric confidence under the psychometric validity screen.

02

Categorical confidence elicitation did not rescue validity; instead, it disrupted task performance.

03

Token log-probability did not predict verbal confidence effectively.

Abstract

Verbal confidence elicitation is widely used to extract uncertainty estimates from LLMs. We tested whether seven instruction-tuned open-weight models (3-9B parameters, four families) produce verbalised confidence that meets minimal validity criteria for item-level Type-2 discrimination under minimal numeric elicitation with greedy decoding. In a pre-registered study (OSF: osf.io/azbvx), 524 TriviaQA items were administered under numeric (0-100) and categorical (10-class) elicitation to eight models at Q5_K_M quantisation on consumer hardware, yielding 8,384 deterministic trials. A psychometric validity screen was applied to each model-format cell. All seven instruct models were classified Invalid on numeric confidence (H2 confirmed, 7/7 vs. predicted >=4/7), with a mean ceiling rate of 91.7% (H1 confirmed). Categorical elicitation did not rescue validity. Instead, it disrupted task…
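The quantities named in the abstract can be made concrete with a small sketch. The trial count follows directly from the stated design (524 items, 2 elicitation formats, 8 models). The other two functions are illustrative assumptions, not the paper's screen: `ceiling_rate` treats a response of 100 as "at ceiling", and `type2_auroc` uses the area under the ROC curve as one common measure of item-level Type-2 discrimination (how well confidence separates correct from incorrect answers). The function names and the toy data are hypothetical.

```python
from typing import Sequence


def trial_count(n_items: int, n_formats: int, n_models: int) -> int:
    """Total deterministic trials in a fully crossed design."""
    return n_items * n_formats * n_models


def ceiling_rate(confidences: Sequence[float], ceiling: float = 100.0) -> float:
    """Share of responses at the top of the scale.

    Assumption: 'ceiling' means a reported confidence of 100 on a 0-100 scale.
    """
    if not confidences:
        raise ValueError("need at least one confidence value")
    return sum(c >= ceiling for c in confidences) / len(confidences)


def type2_auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item gets higher confidence than an incorrect one.

    Ties count as half a win, so fully saturated confidence yields 0.5 (chance).
    Assumption: AUROC stands in for the paper's own discrimination criterion.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Design arithmetic from the abstract: 524 items x 2 formats x 8 models.
print(trial_count(524, 2, 8))  # 8384

# Toy, made-up responses: nine of ten answers sit at the ceiling.
toy_conf = [100, 100, 100, 100, 100, 100, 100, 100, 100, 90]
toy_correct = [True, True, False, True, False, True, True, False, True, False]
print(ceiling_rate(toy_conf))              # 0.9
print(type2_auroc(toy_conf, toy_correct))  # 0.625
```

The toy data shows why saturation matters: when nearly every response is 100, most correct/incorrect pairs are tied, so the discrimination score is pulled toward the chance value of 0.5 regardless of accuracy.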

Code & Models

Datasets

synthiumjp/verbal-confidence-saturation
dataset· 1.3k dl
