Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B
Jon-Paul Cacioli

TL;DR
This study tests whether fine-tuning a small language model (Gemma 3 4B-it) on self-consistency-derived confidence targets improves its verbal confidence. The pre-registered protocol produced a negative result; an exploratory rescue that removed the modal filter yielded a binary verbal correctness discriminator.
Contribution
It reports a pre-registered negative result for confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets, and an exploratory post-hoc rescue that turns verbal confidence into a binary correctness discriminator.
Findings
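Negative result: under the pre-registered protocol with the modal filter, AUROC2 dropped from 0.554 to 0.509, attributed to label-entropy collapse in the training targets.
Exploratory rescue: training on all 2,000 calibration items without the filter raised AUROC2 to 0.774 on held-out TriviaQA, above logit entropy (0.701).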
Accuracy on MMLU increased from 54.2% to 77.4% with the method.
Abstract
Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701).…
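The quantities named in the abstract can be illustrated with a small sketch. It is not the paper's pipeline: the toy items, the 0.7 agreement threshold, and the "high"/"low" label names are all assumptions made for illustration. It shows (1) a self-consistency score computed as the fraction of sampled answers that agree with the modal answer, (2) Type-2 AUROC as the probability that a correct item receives higher confidence than an incorrect one, and (3) how restricting to items with a correct modal answer can drive the entropy of the training labels toward zero.

```python
from collections import Counter
from math import log2


def self_consistency(samples):
    """Return (modal_answer, agreement), where agreement is the share of samples equal to the mode."""
    modal, count = Counter(samples).most_common(1)[0]
    return modal, count / len(samples)


def type2_auroc(confidences, correct):
    """Pairwise AUROC: P(confidence of a correct item > confidence of an incorrect item), ties count 0.5."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def label_entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    total = len(labels)
    return abs(sum((k / total) * log2(k / total) for k in Counter(labels).values()))


# Hypothetical toy data: (10 sampled answers, gold answer) per item.
items = [
    (["paris"] * 10, "paris"),
    (["canberra"] * 9 + ["sydney"], "canberra"),
    (["rome"] * 4 + ["milan"] * 3 + ["turin"] * 3, "milan"),
    (["oslo"] * 6 + ["bergen"] * 4, "bergen"),
    (["lima"] * 8 + ["cusco"] * 2, "lima"),
]

THRESHOLD = 0.7  # assumed cut-off for a binary confidence label, not taken from the paper

agreements, modal_correct, labels = [], [], []
for samples, gold in items:
    modal, agreement = self_consistency(samples)
    agreements.append(agreement)
    modal_correct.append(modal == gold)
    labels.append("high" if agreement >= THRESHOLD else "low")

# Discrimination of the self-consistency signal itself.
auroc2 = type2_auroc(agreements, modal_correct)

# Label entropy with and without a filter that keeps only items whose modal answer is correct.
entropy_all = label_entropy(labels)
entropy_filtered = label_entropy([lab for lab, ok in zip(labels, modal_correct) if ok])

print(f"AUROC2 of agreement: {auroc2:.3f}")
print(f"label entropy, all items: {entropy_all:.3f} bits")
print(f"label entropy, modal-correct only: {entropy_filtered:.3f} bits")
```

In this toy set, every item that survives the filter carries the same label, so the filtered targets give a fine-tuned model nothing to discriminate on. That is the intuition behind label-entropy collapse, though the actual targets and filter in the paper may be defined differently.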
