Distilling Self-Consistency into Verbal Confidence: A Pre-Registered Negative Result and Post-Hoc Rescue on Gemma 3 4B

Jon-Paul Cacioli

arXiv:2604.24070·cs.CL·April 28, 2026

TL;DR

This study tests whether fine-tuning a small instruct-tuned language model (Gemma 3 4B-it) on self-consistency-derived targets improves its verbal confidence. The pre-registered protocol produced a negative result; an exploratory rescue that removed the modal filter produced a binary verbal correctness discriminator.

Contribution

It reports a pre-registered negative result for confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets, and an exploratory post-hoc rescue that yields a binary verbal correctness discriminator.

Findings

01

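Negative result: under the pre-registered Phase 0 protocol with the modal filter, AUROC2 dropped from 0.554 to 0.509, attributed to label-entropy collapse in the training targets.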

02

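Exploratory rescue: removing the modal filter and training on all 2,000 calibration items gave AUROC2 = 0.774 on held-out TriviaQA, above logit entropy (0.701).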

03

Accuracy on MMLU increased from 54.2% to 77.4% with the method.

Abstract

Small instruct-tuned LLMs produce degenerate verbal confidence under minimal elicitation: ceiling rates above 95%, near-chance Type-2 AUROC, and Invalid validity profiles. We test whether confidence-conditioned supervised fine-tuning (CSFT) with self-consistency-derived targets can close the gap between internal information and verbal readout. A pre-registered Phase 0 protocol on Gemma 3 4B-it with a modal filter restricting training to items with correct modal answers produced a negative result: AUROC2 dropped from 0.554 to 0.509 due to label-entropy collapse in the training targets. An exploratory rescue removed the filter, training on all 2,000 calibration items. This produced a binary verbal correctness discriminator with AUROC2 = 0.774 on held-out TriviaQA, compressing a 10-sample self-consistency signal (AUROC2 = 0.999) into a single-pass readout exceeding logit entropy (0.701).…
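
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes that the self-consistency signal is the fraction of sampled answers that agree with the modal answer, that Type-2 AUROC is the probability that a correct item receives higher confidence than an incorrect one, and that training targets are binarized at a hypothetical agreement threshold of 0.7. The toy items, the threshold, and all function names are invented for illustration.

```python
import math
from collections import Counter


def self_consistency(samples):
    """Return (modal_answer, agreement) for a list of sampled answers."""
    modal, count = Counter(samples).most_common(1)[0]
    return modal, count / len(samples)


def type2_auroc(confidences, correct):
    """P(confidence on a correct item > confidence on an incorrect item); ties count 0.5."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def label_entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    total = len(labels)
    return -sum((k / total) * math.log2(k / total) for k in Counter(labels).values())


# Toy items (hypothetical): 10 sampled answers each, plus a gold answer.
items = [
    (["paris"] * 10, "paris"),
    (["1912"] * 8 + ["1913"] * 2, "1912"),
    (["oslo"] * 4 + ["bergen"] * 3 + ["tromso"] * 3, "bergen"),
    (["mars"] * 6 + ["venus"] * 4, "venus"),
]

rows = []
for samples, gold in items:
    modal, agreement = self_consistency(samples)
    label = "high" if agreement >= 0.7 else "low"  # assumed binarization threshold
    rows.append((agreement, modal == gold, label))

all_labels = [label for _, _, label in rows]
# Modal filter: keep only items whose modal answer is correct.
filtered_labels = [label for _, ok, label in rows if ok]

print("AUROC2 of agreement:", type2_auroc([a for a, _, _ in rows], [ok for _, ok, _ in rows]))
print("label entropy, all items:", label_entropy(all_labels))
print("label entropy, modal-filtered:", label_entropy(filtered_labels))
```

On this toy data the filtered label set contains a single class, which is one way a modal filter could leave fine-tuning with almost no contrast to learn from. Whether the paper's label-entropy collapse arises in exactly this way is not stated in the abstract.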
