Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation
Joe Stacey, Hadas Orgad, Kentaro Inui, Benjamin Heinzerling, Nafise Sadat Moosavi

TL;DR
This paper systematically evaluates supervised uncertainty probes in large language models, revealing poor robustness, especially under distribution shift, and suggesting strategies for improvement.
Contribution
It provides a comprehensive analysis of probe robustness across models and tasks, highlighting the importance of probe input design and layer selection.
Findings
Middle-layer representations generalize more reliably than final-layer states.
Aggregating across response tokens improves robustness over single-token features.
Current methods show poor robustness, especially in long-form generation scenarios.
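The first two findings concern how probe inputs are built, which can be illustrated with a small sketch. The code below is not the authors' implementation: the function name `probe_features`, the choice of `num_layers // 2` as the "middle" layer, and mean pooling as the aggregation strategy are all assumptions made for illustration. It takes a stack of per-layer hidden states and a mask marking response tokens, and returns a single feature vector that a supervised probe (for example, a linear classifier) could be trained on.

```python
import numpy as np


def probe_features(hidden_states, response_mask, layer=None, aggregate="mean"):
    """Build one probe input vector from per-layer hidden states.

    hidden_states: array of shape (num_layers, num_tokens, hidden_dim).
    response_mask: boolean array of shape (num_tokens,), True for response tokens.
    layer: layer index to read from; defaults to the middle layer
           (num_layers // 2, an assumption made for this sketch).
    aggregate: "mean" pools over all response tokens,
               "last" uses only the final response token.
    """
    hidden_states = np.asarray(hidden_states)
    response_mask = np.asarray(response_mask, dtype=bool)
    if not response_mask.any():
        raise ValueError("response_mask selects no tokens")

    num_layers = hidden_states.shape[0]
    if layer is None:
        layer = num_layers // 2  # hypothetical definition of "middle layer"

    # Keep only the hidden states of response tokens at the chosen layer.
    response_states = hidden_states[layer][response_mask]

    if aggregate == "mean":
        return response_states.mean(axis=0)
    if aggregate == "last":
        return response_states[-1]
    raise ValueError(f"unknown aggregate: {aggregate!r}")


# Placeholder activations standing in for real model outputs (random values only).
rng = np.random.default_rng(0)
example_states = rng.normal(size=(12, 7, 16))      # 12 layers, 7 tokens, 16 dims
example_mask = np.array([False] * 3 + [True] * 4)  # last 4 tokens are the response

pooled_middle = probe_features(example_states, example_mask)
last_token_final = probe_features(
    example_states, example_mask, layer=-1, aggregate="last"
)
```

Here `pooled_middle` corresponds to the configuration the findings favor (middle layer, aggregated over response tokens), while `last_token_final` corresponds to the single-token, final-layer baseline. Comparing probes trained on each, on in-distribution and shifted data, is the kind of evaluation the paper describes.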
Abstract
Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating…
