TL;DR
This paper systematically studies confidence estimation in multi-turn LLM interactions. It introduces new metrics and an evaluation paradigm for calibration and monotonicity, shows that existing confidence techniques struggle in this setting, and proposes a promising logit-based probe.
Contribution
The paper presents what it describes as the first systematic study of confidence estimation in multi-turn conversations. It contributes a formal evaluation framework, novel metrics, and a new probe method aimed at better calibration and evidence tracking.
Findings
Widely used confidence techniques struggle with calibration in multi-turn dialogues.
The proposed P(Sufficient) probe effectively tracks evidence accumulation.
New metrics such as InfoECE, a length-normalized Expected Calibration Error, provide a better evaluation of confidence calibration in multi-turn settings.
Abstract
While confidence estimation is a promising direction for mitigating hallucinations in Large Language Models (LLMs), current research overwhelmingly focuses on single-turn settings. The dynamics of model confidence in multi-turn conversations, where context accumulates and ambiguity is progressively resolved, remain largely unexplored. This work presents the first systematic study of confidence estimation in multi-turn interactions, establishing a formal evaluation framework grounded in two key desiderata: per-turn calibration and monotonicity of confidence as more information becomes available. To facilitate this, we introduce novel metrics, including a length-normalized Expected Calibration Error (InfoECE), and a new "Hinter-Guesser" paradigm for generating controlled evaluation datasets. Our experiments reveal that widely-used confidence techniques struggle with calibration and…
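To make the two desiderata concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not define InfoECE, so the first function computes only the standard equal-width binned Expected Calibration Error, and the second is a simple illustrative proxy for monotonicity (the share of consecutive turns in which confidence drops). The function names, the bin count, and the violation-rate definition are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Standard equal-width binned ECE (NOT the paper's InfoECE)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Assign each prediction to a confidence bin in [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, float(is_correct)))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def monotonicity_violation_rate(per_turn_confidence: Sequence[float]) -> float:
    """Hypothetical proxy: fraction of consecutive turns where confidence drops."""
    pairs = list(zip(per_turn_confidence, per_turn_confidence[1:]))
    if not pairs:
        return 0.0
    return sum(1 for prev, curr in pairs if curr < prev) / len(pairs)


if __name__ == "__main__":
    # Per-turn calibration: pool the predictions made at one turn index.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
    # Monotonicity: one conversation's confidence trajectory across turns.
    print(monotonicity_violation_rate([0.2, 0.5, 0.4, 0.8]))
```

Under these assumptions, per-turn calibration would be assessed by calling `expected_calibration_error` separately on the predictions collected at each turn, while `monotonicity_violation_rate` is applied to a single conversation's sequence of confidences.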
