Confidence Should Be Calibrated More Than One Turn Deep
Zhaohan Zhang, Chengzhengxu Li, Xiaoming Liu, Chao Shen, Ziquan Liu, Ioannis Patras

TL;DR
This paper introduces multi-turn calibration for LLMs: confidence is calibrated at each turn, conditioned on the conversation history, rather than treated as a static single-turn property, with the goal of more trustworthy and reliable multi-turn conversations.
Contribution
It proposes MTCal for multi-turn calibration and ConfChat for improved multi-turn response quality, addressing a gap in existing static calibration methods.
Findings
MTCal minimizes ECE@T effectively across turns.
ConfChat enhances factuality and consistency in multi-turn interactions.
Multi-turn calibration is crucial for safe and reliable LLM deployment.
Abstract
Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can…
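To make the idea of a per-turn calibration metric concrete, here is a minimal sketch in Python. It assumes ECE@T is the standard equal-width binned Expected Calibration Error computed only over responses given at turn T; the paper's exact definition is not reproduced in this summary and may differ. The function names, the `(turn, confidence, is_correct)` record layout, and the 10-bin default are illustrative assumptions, not the authors' code.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    conf_sum = [0.0] * n_bins
    acc_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, y in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        acc_sum[idx] += float(y)
        count[idx] += 1
    ece = 0.0
    for s_conf, s_acc, k in zip(conf_sum, acc_sum, count):
        if k:
            ece += (k / n) * abs(s_acc / k - s_conf / k)
    return ece


def ece_by_turn(records, n_bins=10):
    """Group (turn, confidence, is_correct) records by turn and compute ECE per turn.

    Assumption: ECE@T is plain ECE restricted to turn-T responses.
    """
    by_turn = defaultdict(lambda: ([], []))
    for turn, conf, is_correct in records:
        by_turn[turn][0].append(conf)
        by_turn[turn][1].append(is_correct)
    return {
        turn: expected_calibration_error(confs, labels, n_bins)
        for turn, (confs, labels) in sorted(by_turn.items())
    }


# Hypothetical toy data: the model is well calibrated at turn 1,
# then stays fully confident at turn 2 while being right only half the time.
records = [
    (1, 1.0, True),
    (1, 0.0, False),
    (2, 1.0, False),
    (2, 1.0, True),
]
result = ece_by_turn(records)
```

Tracking the returned dictionary across turns is one way to see calibration drift, for example after a persuasive user message, under the assumptions stated above.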
