Confidence Estimation for LLM-Based Dialogue State Tracking

Yi-Jyun Sun; Suvodip Dey; Dilek Hakkani-Tur; Gokhan Tur

arXiv:2409.09629 · cs.CL · September 24, 2024

TL;DR

This paper explores methods to estimate and calibrate confidence scores in large language models for dialogue state tracking, aiming to improve reliability and reduce hallucinations in conversational AI.

Contribution

It provides a comprehensive evaluation of confidence estimation techniques for LLMs in dialogue systems, including novel self-probing methods for closed models and fine-tuning strategies for open models.

Findings

01 Fine-tuning open-weight LLMs improves confidence calibration.

02 Self-probing enhances confidence estimation for closed models.

03 Better calibration correlates with higher joint goal accuracy.
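
For readers unfamiliar with the metric named in the third finding, joint goal accuracy is commonly defined as the fraction of dialogue turns whose predicted state matches the reference state on every slot. The following is a minimal sketch of that common definition, not the authors' evaluation code; the function name and the slot names in the usage note are hypothetical.

```python
from typing import Dict, List

DialogueState = Dict[str, str]  # slot name -> value; representation is an assumption


def joint_goal_accuracy(predicted: List[DialogueState],
                        gold: List[DialogueState]) -> float:
    """Fraction of turns whose predicted state equals the gold state exactly."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have one state per turn")
    if not gold:
        return 0.0
    # A turn counts only if every slot-value pair matches (dict equality).
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

For example, with two turns where the second prediction gets one hypothetical slot (`hotel-stars`) wrong, the sketch returns 0.5, since a single wrong slot makes the whole turn count as incorrect.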

Abstract

Estimation of a model's confidence on its outputs is critical for Conversational AI systems based on large language models (LLMs), especially for reducing hallucination and preventing over-reliance. In this work, we provide an exhaustive exploration of methods, including approaches proposed for open- and closed-weight LLMs, aimed at quantifying and leveraging model uncertainty to improve the reliability of LLM-generated responses, specifically focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Regardless of the model type, well-calibrated confidence scores are essential to handle uncertainties, thereby improving model performance. We evaluate four methods for estimating confidence scores based on softmax, raw token scores, verbalized confidences, and a combination of these methods, using the area under the curve (AUC) metric to assess calibration, with…
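
The confidence signals named in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the helper names, the weighted-average combination rule, and the choice of ROC AUC as the area-under-the-curve variant are all assumptions made for illustration, since the abstract shown here is truncated before those details.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one decoding step's raw token scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_confidence(step_logits: Sequence[Sequence[float]],
                       chosen_ids: Sequence[int]) -> float:
    """Product of the softmax probabilities of the generated tokens."""
    log_prob = sum(math.log(softmax(logits)[tok])
                   for logits, tok in zip(step_logits, chosen_ids))
    return math.exp(log_prob)


def combined_confidence(softmax_conf: float, verbalized_conf: float,
                        weight: float = 0.5) -> float:
    """Hypothetical combination: a weighted average of two signals in [0, 1]."""
    return weight * softmax_conf + (1.0 - weight) * verbalized_conf


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a correct prediction outscores an incorrect one.

    labels: 1 if the prediction was correct, 0 otherwise. Ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, a score is useful when correct slot predictions tend to receive higher confidence than incorrect ones, which is what the pairwise ranking in `roc_auc` measures.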

Code & Models

Repositories

jennycs0830/confidence_score_dst
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems

Methods: Dialogue State Tracking