Reconsidering LLM Uncertainty Estimation Methods in the Wild
Yavuz Bakman, Duygu Nur Yaldiz, Sungmin Kang, Tuo Zhang, Baturalp Buyukates, Salman Avestimehr, Sai Praneeth Karimireddy

TL;DR
This paper systematically evaluates large language model uncertainty estimation methods in real-world scenarios, highlighting their sensitivities, robustness issues, and potential strategies for improvement in practical deployments.
Contribution
It provides a comprehensive assessment of existing uncertainty estimation (UE) methods under realistic conditions, revealing their vulnerabilities and proposing ensemble strategies for better performance.
Findings
Most UE methods are sensitive to threshold selection under distribution shifts.
UE methods are robust to chat history and typos but vulnerable to adversarial prompts.
Ensembling multiple UE scores improves performance significantly.
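To make the ensembling finding concrete, here is a minimal sketch of one simple way to combine several UE scores for the same queries. The summary does not say which ensemble strategies the paper uses, so the approach below (averaging rank-normalized scores) is an assumption chosen for illustration, and the names rank_normalize, ensemble_uncertainty, method_a, and method_b are hypothetical.

```python
from typing import Dict, List


def rank_normalize(scores: List[float]) -> List[float]:
    """Map raw scores to [0, 1] by rank; tied scores share their average rank."""
    n = len(scores)
    if n == 1:
        return [0.5]
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j over the run of scores tied with position i.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return [r / (n - 1) for r in ranks]


def ensemble_uncertainty(scores_by_method: Dict[str, List[float]]) -> List[float]:
    """Average the rank-normalized uncertainty of every method, per query."""
    normalized = [rank_normalize(s) for s in scores_by_method.values()]
    n_queries = len(normalized[0])
    return [
        sum(column[q] for column in normalized) / len(normalized)
        for q in range(n_queries)
    ]


# Toy scores for three queries from two hypothetical UE methods
# that live on very different numeric scales.
toy_scores = {
    "method_a": [0.1, 0.9, 0.5],
    "method_b": [10.0, 30.0, 20.0],
}
combined = ensemble_uncertainty(toy_scores)  # [0.0, 1.0, 0.5]
```

Rank normalization is used here only because raw UE scores from different methods are generally not on a common scale; a learned combiner fitted on labeled data would be another option.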
Abstract
Large Language Model (LLM) Uncertainty Estimation (UE) methods have become a crucial tool for detecting hallucinations in recent years. While numerous UE methods have been proposed, most existing studies evaluate them in isolated short-form QA settings using threshold-independent metrics such as AUROC or PRR. However, real-world deployment of UE methods introduces several challenges. In this work, we systematically examine four key aspects of deploying UE methods in practical settings. Specifically, we assess (1) the sensitivity of UE methods to decision threshold selection, (2) their robustness to query transformations such as typos, adversarial prompts, and prior chat history, (3) their applicability to long-form generation, and (4) strategies for handling multiple UE scores for a single query. Our evaluations on 19 UE methods reveal that most of them are highly sensitive to threshold…
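The contrast the abstract draws between threshold-independent metrics and threshold sensitivity can be sketched with toy numbers. In the example below, all scores, labels, and function names are made up for illustration and are not taken from the paper. A threshold is tuned on one set of queries and reused on a second set whose scores are shifted upward: AUROC stays perfect on both sets, while accuracy at the fixed threshold drops to chance.

```python
from typing import List


def accuracy_at_threshold(scores: List[float], is_wrong: List[bool], threshold: float) -> float:
    """Fraction of queries where 'score > threshold' agrees with 'answer is wrong'."""
    hits = sum((s > threshold) == w for s, w in zip(scores, is_wrong))
    return hits / len(scores)


def best_threshold(scores: List[float], is_wrong: List[bool]) -> float:
    """Among the observed scores, return the lowest threshold with maximal accuracy."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at_threshold(scores, is_wrong, t))


def auroc(scores: List[float], is_wrong: List[bool]) -> float:
    """Probability that a wrong answer outscores a correct one (ties count half)."""
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical calibration set: low uncertainty for correct answers, high for wrong ones.
labels = [False, False, False, True, True, True]
calib_scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
# Hypothetical shifted set: same ordering, but every score is on a higher scale.
shifted_scores = [2.1, 2.2, 2.3, 2.7, 2.8, 2.9]

threshold = best_threshold(calib_scores, labels)                           # 0.3
calib_acc = accuracy_at_threshold(calib_scores, labels, threshold)         # 1.0
shifted_acc = accuracy_at_threshold(shifted_scores, labels, threshold)     # 0.5
calib_auroc = auroc(calib_scores, labels)                                  # 1.0
shifted_auroc = auroc(shifted_scores, labels)                              # 1.0
```

Because AUROC depends only on how the scores are ordered, it cannot reveal this failure; a deployed system that must flag answers with a fixed cutoff is exposed to it.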
Taxonomy
Topics: Nuclear Engineering Thermal-Hydraulics
