Overconfidence is Key: Verbalized Uncertainty Evaluation in Large Language and Vision-Language Models
Tobias Groot, Matias Valdenegro-Toro

TL;DR
This paper evaluates the ability of large language and vision-language models to estimate their own uncertainty, revealing they are generally overconfident and poorly calibrated, especially in complex tasks.
Contribution
It introduces the Japanese Uncertain Scenes dataset and Net Calibration Error metric, providing new tools to assess and improve model uncertainty estimation.
Findings
Models exhibit high calibration error and overconfidence.
VLMs perform poorly in estimating uncertainty for regression tasks.
Proposed prompts improve uncertainty estimation in some cases.
Abstract
Language and Vision-Language Models (LLMs/VLMs) have revolutionized the field of AI through their ability to generate human-like text and understand images, but ensuring their reliability is crucial. This paper aims to evaluate the ability of LLMs (GPT4, GPT-3.5, LLaMA2, and PaLM 2) and VLMs (GPT4V and Gemini Pro Vision) to estimate their verbalized uncertainty via prompting. We propose the new Japanese Uncertain Scenes (JUS) dataset, aimed at testing VLM capabilities via difficult queries and object counting, and the Net Calibration Error (NCE) to measure the direction of miscalibration. Results show that both LLMs and VLMs have a high calibration error and are overconfident most of the time, indicating a poor capability for uncertainty estimation. Additionally, we develop prompts for regression tasks, and we show that VLMs have poor calibration when producing mean/standard deviation and 95%…
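To illustrate what "direction of miscalibration" can mean, here is a minimal sketch in Python. It assumes that NCE is the signed counterpart of the standard Expected Calibration Error (ECE): per-bin accuracy minus per-bin mean confidence, weighted by bin size, so that a negative value indicates overconfidence. This summary does not reproduce the paper's exact definition, so the formula, the function name calibration_errors, and the choice of 10 equal-width bins are assumptions made for illustration only.

```python
def calibration_errors(confidences, correct, n_bins=10):
    """Return (ece, nce) for verbalized confidences in [0, 1].

    Assumed definitions (not taken from the paper text shown here):
      ECE = sum_m (|B_m| / N) * |acc(B_m) - conf(B_m)|
      NCE = sum_m (|B_m| / N) * (acc(B_m) - conf(B_m))
    A negative NCE means confidence exceeds accuracy (overconfidence).
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    # Group (confidence, outcome) pairs into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, float(hit)))

    total = len(confidences)
    ece = 0.0
    nce = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        gap = accuracy - mean_conf
        weight = len(members) / total
        ece += weight * abs(gap)  # magnitude of miscalibration
        nce += weight * gap       # signed: keeps the direction
    return ece, nce


# Four answers stated at 90% confidence, only one of them correct.
ece, nce = calibration_errors([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
print(f"ECE={ece:.2f}  NCE={nce:.2f}")
```

Under these assumptions the two numbers carry different information: ECE reports how far confidence is from accuracy, while the sign of NCE reports whether the model leans overconfident or underconfident. Because the signed gaps can cancel across bins, an NCE near zero does not by itself imply good calibration, so the two are best read together.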
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Dropout · Residual Connection · Softmax · Byte Pair Encoding · Linear Layer · Adam
