TL;DR
This study systematically evaluates how well large vision-language models recognize their knowledge limits, comparing confidence signals and calibration methods to improve their self-awareness in visual question answering tasks.
Contribution
The paper adapts established confidence calibration methods from LLMs to LVLMs and compares the two model families' perception of their knowledge boundaries, highlighting the impact of joint visual-text processing.
Findings
Probabilistic and consistency-based confidences are more reliable indicators than verbalized confidence.
Verbalized confidence often causes overconfidence.
Calibration methods improve LVLMs' perception of their knowledge boundaries.
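To make "overconfidence" and "calibration" concrete, here is a minimal sketch of two common diagnostics: the gap between mean confidence and accuracy, and expected calibration error (ECE). These are standard measures chosen for illustration, not necessarily the metrics the paper reports. The function names and the equal-width binning are assumptions.

```python
def overconfidence_gap(confidences, correct):
    """Mean confidence minus accuracy; positive values indicate overconfidence."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    n = len(confidences)
    return sum(confidences) / n - sum(correct) / n


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1] (illustrative choice of binning)."""
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(k for _, k in members) / len(members)
        # Weight each bin's |accuracy - confidence| by its share of examples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on every question but answers only half correctly would show a gap and an ECE of about 0.4 under this sketch.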
Abstract
Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries: knowing what it knows and what it does not. This paper investigates LVLMs' perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs' perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and…
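The three confidence signals named in the abstract can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the geometric-mean aggregation of token probabilities, the exact-match comparison of sampled answers, and the "NN%" reply format are all hypothetical choices.

```python
import math
import re


def probabilistic_confidence(token_logprobs):
    """Geometric mean of the answer's token probabilities (assumed aggregation)."""
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_confidence(main_answer, sampled_answers):
    """Fraction of sampled answers that match the main answer.

    Matching is case-insensitive exact match after trimming whitespace,
    which is an assumption; a real setup may need fuzzier matching.
    """
    if not sampled_answers:
        raise ValueError("sampled_answers must be non-empty")
    target = main_answer.strip().lower()
    matches = sum(a.strip().lower() == target for a in sampled_answers)
    return matches / len(sampled_answers)


def verbalized_confidence(model_reply):
    """Parse a self-reported percentage such as 'Confidence: 80%' (assumed format).

    Returns a value in [0, 1], or None if no percentage is found.
    """
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", model_reply)
    if match is None:
        return None
    return min(float(match.group(1)) / 100, 1.0)
```

For example, `consistency_confidence("Cat", ["cat", " cat ", "dog", "cat"])` counts three matches out of four samples, while `verbalized_confidence` simply trusts whatever number the model states, which is where overconfidence can enter.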