Towards Statistical Factuality Guarantee for Large Vision-Language Models
Zhuohang Li, Chao Yan, Nicholas J. Jackson, Wendi Cui, Bo Li, Jiaxin Zhang, Bradley A. Malin

TL;DR
This paper introduces ConfLVLM, a conformal prediction framework that provides statistical guarantees to reduce hallucinations in large vision-language models, ensuring more reliable image-conditioned text generation.
Contribution
The paper proposes a novel conformal prediction-based method to verify and filter claims in LVLM outputs, offering finite-sample guarantees on factuality.
Findings
ConfLVLM reduces claim error rate from 87.8% to 10.0%.
Achieves 95.3% true positive rate in filtering unreliable claims.
Applicable to any black-box LVLM with various uncertainty measures.
Abstract
Advancements in Large Vision-Language Models (LVLMs) have demonstrated promising performance in a variety of vision-language tasks involving image-conditioned free-form text generation. However, growing concerns about hallucinations in LVLMs, where the generated text is inconsistent with the visual context, are becoming a major impediment to deploying these models in applications that demand guaranteed reliability. In this paper, we introduce a framework to address this challenge, ConfLVLM, which is grounded on conformal prediction to achieve finite-sample distribution-free statistical guarantees on the factuality of LVLM output. This framework treats an LVLM as a hypothesis generator, where each generated text detail (or claim) is considered an individual hypothesis. It then applies a statistical hypothesis testing procedure to verify each claim using efficient heuristic uncertainty…
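To make the idea of calibrating a claim filter concrete, here is a minimal sketch of a generic split-conformal filtering step. It is not the authors' exact procedure, since the abstract does not spell that out. It assumes each claim already has a scalar uncertainty score (higher means less trustworthy) and that a held-out calibration set carries correctness labels for every claim. The function names `calibrate_threshold` and `filter_claims` are hypothetical.

```python
import math

def calibrate_threshold(calibration, alpha):
    """Pick an uncertainty threshold from labeled calibration responses.

    calibration: list of responses, each a list of (uncertainty, is_correct).
    alpha: tolerated probability that a filtered response keeps a false claim.
    Assumption: calibration and test responses are exchangeable.
    """
    assert 0.0 < alpha < 1.0
    # Per response: the lowest uncertainty given to any incorrect claim
    # (infinity if the response contains no incorrect claim).
    scores = sorted(
        min((u for u, ok in claims if not ok), default=math.inf)
        for claims in calibration
    )
    # Conservative lower-quantile index for split conformal calibration.
    k = math.floor(alpha * (len(scores) + 1))
    if k == 0:
        return -math.inf  # too little calibration data: keep nothing
    return scores[k - 1]

def filter_claims(claims, tau):
    """Keep only the claims whose uncertainty is strictly below tau.

    claims: list of (claim_text, uncertainty).
    """
    return [text for text, u in claims if u < tau]

# Toy calibration set with three responses (made-up numbers).
calibration = [
    [(0.9, False), (0.1, True)],
    [(0.2, True), (0.4, False), (0.7, False)],
    [(0.3, True)],
]
tau = calibrate_threshold(calibration, alpha=0.5)
kept = filter_claims([("a", 0.1), ("b", 0.95), ("c", 0.9)], tau)
```

Under the exchangeability assumption, a filtered test response retains no incorrect claim with probability at least 1 - alpha. A smaller `alpha` gives a stricter threshold and removes more claims.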
