Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Charese H. Smiley, Ivan Brugere, Kundan Thind, and Mohammad M. Ghassemi

TL;DR
This paper introduces BICR (Blind-Image Contrastive Ranking), a confidence estimation framework for large vision-language models (LVLMs). During training it contrasts each real image with a blacked-out version of the same input, so that the resulting confidence scores separate visually grounded predictions from ungrounded ones.
Contribution
BICR is a novel, model-agnostic method that trains a lightweight probe on the hidden states of a frozen LVLM to estimate how reliably a prediction is visually grounded, without extra inference cost.
Findings
BICR achieves better calibration and discrimination than existing baselines across five LVLMs.
It does so with 4-18x fewer parameters than those baselines.
BICR is effective across diverse tasks like VQA, hallucination detection, and medical imaging.
Abstract
Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view,…
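The training objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract is truncated before the loss is fully specified, so the hinge form of the ranking term, the `margin` and `lam` values, the linear probe, and all function names here are assumptions made for illustration. The sketch pairs a standard correctness loss on the real-image confidence with a penalty that activates when the blacked-out view scores too close to, or above, the real-image view.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_confidence(hidden, weights, bias=0.0):
    """Hypothetical linear probe: hidden-state vector -> confidence in (0, 1)."""
    return sigmoid(sum(w * h for w, h in zip(weights, hidden)) + bias)


def ranking_penalty(conf_real, conf_blind, margin=0.1):
    """Hinge penalty (assumed form): zero only when the real-image
    confidence exceeds the blacked-out confidence by at least `margin`."""
    return max(0.0, margin + conf_blind - conf_real)


def probe_loss(conf_real, conf_blind, is_correct, margin=0.1, lam=1.0):
    """Binary cross-entropy on the real-image view plus the weighted
    ranking penalty. `lam` and `margin` are illustrative values."""
    eps = 1e-12  # guards against log(0)
    bce = -(is_correct * math.log(conf_real + eps)
            + (1 - is_correct) * math.log(1.0 - conf_real + eps))
    return bce + lam * ranking_penalty(conf_real, conf_blind, margin)


# Toy hidden states standing in for the two frozen-LVLM forward passes.
weights = [1.0, -1.0]
h_real = [2.0, 0.0]   # real image + question
h_blind = [0.0, 0.0]  # blacked-out image + same question

c_real = probe_confidence(h_real, weights)
c_blind = probe_confidence(h_blind, weights)
loss = probe_loss(c_real, c_blind, is_correct=1)
```

In this sketch the blacked-out hidden state enters only through the loss, so scoring a new example needs just the real-image forward pass, which is consistent with the claim of no extra inference cost.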
