VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Wenyi Xiao, Xinchi Xu, Leilei Gan

TL;DR
This paper introduces VL-Calibration, a reinforcement learning framework that improves confidence calibration and reasoning accuracy in large vision-language models by decoupling visual and reasoning confidence.
Contribution
It proposes a decoupled confidence calibration method for LVLMs that separates visual confidence from reasoning confidence, targeting overconfident hallucinations and improving both calibration and reasoning performance.
Findings
VL-Calibration improves calibration across thirteen benchmarks.
It enhances visual reasoning accuracy in LVLMs.
The method generalizes across different model scales and architectures.
Abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an…
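To make "calibration" concrete, the sketch below scores a single holistic confidence against binary answer-level correctness, the setup the abstract attributes to existing methods. It uses two standard measures, the Brier score and the expected calibration error (ECE). This is an illustration only, not the paper's method or its evaluation code: the function names, the ten-bin default, and the toy data are assumptions, and the summary above does not say which metrics the authors report.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared gap between stated confidence and 0/1 correctness."""
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, a common default
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, float(y)))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_conf - accuracy)
    return ece


# Toy data (made up): a model that always says 0.9 but is right 3 times out of 4.
toy_conf = [0.9, 0.9, 0.9, 0.9]
toy_correct = [True, True, True, False]
toy_ece = expected_calibration_error(toy_conf, toy_correct)
toy_brier = brier_score(toy_conf, toy_correct)
```

Both scores take only a final right-or-wrong label, so neither can tell whether an overconfident error came from misreading the image or from faulty reasoning over a correct reading. That is the conflation the abstract argues a single holistic confidence cannot resolve.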
