Predicting When to Trust Vision-Language Models for Spatial Reasoning
Muhammad Imran, Yugyung Lee

TL;DR
This paper introduces a vision-based confidence estimation framework for vision-language models (VLMs) that predicts when their spatial reasoning outputs can be trusted, with the aim of safer deployment in applications such as robotics.
Contribution
The authors develop a geometric verification method that fuses four signals (geometric alignment, spatial ambiguity, detection quality, and VLM internal uncertainty) to predict whether a VLM's spatial reasoning output is trustworthy, outperforming text-based confidence approaches.
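As a rough illustration of what independent geometric verification could look like, the sketch below scores how well a claimed relation agrees with detected bounding boxes and uses box overlap as an ambiguity measure. Everything in it is an assumption made for illustration: the `Box` type, the four-relation vocabulary, the cosine-based alignment score, and the feature names are hypothetical and are not taken from the paper. The abstract states that the signals are fused via gradient boosting; that step is not shown here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """Axis-aligned detection box in image coordinates (y grows downward)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def center(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    @property
    def area(self):
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


# Hypothetical vocabulary: expected direction of (subject center - object center).
DIRECTIONS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "above": (0.0, -1.0),
    "below": (0.0, 1.0),
}


def geometric_alignment(subject, obj, relation):
    """Cosine between the observed displacement and the claimed direction, in [-1, 1]."""
    ux, uy = DIRECTIONS[relation]
    dx = subject.center[0] - obj.center[0]
    dy = subject.center[1] - obj.center[1]
    norm = hypot(dx, dy)
    if norm == 0:
        return 0.0  # coincident centers carry no directional evidence
    return (dx * ux + dy * uy) / norm


def iou(a, b):
    """Intersection over union, used here as a stand-in for spatial ambiguity."""
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1), min(a.x2, b.x2), min(a.y2, b.y2)).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0


def signal_features(subject, obj, relation, det_scores, vlm_uncertainty):
    """Assemble one feature row; a downstream classifier would consume this."""
    return {
        "alignment": geometric_alignment(subject, obj, relation),
        "ambiguity": iou(subject, obj),
        "detection_quality": min(det_scores),  # assumption: weakest detection dominates
        "vlm_uncertainty": vlm_uncertainty,
    }


cup = Box(0, 0, 10, 10)
laptop = Box(20, 0, 30, 10)
features = signal_features(cup, laptop, "left", det_scores=[0.9, 0.8], vlm_uncertainty=0.3)
```

In this toy case the cup's center lies directly to the left of the laptop's, so a "left" claim gets the maximum alignment score and a "right" claim the minimum. How the paper actually defines each signal may differ.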
Findings
Achieves 0.674 AUROC on BLIP-2 (a 34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP, also above text-based baselines.
Enables selective prediction with 61.9% coverage at 60% accuracy.
Improves scene graph precision from 52.1% to 78.3% through confidence-based pruning.
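The selective prediction and pruning findings above both come down to thresholding a confidence score. The following is a minimal sketch of that idea on made-up data, assuming each prediction carries a confidence in [0, 1] and a known correctness label; the function names, the edge tuple layout, and the numbers are hypothetical and do not reproduce the paper's results.

```python
def selective_metrics(confidences, correct, threshold):
    """Return (coverage, accuracy) when only predictions at or above the threshold are kept."""
    kept = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(confidences) if confidences else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy


def prune_edges(edges, threshold):
    """Drop scene graph edges whose confidence falls below the threshold.

    Each edge is a (subject, relation, object, confidence) tuple (assumed layout).
    """
    return [edge for edge in edges if edge[3] >= threshold]


# Toy data, not taken from the paper.
confidences = [0.9, 0.8, 0.4, 0.2]
correct = [True, True, False, True]

coverage, accuracy = selective_metrics(confidences, correct, threshold=0.5)

edges = [
    ("cup", "left", "laptop", 0.9),
    ("lamp", "above", "desk", 0.8),
    ("chair", "right", "desk", 0.4),
    ("book", "below", "shelf", 0.2),
]
pruned = prune_edges(edges, threshold=0.5)
```

Raising the threshold trades coverage for accuracy: fewer predictions are kept, but the kept ones tend to be more reliable when the confidence score is informative. For a scene graph, the accuracy of the kept edges is the graph's precision after pruning.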
Abstract
Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1%…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety
