Predicting When to Trust Vision-Language Models for Spatial Reasoning

Muhammad Imran; Yugyung Lee

arXiv:2601.11644·cs.CV·January 21, 2026

TL;DR

This paper introduces a vision-based confidence estimation framework that predicts when a VLM's spatial reasoning output can be trusted by checking the model's claim against geometry from an object detector, targeting safer deployment in robotics and autonomous systems.

Contribution

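The authors develop a geometric verification method that fuses four signals (geometric alignment, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty) via gradient boosting to predict whether a VLM's spatial prediction is trustworthy, outperforming text-based confidence approaches.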

Findings

01

Achieves 0.674 AUROC on BLIP-2 (a 34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP.

02

Enables selective prediction with 61.9% coverage at 60% accuracy.

03

Improves scene graph precision from 52.1% to 78.3% through confidence-based pruning.
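
The coverage and accuracy figures above come from selective prediction: keep only the outputs whose confidence clears a threshold, then measure how many were kept and how many of those were right. A minimal sketch of that bookkeeping follows. The function name `selective_metrics`, the threshold, and the toy scores are illustrative assumptions, not the authors' code or data.

```python
from typing import Sequence, Tuple


def selective_metrics(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Return (coverage, accuracy) when only predictions with
    confidence >= threshold are accepted.

    coverage = fraction of all predictions that are kept
    accuracy = fraction of kept predictions that are correct
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    # Keep the correctness flag of every prediction that clears the threshold.
    kept = [ok for score, ok in zip(confidences, correct) if score >= threshold]
    coverage = len(kept) / len(confidences) if confidences else 0.0
    accuracy = sum(kept) / len(kept) if kept else 0.0
    return coverage, accuracy


# Toy data (hypothetical): five predictions with confidence scores.
scores = [0.9, 0.8, 0.4, 0.3, 0.7]
labels = [True, True, False, True, False]
coverage, accuracy = selective_metrics(scores, labels, threshold=0.5)
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f}")
```

Confidence-based pruning of a scene graph is the same operation applied to edges: drop relations below the threshold and report precision over what remains.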

Abstract

Vision-Language Models (VLMs) demonstrate impressive capabilities across multimodal tasks, yet exhibit systematic spatial reasoning failures, achieving only 49% (CLIP) to 54% (BLIP-2) accuracy on basic directional relationships. For safe deployment in robotics and autonomous systems, we need to predict when to trust VLM spatial predictions rather than accepting all outputs. We propose a vision-based confidence estimation framework that validates VLM predictions through independent geometric verification using object detection. Unlike text-based approaches relying on self-assessment, our method fuses four signals via gradient boosting: geometric alignment between VLM claims and coordinates, spatial ambiguity from overlap, detection quality, and VLM internal uncertainty. We achieve 0.674 AUROC on BLIP-2 (34.0% improvement over text-based baselines) and 0.583 AUROC on CLIP (16.1%…
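
Two of the four signals named in the abstract, geometric alignment and spatial ambiguity from overlap, can be illustrated with detector bounding boxes alone. The sketch below is one plausible formulation under stated assumptions: boxes are `(x1, y1, x2, y2)` in image coordinates with y growing downward, alignment is the cosine between the claimed direction and the offset between box centers, and ambiguity is measured by IoU. The paper's actual feature definitions are not given in this summary, and the gradient-boosting fusion step is not shown.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y grows downward

# Unit vectors for each claimed relation "subject is <relation> of object".
# These four relation names are an assumption for illustration.
DIRECTIONS = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "above": (0.0, -1.0),
    "below": (0.0, 1.0),
}


def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def geometric_alignment(relation: str, subj: Box, obj: Box) -> float:
    """Cosine in [-1, 1] between the claimed direction and the
    subject-minus-object center offset; 1 means the boxes agree with the claim."""
    sx, sy = center(subj)
    ox, oy = center(obj)
    dx, dy = sx - ox, sy - oy
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return 0.0  # coincident centers carry no directional evidence
    ux, uy = DIRECTIONS[relation]
    return (dx * ux + dy * uy) / norm


def iou(a: Box, b: Box) -> float:
    """Intersection over union; higher overlap suggests a more ambiguous relation."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


cup, laptop = (0, 0, 10, 10), (20, 0, 30, 10)
print(geometric_alignment("left", cup, laptop), iou(cup, laptop))
```

In a full pipeline these values would sit alongside detection quality and the VLM's own uncertainty as input features to the learned confidence model.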

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety