Lost in Space? Vision-Language Models Struggle with Relative Camera Pose Estimation
Ken Deng, Yifu Qiu, Yoni Kasten, Shay B. Cohen, Yftah Ziser

TL;DR
This paper evaluates vision-language models' ability to estimate relative camera pose from image pairs, revealing significant gaps compared to humans and geometric methods, and identifies specific missing capabilities.
Contribution
It introduces VRRPI-Bench and VRRPI-Diag benchmarks for assessing multi-view spatial reasoning in VLMs, highlighting their limitations in cross-view correspondence and view-consistent reasoning.
Findings
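Humans (0.91) and geometric pipelines such as LoFTR (0.99) solve the task reliably, while the best VLM reaches only 0.66.
Most VLMs remain near random once reasoning must span multiple views, even though strong VLMs are near ceiling on single-image benchmarks.
Failures point to missing capabilities such as cross-view correspondence and view-consistent reasoning.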
Abstract
We study whether vision-language models (VLMs) can solve relative camera pose estimation (RCPE) from image pairs, a direct test of multi-view spatial reasoning. We cast RCPE as a discrete verbal classification task and introduce VRRPI-Bench, built from real RGB-D frames with object-centric camera motion, and VRRPI-Diag, which isolates individual motion degrees of freedom. Humans (0.91) and specialized geometric pipelines such as LoFTR (0.99) solve the task reliably, yet the best VLM reaches only 0.66 and most others remain near random. Our analyses show that this gap is not basic spatial competence: strong VLMs are near ceiling on single-image benchmarks, but most remain near random once reasoning must span views. They are unstable under source-target reversal (best 59.7% consistency) and remain weak even in simplified single-DoF settings, especially on optical-axis…
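The source-target reversal check mentioned in the abstract can be illustrated with a small sketch. It assumes a hypothetical label set of opposing motion directions and counts a pair as consistent when the prediction for the reversed pair is the opposite of the prediction for the original pair. The label names, the `OPPOSITE` mapping, and the function `reversal_consistency` are illustrative assumptions; the abstract gives neither the paper's actual classes nor its exact definition of consistency.

```python
from typing import Sequence

# Hypothetical label set: the abstract does not list the actual classes.
OPPOSITE = {
    "left": "right",
    "right": "left",
    "up": "down",
    "down": "up",
    "forward": "backward",
    "backward": "forward",
}


def reversal_consistency(forward: Sequence[str], reverse: Sequence[str]) -> float:
    """Fraction of pairs whose reversed-order prediction is the opposite label.

    forward[i] is the predicted motion for (source, target) of pair i;
    reverse[i] is the predicted motion for the same pair given as (target, source).
    """
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse must have the same length")
    if not forward:
        raise ValueError("at least one prediction pair is required")
    # Unknown labels never count as consistent: OPPOSITE.get returns None for them.
    hits = sum(1 for f, r in zip(forward, reverse) if OPPOSITE.get(f) == r)
    return hits / len(forward)


# Toy predictions, not data from the paper.
score = reversal_consistency(
    ["left", "up", "forward", "right"],
    ["right", "up", "backward", "right"],
)
print(score)
```

In this toy input two of the four pairs flip correctly, so the score is 0.5. A model that reasons consistently across views would score close to 1.0 under this definition, regardless of whether its individual predictions are correct.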
