Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
Sahithya Ravi, Gabriel Sarch, Vibhav Vineet, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

TL;DR
This paper introduces Disjoint-3DQA, a benchmark for evaluating egocentric spatial reasoning in vision-language models across disjoint frames, revealing current models' limitations in constructing 3D scene representations over time.
Contribution
The paper presents a new benchmark, Disjoint-3DQA, to assess long-horizon spatial reasoning in VLMs and analyzes their performance, highlighting the importance of 3D scene understanding.
Findings
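The seven evaluated VLMs lag behind human performance by 28%.
Accuracy drops from 60% to 30% as the temporal gap between the two objects widens.
Trajectories and bird's-eye-view projections give only marginal gains, whereas oracle 3D coordinates improve performance by 20%.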
Abstract
An embodied AI assistant operating on egocentric video must integrate spatial cues across time, for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene…
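To illustrate why oracle 3D coordinates can make this kind of question much easier, here is a minimal sketch of how the relative placement of two objects follows directly from their coordinates once they are known. It is not the benchmark's own code: the function name `relative_direction`, the answer vocabulary, and the camera axis convention (+x right, +y up, +z forward) are all assumptions made for illustration.

```python
import numpy as np

def relative_direction(obj_a_world, obj_b_world, cam_rotation):
    """Describe where object A lies relative to object B from a camera viewpoint.

    Hypothetical helper, not part of Disjoint-3DQA.
    cam_rotation is a 3x3 camera-to-world rotation matrix. The camera axes are
    assumed to be +x right, +y up, +z forward. The camera position is not
    needed because it cancels when the two points are subtracted.
    """
    R = np.asarray(cam_rotation, dtype=float)
    offset_world = np.asarray(obj_a_world, dtype=float) - np.asarray(obj_b_world, dtype=float)
    # Express the A-minus-B offset in the camera frame (world -> camera is R^T).
    offset_cam = R.T @ offset_world
    return {
        "horizontal": "right" if offset_cam[0] > 0 else "left",
        "vertical": "above" if offset_cam[1] > 0 else "below",
        "depth": "farther" if offset_cam[2] > 0 else "closer",
    }

# Usage: the two objects never need to be visible in the same frame,
# as long as both have coordinates in a shared world frame.
answer = relative_direction([1.0, 0.0, 2.0], [0.0, 1.0, 1.0], np.eye(3))
```

With coordinates in hand, the answer reduces to a change of frame and three sign checks; the difficulty the abstract points to is recovering such a shared 3D representation from disjoint frames in the first place.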
Taxonomy
Topics: Constraint Satisfaction and Optimization
