Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

Sahithya Ravi; Gabriel Sarch; Vibhav Vineet; Andrew D. Wilson; Balasaravanan Thoravi Kumaravel

arXiv:2505.24257 · cs.CV · June 2, 2025

TL;DR

This paper introduces Disjoint-3DQA, a benchmark for evaluating egocentric spatial reasoning in vision-language models across disjoint frames, revealing current models' limitations in constructing 3D scene representations over time.

Contribution

The paper presents a new benchmark, Disjoint-3DQA, to assess long-horizon spatial reasoning in VLMs and analyzes their performance, highlighting the importance of 3D scene understanding.

Findings

01. Models lag 28% behind human performance.

02. Accuracy drops from 60% to 30% as temporal gaps increase.

03. Oracle 3D coordinates improve performance by 20%.

Abstract

An embodied AI assistant operating on egocentric video must integrate spatial cues across time: for instance, determining where an object A, glimpsed a few moments ago, lies relative to an object B encountered later. We introduce Disjoint-3DQA, a generative QA benchmark that evaluates this ability of VLMs by posing questions about object pairs that are not co-visible in the same frame. We evaluated seven state-of-the-art VLMs and found that models lag behind human performance by 28%, with steeper declines in accuracy (60% to 30%) as the temporal gap widens. Our analysis further reveals that providing trajectories or bird's-eye-view projections to VLMs results in only marginal improvements, whereas providing oracle 3D coordinates leads to a substantial 20% performance increase. This highlights a core bottleneck of multi-frame VLMs in constructing and maintaining 3D scene…
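To make the "not co-visible" condition concrete, here is a minimal Python sketch of how candidate object pairs might be selected. It is not the authors' pipeline: the `visibility` mapping, the `disjoint_pairs` helper, and the definition of temporal gap as the smallest frame-index distance between the two objects' sightings are all assumptions made for illustration.

```python
from itertools import combinations


def disjoint_pairs(visibility):
    """Return (a, b, gap) for object pairs that never share a frame.

    visibility: hypothetical dict mapping object name -> set of frame
    indices in which that object is visible.
    gap: assumed here to be the smallest frame-index distance between
    any sighting of a and any sighting of b.
    """
    pairs = []
    for a, b in combinations(sorted(visibility), 2):
        frames_a, frames_b = visibility[a], visibility[b]
        if not frames_a or not frames_b:
            continue  # object never seen; no question can be posed
        if frames_a & frames_b:
            continue  # co-visible in at least one frame, so not disjoint
        gap = min(abs(i - j) for i in frames_a for j in frames_b)
        pairs.append((a, b, gap))
    return pairs


# Toy example: the mug and kettle share frame 2, so only pairs
# involving the lamp qualify.
visibility = {
    "mug": {0, 1, 2},
    "kettle": {2, 3},
    "lamp": {10, 11},
}
pairs = disjoint_pairs(visibility)
print(pairs)
```

Under this assumed definition, bucketing the returned pairs by `gap` is one way to examine how accuracy varies as the temporal gap widens.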

Taxonomy

Topics: Constraint Satisfaction and Optimization