Spatial-Conditioned Reasoning in Long-Egocentric Videos

James Tribble, Hao Wang, Si-En Hong, Chaoyi Zhou, Ashish Bastola, Siyu Huang, and Abolfazl Razi

arXiv:2601.18100 · cs.CV · April 9, 2026

TL;DR

This paper investigates how explicit spatial signals and depth information affect vision-language models' understanding of long egocentric videos for navigation, finding a trade-off between general-purpose accuracy and spatial specialization alongside improvements on safety-critical tasks.

Contribution

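The paper introduces Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmarks multiple vision-language models on navigation-oriented spatial queries, measuring how explicit spatial signals and depth-RGB fusion affect spatial reasoning in egocentric video.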

Findings

01. Depth-aware representations improve pedestrian detection.

02. Spatial grounding enhances obstruction detection.

03. A trade-off exists between general accuracy and spatial specialization.

Abstract

Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded…
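
The abstract describes fusing depth maps with RGB frames at the input level but, as quoted here, does not say how the fusion is done. The sketch below shows one simple possibility, alpha-blending a normalized depth map into each frame, so that an unmodified model still receives an ordinary three-channel image. The function name `fuse_rgb_depth`, the `alpha` parameter, and the near-is-bright grayscale rendering are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def fuse_rgb_depth(rgb: np.ndarray, depth: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend a per-frame depth map into an RGB frame (hypothetical helper).

    rgb:   (H, W, 3) uint8 frame.
    depth: (H, W) numeric array, larger values meaning farther away.
    alpha: weight of the depth rendering in the blend; 0 keeps pure RGB.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Normalize depth to [0, 1]; a constant map normalizes to all zeros.
    d_min, d_max = float(depth.min()), float(depth.max())
    span = d_max - d_min
    if span > 0:
        norm = (depth.astype(np.float64) - d_min) / span
    else:
        norm = np.zeros(depth.shape, dtype=np.float64)

    # Assumed rendering: near pixels bright, far pixels dark, as a gray image.
    depth_gray = ((1.0 - norm) * 255.0)[..., None].repeat(3, axis=2)

    # Weighted per-pixel blend, rounded back to an 8-bit image.
    fused = (1.0 - alpha) * rgb.astype(np.float64) + alpha * depth_gray
    return np.clip(np.rint(fused), 0, 255).astype(np.uint8)
```

Because the output has the same shape and dtype as the input frame, it could be passed to a vision-language model without changing its architecture or inference procedure, which matches the constraint stated in the abstract. Whether the authors blend, concatenate, or overlay depth in some other way is not determined by the text above.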
