SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning

Tin Stribor Sohn; Maximilian Dillitzer; Jason J. Corso; Eric Sax

arXiv:2512.16461 · cs.CV · December 19, 2025

TL;DR

SNOW is a training-free framework that fuses vision-language model semantics with point cloud geometry and temporal consistency into a unified 4D scene representation, so that autonomous robots can reason about dynamic environments.

Contribution

SNOW introduces a training-free, backbone-agnostic method that integrates multimodal tokens into a 4D scene graph for improved embodied reasoning.

Findings

1. Achieves state-of-the-art results on multiple benchmarks.

2. Provides accurate 4D scene representations for autonomous navigation.

3. Enables direct interpretation of spatial and temporal scene structure.

Abstract

Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture…
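The proposal step described in the abstract (point cloud clusters guiding image segmentation) can be illustrated with a minimal sketch. It assumes that cluster labels have already been computed by an HDBSCAN implementation (with -1 marking noise, the usual HDBSCAN convention), that the points are already expressed in the camera frame, and that each cluster's centroid is projected through a pinhole model to serve as a point prompt for a promptable segmenter. The abstract does not say how SNOW derives its prompts, so the centroid choice, the function names, and the intrinsics parameters (fx, fy, cx, cy) are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[float, float]

NOISE_LABEL = -1  # HDBSCAN convention for points assigned to no cluster


def cluster_centroids(points: Sequence[Point3D],
                      labels: Sequence[int]) -> Dict[int, Point3D]:
    """Average the 3D points of each cluster, skipping noise points."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != NOISE_LABEL:
            groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }


def project_to_pixel(point: Point3D, fx: float, fy: float,
                     cx: float, cy: float) -> Optional[Pixel]:
    """Pinhole projection of a camera-frame point; None if behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def point_prompts(points: Sequence[Point3D], labels: Sequence[int],
                  fx: float, fy: float, cx: float, cy: float,
                  width: int, height: int) -> Dict[int, Pixel]:
    """One candidate pixel prompt per cluster whose centroid lands in the image."""
    prompts = {}
    for label, centroid in cluster_centroids(points, labels).items():
        pixel = project_to_pixel(centroid, fx, fy, cx, cy)
        if pixel is None:
            continue  # centroid is behind the image plane
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            prompts[label] = pixel
    return prompts
```

Each returned pixel could then be handed to a promptable segmentation model as a foreground point. A real system would likely pass several points or a box per cluster rather than a single centroid.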

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Robot Manipulation and Learning