SNOW: Spatio-Temporal Scene Understanding with World Knowledge for Open-World Embodied Reasoning
Tin Stribor Sohn, Maximilian Dillitzer, Jason J. Corso, Eric Sax

TL;DR
SNOW is a training-free framework that fuses vision-language model semantics with point cloud geometry and temporal consistency into a unified 4D scene representation, so that autonomous robots can reason about dynamic environments.
Contribution
SNOW introduces a training-free, backbone-agnostic method that integrates multimodal tokens into a 4D scene graph for improved embodied reasoning.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Provides accurate 4D scene representations for autonomous navigation.
Enables direct interpretation of spatial and temporal scene structure.
Abstract
Autonomous robotic systems require spatio-temporal understanding of dynamic environments to ensure reliable navigation and interaction. While Vision-Language Models (VLMs) provide open-world semantic priors, they lack grounding in 3D geometry and temporal dynamics. Conversely, geometric perception captures structure and motion but remains semantically sparse. We propose SNOW (Scene Understanding with Open-World Knowledge), a training-free and backbone-agnostic framework for unified 4D scene understanding that integrates VLM-derived semantics with point cloud geometry and temporal consistency. SNOW processes synchronized RGB images and 3D point clouds, using HDBSCAN clustering to generate object-level proposals that guide SAM2-based segmentation. Each segmented region is encoded through our proposed Spatio-Temporal Tokenized Patch Encoding (STEP), producing multimodal tokens that capture…
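The abstract says that HDBSCAN clusters act as object-level proposals that guide SAM2-based segmentation, but it does not say how a 3D cluster becomes a 2D prompt. The sketch below illustrates one plausible route, and every detail in it is an assumption rather than the authors' method: cluster labels are taken as given (with -1 marking noise, the usual HDBSCAN convention), points are assumed to be in the camera frame already, and each cluster centroid is projected with a plain pinhole model to yield one point prompt per cluster. The names cluster_centroids, project_to_pixel, and point_prompts are hypothetical.

```python
from collections import defaultdict

NOISE_LABEL = -1  # assumption: noise points carry label -1, as HDBSCAN implementations commonly do


def cluster_centroids(points, labels):
    """Average the 3D points of each cluster, skipping noise points."""
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for (x, y, z), label in zip(points, labels):
        if label == NOISE_LABEL:
            continue
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1
    return {
        label: (sx / n, sy / n, sz / n)
        for label, (sx, sy, sz, n) in sums.items()
    }


def project_to_pixel(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point; None if it lies behind the camera."""
    x, y, z = point
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)


def point_prompts(points, labels, fx, fy, cx, cy, width, height):
    """Return {cluster label: (u, v)} for centroids that project inside the image."""
    prompts = {}
    for label, centroid in cluster_centroids(points, labels).items():
        pixel = project_to_pixel(centroid, fx, fy, cx, cy)
        if pixel is None:
            continue
        u, v = pixel
        if 0 <= u < width and 0 <= v < height:
            prompts[label] = (u, v)
    return prompts


if __name__ == "__main__":
    # Hypothetical data: two points in cluster 0, one behind the camera, one noise point.
    pts = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0), (1.0, 1.0, -1.0), (5.0, 5.0, 5.0)]
    lbl = [0, 0, 1, -1]
    print(point_prompts(pts, lbl, 100.0, 100.0, 50.0, 50.0, 100, 100))
```

A single centroid is the simplest possible prompt. A projected bounding box per cluster would be an equally reasonable choice, and the truncated abstract does not indicate which, if either, SNOW uses.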
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Robot Manipulation and Learning
