R4: Retrieval-Augmented Reasoning for Vision-Language Models in 4D Spatio-Temporal Space

Tin Stribor Sohn; Maximilian Dillitzer; Jason J. Corso; Eric Sax

arXiv:2512.15940 · cs.CV · December 19, 2025

Open Access

TL;DR

R4 is a training-free, retrieval-augmented framework for vision-language models. It builds a persistent 4D spatio-temporal knowledge database and retrieves from it at inference time to improve reasoning in dynamic environments.

Contribution

This work presents R4, a 4D retrieval-augmented reasoning framework for VLMs that operates directly in spatio-temporal space to improve understanding of dynamic environments.

Findings

01. Significant improvement in embodied question answering accuracy.

02. Enhanced navigation performance in dynamic environments.

03. Effective retrieval of 4D spatio-temporal observations.

Abstract

Humans perceive and reason about their surroundings in four dimensions by building persistent, structured internal representations that encode semantic meaning, spatial layout, and temporal dynamics. These multimodal memories enable them to recall past events, infer unobserved states, and integrate new information into context-dependent reasoning. Inspired by this capability, we introduce R4, a training-free framework for retrieval-augmented reasoning in 4D spatio-temporal space that equips vision-language models (VLMs) with structured, lifelong memory. R4 continuously constructs a 4D knowledge database by anchoring object-level semantic descriptions in metric space and time, yielding a persistent world model that can be shared across agents. At inference, natural language queries are decomposed into semantic, spatial, and temporal keys to retrieve relevant observations, which are…
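The retrieval idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class names (`Observation`, `Query`, `Memory4D`), the filtering order, and the word-overlap score are all assumptions made for illustration. A real system would presumably use learned embeddings for the semantic key and a language model to decompose the query, neither of which is shown here.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres, hypothetical world frame


@dataclass
class Observation:
    description: str  # object-level semantic description
    position: Point   # where the object was observed
    timestamp: float  # when it was observed, in seconds


@dataclass
class Query:
    semantic: str                    # semantic key
    center: Optional[Point] = None   # spatial key: centre of the search region
    radius: float = float("inf")     # spatial key: search radius in metres
    t_start: float = float("-inf")   # temporal key: window start
    t_end: float = float("inf")      # temporal key: window end


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; an assumed stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


class Memory4D:
    """Hypothetical in-memory store of observations anchored in space and time."""

    def __init__(self) -> None:
        self.entries: List[Observation] = []

    def add(self, obs: Observation) -> None:
        self.entries.append(obs)

    def retrieve(self, query: Query, top_k: int = 3) -> List[Observation]:
        scored = []
        for obs in self.entries:
            # Temporal key: keep observations inside the time window.
            if not (query.t_start <= obs.timestamp <= query.t_end):
                continue
            # Spatial key: keep observations within the radius, if a centre is given.
            if query.center is not None and dist(obs.position, query.center) > query.radius:
                continue
            # Semantic key: rank the remaining observations by text similarity.
            score = token_overlap(query.semantic, obs.description)
            if score > 0:
                scored.append((score, obs))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [obs for _, obs in scored[:top_k]]


memory = Memory4D()
memory.add(Observation("red car parked near entrance", (2.0, 1.0, 0.0), 10.0))
memory.add(Observation("red car driving away", (30.0, 5.0, 0.0), 60.0))
memory.add(Observation("pedestrian with umbrella", (2.5, 1.5, 0.0), 12.0))

# "Where was the red car within 5 m of the origin during the first 30 s?"
hits = memory.retrieve(
    Query("red car", center=(0.0, 0.0, 0.0), radius=5.0, t_start=0.0, t_end=30.0)
)
for obs in hits:
    print(obs.description, obs.position, obs.timestamp)
```

In this sketch the spatial and temporal keys act as hard filters and the semantic key ranks what remains. That ordering is a design choice of the example, not something the abstract specifies.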

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Spatial Cognition and Navigation