RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics
Yuzhi Huang, Jie Wu, Weijue Bu, Ziyi Xiong, Gaoyang Jiang, Ye Li, Kangye Ji, Shuzhao Xie, Yue Huang, Chenglei Wu, Jingyan Jiang, Zhi Wang

TL;DR
RoboStream introduces a novel framework for robotic manipulation that incorporates persistent spatio-temporal reasoning and causal memory, significantly improving long-horizon task performance without additional training.
Contribution
The paper proposes RoboStream, a training-free approach that uses Spatio-Temporal Fusion Tokens and a Causal Spatio-Temporal Graph to give vision-language models persistent object grounding and causal reasoning for robotics.
Findings
Achieves 90.5% success on RLBench long-horizon tasks
Attains 44.4% success on real-world block-building tasks
Significantly outperforms baseline methods such as SoFar and VoxPoser
Abstract
Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
