RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics

Yuzhi Huang; Jie Wu; Weijue Bu; Ziyi Xiong; Gaoyang Jiang; Ye Li; Kangye Ji; Shuzhao Xie; Yue Huang; Chenglei Wu; Jingyan Jiang; Zhi Wang

arXiv:2603.12939 · cs.RO · March 16, 2026



TL;DR

RoboStream is a training-free framework for robotic manipulation that adds persistent spatio-temporal reasoning and causal memory to VLM-based planners, reaching 90.5% success on RLBench long-horizon tasks.

Contribution

The paper proposes RoboStream, a training-free approach that uses Spatio-Temporal Fusion Tokens and a Causal Spatio-Temporal Graph to give vision-language models persistent object grounding and causal reasoning for robotics.

Findings

01. Achieves 90.5% success on RLBench long-horizon tasks

02. Attains 44.4% success on real-world block-building tasks

03. Outperforms baseline methods like SoFar and VoxPoser significantly

Abstract

Enabling reliable long-horizon robotic manipulation is a crucial step toward open-world embodied intelligence. However, VLM-based planners treat each step as an isolated observation-to-action mapping, forcing them to reinfer scene geometry from raw pixels at every decision point while remaining unaware of how prior actions have reshaped the environment. Despite strong short-horizon performance, these systems lack the spatio-temporal reasoning required for persistent geometric anchoring and memory of action-triggered state transitions. Without persistent state tracking, perceptual errors accumulate across the execution horizon, temporarily occluded objects are catastrophically forgotten, and these compounding failures lead to precondition violations that cascade through subsequent steps. In contrast, humans maintain a persistent mental model that continuously tracks spatial relations and…
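
The failure mode described in the abstract, where a planner that re-infers the scene at every step forgets temporarily occluded objects, can be illustrated with a toy sketch. This is not RoboStream's implementation: `SceneMemory`, `stateless_view`, the object names, and the coordinates are all hypothetical, chosen only to contrast per-step observation with a state that persists across steps and logs action-triggered changes.

```python
from dataclasses import dataclass, field

Position = tuple[float, float, float]


@dataclass
class SceneMemory:
    """Toy persistent scene state (hypothetical; not the paper's method)."""
    positions: dict[str, Position] = field(default_factory=dict)
    last_seen: dict[str, int] = field(default_factory=dict)
    events: list[tuple[int, str, str]] = field(default_factory=list)
    step: int = 0

    def observe(self, detections: dict[str, Position]) -> None:
        """Merge one frame's detections; objects absent from the frame are kept."""
        self.step += 1
        for name, pos in detections.items():
            self.positions[name] = pos
            self.last_seen[name] = self.step

    def record_action(self, action: str, obj: str, new_pos: Position) -> None:
        """Log an action-triggered state change and update the moved object."""
        self.events.append((self.step, action, obj))
        self.positions[obj] = new_pos

    def precondition_known(self, obj: str) -> bool:
        """A step that needs `obj` can be planned only if its position is tracked."""
        return obj in self.positions


def stateless_view(detections: dict[str, Position]) -> dict[str, Position]:
    """Per-step baseline: only what the current frame shows is available."""
    return dict(detections)


# Step 1: both blocks are visible.
memory = SceneMemory()
memory.observe({"red_block": (0.1, 0.2, 0.0), "blue_block": (0.3, 0.2, 0.0)})

# The robot stacks the red block on the blue one, hiding the blue block.
memory.record_action("place_on", "red_block", (0.3, 0.2, 0.05))

# Step 2: the camera now sees only the red block.
frame = {"red_block": (0.3, 0.2, 0.05)}
memory.observe(frame)

blue_in_memory = memory.precondition_known("blue_block")
blue_in_stateless = "blue_block" in stateless_view(frame)
```

In this sketch the persistent state still holds the blue block's last known position after it is occluded, while the per-step view has lost it, which is the kind of gap that leads to precondition violations in later steps.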


Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI