StreamingClaw Technical Report

Jiawei Chen; Zhe Chen; Chaoqun Du; Maokui He; Wei He; Hengtao Li; Qizhen Li; Zide Liu; Hao Ma; Xuhao Pan; Chang Ren; Xudong Rao; Xintian Shen; Chenfeng Wang; Tao Wei; Chengjun Yu; Pengfei Yu; Shengyu Yao; Chunpeng Zhou; Kun Zhan; Lihao Zheng; Pan Zhou; Xuhan Zhu; Yufei Zheng

arXiv:2603.22120 · cs.CV · March 27, 2026

TL;DR

StreamingClaw is a comprehensive framework enabling real-time, multimodal streaming video understanding and embodied intelligence, addressing key limitations of existing agents in dynamic real-world environments.

Contribution

It introduces a unified agent framework supporting real-time reasoning, long-term multimodal memory, and closed-loop perception-decision-action, enhancing capabilities for embodied intelligence.

Findings

01 Supports real-time streaming reasoning and proactive interaction.

02 Enables multimodal long-term memory storage and sharing.

03 Compatible with open-source frameworks like OpenClaw.

Abstract

Emerging applications such as embodied intelligence, AI hardware, autonomous driving, and intelligent cockpits rely on a real-time perception-decision-action closed loop, posing stringent challenges for streaming video understanding. However, current agents mostly suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming input. These shortcomings have become a key bottleneck, preventing agents from sustaining perception, making real-time decisions, and executing closed-loop actions in complex real-world environments, and constraining their deployment and potential in dynamic, open physical worlds. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Speech and dialogue systems