OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

Yibin Yan; Jilan Xu; Shangzhe Di; Haoning Wu; Weidi Xie

arXiv:2603.12265·cs.CV·March 13, 2026

TL;DR

OmniStream is a unified, real-time streaming visual backbone that integrates perception, reconstruction, and action capabilities, enabling versatile and general-purpose visual understanding in continuous video streams.

Contribution

The paper introduces OmniStream, a streaming visual backbone built on causal spatiotemporal attention and pre-trained with a multi-task framework on 29 datasets, aiming for broad generalization without task-specific fine-tuning of the backbone.

Findings

01 Competitive performance across diverse tasks and datasets

02 Effective online processing with a frozen backbone

03 Demonstrates potential for general-purpose visual agents

Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even…
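
The abstract describes frame-by-frame online processing through causal attention and a persistent KV-cache. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: a single attention head appends each incoming frame's keys and values to a cache, and the result is compared against an offline pass that applies a frame-level (block-causal) mask over the whole clip. The names `StreamingAttention`, `step`, and `offline_block_causal` are hypothetical, the weights are random, and positional embeddings such as 3D-RoPE are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; masked (-inf) entries get zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StreamingAttention:
    """Single-head attention with a persistent key/value cache.

    Illustrative only: random projections, no positional embeddings,
    and not taken from the OmniStream code base.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        # The cache starts empty and grows by one frame per step.
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def step(self, frame_tokens):
        """Process one frame of shape (num_tokens, dim) and return its output."""
        q = frame_tokens @ self.wq
        self.k_cache = np.concatenate([self.k_cache, frame_tokens @ self.wk])
        self.v_cache = np.concatenate([self.v_cache, frame_tokens @ self.wv])
        # Queries see the current frame and every cached past frame, never the future.
        scores = q @ self.k_cache.T / np.sqrt(self.dim)
        return softmax(scores) @ self.v_cache


def offline_block_causal(attn, frames):
    """Reference: attention over the full clip with a frame-level causal mask."""
    num_frames, num_tokens, dim = frames.shape
    x = frames.reshape(num_frames * num_tokens, dim)
    q, k, v = x @ attn.wq, x @ attn.wk, x @ attn.wv
    frame_id = np.repeat(np.arange(num_frames), num_tokens)
    # A token may attend to tokens from its own frame or any earlier frame.
    allowed = frame_id[:, None] >= frame_id[None, :]
    scores = np.where(allowed, q @ k.T / np.sqrt(dim), -np.inf)
    return (softmax(scores) @ v).reshape(num_frames, num_tokens, dim)


# Toy stream: 4 frames, 3 tokens per frame, 8 channels.
frames = np.random.default_rng(1).standard_normal((4, 3, 8))
attn = StreamingAttention(dim=8)
online = np.stack([attn.step(frame) for frame in frames])
offline = offline_block_causal(attn, frames)
print("online matches offline:", np.allclose(online, offline))
```

In this sketch the online and offline outputs should agree up to floating-point error, which is the property that lets a causally masked model be run one frame at a time. The cache grows linearly with the number of frames seen; how a real system bounds or manages that growth is outside what the abstract states.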

Code & Models

Models

StreamFormer/OmniStream (Hugging Face model, 208 downloads, 2 likes)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition