OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie

TL;DR
OmniStream is a unified, real-time streaming visual backbone that integrates perception, reconstruction, and action capabilities, enabling versatile and general-purpose visual understanding in continuous video streams.
Contribution
The paper introduces OmniStream, a streaming visual backbone built on causal spatiotemporal attention and pre-trained with a multi-task framework on 29 datasets, which generalizes broadly without task-specific fine-tuning.
Findings
Competitive performance across diverse tasks and datasets
Effective online processing with a frozen backbone
Potential as a backbone for general-purpose visual agents
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even…
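The frame-by-frame processing with a persistent KV-cache described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `StreamingAttention`, the single-head layout, the random projection weights, and the frame-level causal mask (tokens attend to their own frame and all earlier frames) are assumptions made for illustration, and 3D-RoPE is omitted. The sketch shows only that attending over cached keys and values, one frame at a time, gives the same result as an offline pass with a frame-causal mask.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class StreamingAttention:
    """Hypothetical single-head attention layer with a persistent KV-cache."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Random projections stand in for learned weights (assumption).
        self.wq = rng.standard_normal((dim, dim)) * scale
        self.wk = rng.standard_normal((dim, dim)) * scale
        self.wv = rng.standard_normal((dim, dim)) * scale
        self.dim = dim
        # The cache starts empty and grows by one frame of tokens per step.
        self.k_cache = np.zeros((0, dim))
        self.v_cache = np.zeros((0, dim))

    def step(self, frame_tokens):
        """Process one frame of shape (N, dim) using all cached past frames."""
        q = frame_tokens @ self.wq
        k = frame_tokens @ self.wk
        v = frame_tokens @ self.wv
        # Append this frame's keys/values; past frames are never recomputed.
        self.k_cache = np.concatenate([self.k_cache, k], axis=0)
        self.v_cache = np.concatenate([self.v_cache, v], axis=0)
        scores = q @ self.k_cache.T / np.sqrt(self.dim)
        return softmax(scores) @ self.v_cache

    def offline(self, frames):
        """Reference pass over all frames (T, N, dim) with a frame-causal mask."""
        T, N, _ = frames.shape
        x = frames.reshape(T * N, self.dim)
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        frame_id = np.repeat(np.arange(T), N)
        # A query token may see a key token only if the key's frame is not later.
        visible = frame_id[:, None] >= frame_id[None, :]
        scores = np.where(visible, q @ k.T / np.sqrt(self.dim), -np.inf)
        return (softmax(scores) @ v).reshape(T, N, self.dim)
```

Calling `step` once per incoming frame and stacking the outputs should match `offline` on the same frames up to floating-point tolerance. In this sketch the cache grows without bound; a real streaming system would need some policy for limiting it, which the text above does not specify.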
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
