VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

Yifei Yu; Xiaoshan Wu; Xinting Hu; Tao Hu; Yangtian Sun; Xiaoyang Lyu; Bo Wang; Lin Ma; Yuewen Ma; Zhongrui Wang; Xiaojuan Qi

arXiv:2512.04519 · cs.CV · December 5, 2025

TL;DR

VideoSSM introduces a hybrid memory model that combines a state-space model with a local context window to enable coherent, interactive, long-duration video generation with improved temporal consistency and motion stability.

Contribution

The paper presents a novel hybrid state-space memory architecture that unifies autoregressive diffusion with global and local memory for scalable long video synthesis.

Findings

01. Achieves state-of-the-art temporal consistency on long videos

02. Supports prompt-adaptive interaction and content diversity

03. Scales linearly with sequence length for efficient generation

Abstract

Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short-…
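The split between a global state-space memory and a local context window can be illustrated with a toy sketch. This is not the VideoSSM architecture: the class name `HybridMemory`, the scalar `decay` coefficient, and the use of plain float lists as stand-ins for frame features are all assumptions made for illustration. The sketch only shows the general pattern of a fixed-size recurrent state updated once per frame alongside a bounded buffer of recent frames.

```python
from collections import deque
from typing import Deque, List, Tuple


class HybridMemory:
    """Toy hybrid memory (hypothetical; not the paper's implementation)."""

    def __init__(self, dim: int, window: int, decay: float = 0.9) -> None:
        self.dim = dim
        self.decay = decay  # assumed scalar state-transition coefficient
        # Global memory: a fixed-size state, independent of video length.
        self.state: List[float] = [0.0] * dim
        # Local memory: only the most recent `window` frames are kept.
        self.window: Deque[List[float]] = deque(maxlen=window)

    def update(self, frame: List[float]) -> None:
        if len(frame) != self.dim:
            raise ValueError("frame feature size does not match dim")
        # Diagonal linear recurrence: h_t = a * h_{t-1} + (1 - a) * x_t
        self.state = [
            self.decay * h + (1.0 - self.decay) * x
            for h, x in zip(self.state, frame)
        ]
        # Appending to a full deque silently drops the oldest frame.
        self.window.append(list(frame))

    def context(self) -> Tuple[List[float], List[List[float]]]:
        # Return (global summary, recent frames) for conditioning the next step.
        return list(self.state), [list(f) for f in self.window]


if __name__ == "__main__":
    mem = HybridMemory(dim=2, window=4)
    for i in range(100):
        mem.update([float(i), 1.0])
    state, recent = mem.context()
    print(len(state), len(recent))
```

In this toy, each `update` call does work proportional to `dim` and stores at most `window` frames, regardless of how many frames came before, so total cost grows linearly with the number of frames. That is the same scaling behaviour the abstract attributes to the hybrid design, though the actual model's state update is not specified in the text above.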

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging