Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context
Cuifeng Shen, Lumin Xu, Xingguo Zhu, Gengdai Liu

TL;DR
This paper introduces ARVAE, a novel autoregressive video autoencoder that decouples temporal and spatial information, enabling efficient, high-quality video compression and reconstruction with potential for improved video generation tasks.
Contribution
ARVAE is the first to decouple temporal and spatial features in video autoencoding, enhancing compression efficiency and reconstruction quality with lightweight models.
Findings
Achieves superior reconstruction quality with lightweight models.
Demonstrates strong potential for video generation applications.
Effective decoupling of temporal and spatial information improves performance.
Abstract
Video autoencoders compress videos into compact latent representations for efficient reconstruction, playing a vital role in enhancing the quality and efficiency of video generation. However, existing video autoencoders often entangle spatial and temporal information, limiting their ability to capture temporal consistency and leading to suboptimal performance. To address this, we propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner, allowing flexible processing of videos with arbitrary lengths. ARVAE introduces a temporal-spatial decoupled representation that combines a downsampled flow field for temporal coherence with spatial relative compensation for newly emerged content, achieving high compression efficiency without information loss. Specifically, the encoder compresses the current…
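To make the frame-by-frame idea in the abstract concrete, here is a minimal toy sketch in Python with NumPy. It is not the authors' model: ARVAE uses learned networks, whereas this sketch swaps in classical block matching for the coarse flow and an uncompressed pixel residual for the spatial compensation. The function names (`estimate_block_flow`, `warp_blocks`, `encode_video`, `decode_video`), the block size, and the search range are all assumptions made for illustration. The sketch only shows the structure: each frame is coded relative to the previous reconstruction as one motion vector per block (a downsampled flow field) plus whatever the warp cannot explain.

```python
import numpy as np


def estimate_block_flow(prev, cur, block=8, search=2):
    """Toy coarse flow: one integer (dy, dx) per block via exhaustive block matching.

    Assumes frame height and width are multiples of `block`.
    """
    h, w = cur.shape
    flow = np.zeros((h // block, w // block, 2), dtype=np.int64)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            target = cur[y0:y0 + block, x0:x0 + block]
            best = None
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    ys, xs = y0 + dy, x0 + dx
                    # Skip candidates that would read outside the previous frame.
                    if ys < 0 or xs < 0 or ys + block > h or xs + block > w:
                        continue
                    err = np.abs(prev[ys:ys + block, xs:xs + block] - target).sum()
                    if best is None or err < best:
                        best = err
                        flow[by, bx] = (dy, dx)
    return flow


def warp_blocks(prev, flow, block=8):
    """Predict the current frame by copying each block from its shifted source."""
    out = np.empty_like(prev)
    for by in range(flow.shape[0]):
        for bx in range(flow.shape[1]):
            dy, dx = flow[by, bx]
            y0, x0 = by * block, bx * block
            out[y0:y0 + block, x0:x0 + block] = prev[
                y0 + dy:y0 + dy + block, x0 + dx:x0 + dx + block
            ]
    return out


def encode_video(frames, block=8):
    """Code every frame after the first as (coarse flow, residual) w.r.t. its predecessor."""
    first = np.asarray(frames[0], dtype=np.int64)
    recon_prev = first
    codes = []
    for frame in frames[1:]:
        cur = np.asarray(frame, dtype=np.int64)
        flow = estimate_block_flow(recon_prev, cur, block)
        predicted = warp_blocks(recon_prev, flow, block)
        residual = cur - predicted  # content the warp cannot explain
        codes.append((flow, residual))
        recon_prev = predicted + residual  # what the decoder will also reconstruct
    return first, codes


def decode_video(first, codes, block=8):
    """Rebuild frames one at a time, each from the previously decoded frame."""
    frames = [first]
    for flow, residual in codes:
        frames.append(warp_blocks(frames[-1], flow, block) + residual)
    return frames
```

Because the loop consumes one frame at a time, the same functions handle a sequence of any length. In this toy the residual is stored at full precision, so decoding is exact; a learned autoencoder would instead compress both components into a latent.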
Peer Reviews
Decision: Submitted to ICLR 2026
1. The primary strength of this paper lies in its reported quantitative results. As shown in Table 1, the proposed ARVAE achieves state-of-the-art or highly competitive performance on PSNR, SSIM, and LPIPS metrics across multiple benchmarks, while claiming to use a fraction of the model parameters and training data of strong baselines like Step-Video and LTX-Video. If these results hold up under a fair comparison, they would be quite significant.
2. The autoregressive approach, which l
[Poorly Justified Motivation] The paper's motivation is weak and questionable. The authors claim that existing methods "model the video with spatial and temporal information intermingled, failing to effectively capture the temporal consistency" (Lines 41-42), and they single out "3D attention mechanisms" as the target of criticism. This is problematic for two main reasons: First, 3D attention is not a dominant component in most state-of-the-art video VAEs; 3D convolutions are far more prevalent.
1. ARVAE is motivated to address the temporal consistency in video generation, through a decoupled video representation composed of spatial and motion components. Intuitively speaking, this strategy is capable of preserving spatial details of previous frames.
2. ARVAE shows strong parameter efficiency, achieving strong reconstruction performance compared to video autoencoders with 100x larger parameter scale.
3. Ablation experiments show the effectiveness of using multi-scale propagated featu
1. Though reconstruction is an important evaluation aspect for autoencoders, it is even more important to evaluate whether the latent is good for the downstream generation process. Yet the manuscript provides limited evaluation on the generation side. First, since the representations are decoupled, it is unclear how the spatial and motion representations are used during the generation process. Second, the comparison is limited to system-level generation results, with no controlle
- Clear Motivation and Novelty: The paper addresses a long-standing challenge in video autoencoding, entangled spatial-temporal modeling, and the proposed temporal-spatial decoupling is a conceptually clean and potentially impactful approach.
- Autoregressive Design: Autoregressive frame prediction aligns well with the temporal nature of video data and allows for variable-length sequence modeling, which is an advantage over fixed-length VAEs.
- Lack of theoretical insight into key design choices: The paper proposes a decoupled latent representation that separates temporal motion from spatial content. While this design is empirically effective, the underlying rationale is not well articulated. It remains unclear why this separation leads to better generalization or compression efficiency compared to joint modeling. A deeper conceptual or theoretical discussion would strengthen the paper’s contribution.
- Efficiency claims are insuff
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies · Human Pose and Action Recognition
