Autoregressive Video Generation without Vector Quantization
Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang

TL;DR
This paper introduces NOVA, a non-quantized autoregressive video generation model that achieves high efficiency, superior quality, and versatility, outperforming previous models and diffusion methods with fewer parameters and lower training costs.
Contribution
The paper presents NOVA, a novel autoregressive video model that avoids vector quantization, combining causal and bidirectional modeling for improved efficiency and performance.
Findings
NOVA surpasses prior autoregressive models in data efficiency and visual quality.
NOVA outperforms state-of-the-art diffusion models in text-to-image tasks.
NOVA generalizes well across longer videos and zero-shot applications.
Abstract
This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity,…
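The attention pattern the abstract describes (causal across frames, bidirectional within a frame) can be illustrated with a block-wise causal mask. The following is a minimal sketch under stated assumptions, not NOVA's actual implementation: the function name `block_causal_mask`, its parameters, and the flattening of a video into one token sequence are all hypothetical, and the sketch does not model the set-by-set spatial prediction or any quantization-free loss.

```python
from typing import List


def block_causal_mask(num_frames: int, tokens_per_frame: int) -> List[List[bool]]:
    """Build a hypothetical frame-level causal attention mask.

    mask[q][k] is True when query token q may attend to key token k.
    Tokens are assumed to be laid out frame by frame in one flat sequence.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        # A query sees every token in its own frame (bidirectional within
        # the frame) and in all earlier frames (causal across frames).
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 3 frames of 2 tokens each, one row per query token.
    for row in block_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("1" if allowed else "0" for allowed in row))
```

With 3 frames of 2 tokens, the two tokens of the first frame see only each other, while the tokens of the last frame see the whole sequence. Because no frame attends to later frames, a layout like this is in principle compatible with caching keys and values of already generated frames.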
Peer Reviews
Decision: ICLR 2025 Poster
- As a follower of MAR (Tianhong Li et al. (2024b)), this paper for the first time lifts the non-quantized AR model to video generation. In contrast to trivially modifying the 2D non-quantized MAR to a 3D version, they design the autoregressive modeling sequentially that integrates first temporal frame-by-frame prediction and then spatial set-by-set within each frame. This facilitates the model's ability of video extrapolation and potential compatibility with kv-cache acceleration. - The model
- Unclear training/inference details. 1. According to Figure 1, at training time the model predicts a set of masked tokens of the 2nd frame. At inference time, the model progressively reduces the mask ratio from 1.0 to 0. However, since the 1st and 2nd frames have already been generated (as the given conditional frames) in the Figure 1 example, the model should progressively unmask the 3rd frame. There seems to be some inconsistency between training and inference. In other words, for the example in Fi
* NOVA achieves state-of-the-art (SOTA) results in text-to-image (T2I) tasks. * NOVA shows much faster inference speeds than previous video generative models. * The method of combining temporal autoregressive and spatial bidirectional modeling is simple yet effective. * The Scaling and Shift Layer is also simple but effective. Also, the analysis of the layer is comprehensive.
* While NOVA achieves SOTA in T2I, this aspect feels like a straightforward extension of MAR[1] rather than a novel contribution. * For text-to-video (T2V), NOVA uses relatively less data and fewer parameters and has fast inference speeds but falls short in performance. Therefore, it needs further testing about scalability (i.e., if NOVA can match the performance of the open-source models in the main table when scaled up.). * There is a question about whether extrapolation is truly unique to au
1) NOVA's framework is well-structured, combining temporal and spatial autoregressive modeling. This dual approach not only enhances the model's efficiency but also its ability to handle multiple generative tasks within a single model, showcasing the potential for in-context learning. 2) The authors provide a thorough evaluation of NOVA, comparing it with state-of-the-art models across various metrics. The results demonstrate that NOVA not only matches but often surpasses the performance of dif
I think the key limitation of this work is its novelty: it seems like an extension of MAR to the video generation task. In Table 3, I don't see improvements of the proposed method over previous diffusion-based methods. Are AR-based methods really needed for video generation tasks? Could the authors clarify this?
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Optical Imaging Technologies · Computer Graphics and Visualization Techniques
Methods: Diffusion
