LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior
Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava

TL;DR
LARP introduces a holistic video tokenization method with learned queries and an autoregressive prior, enabling more global representations and improved video generation performance.
Contribution
It proposes a novel holistic tokenization scheme with learned queries and integrates an autoregressive prior for better video modeling.
Findings
Achieves state-of-the-art FVD on the UCF101 benchmark.
Supports flexible, adaptive tokenization for various tasks.
Enhances AR models' compatibility with video data.
Abstract
We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its…
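The holistic tokenization idea in the abstract can be illustrated with a small toy, offered as a rough sketch rather than the paper's architecture. It assumes single-head cross-attention without projections, random vectors standing in for learned queries and patch embeddings, and nearest-neighbour lookup in a random codebook; the names `holistic_tokenize`, `random_vectors`, `DIM`, and all sizes are hypothetical. The point it shows is that the number of discrete tokens is set by the number of queries, not by the number of patches.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def holistic_tokenize(patches, queries, codebook):
    """Map patch embeddings to len(queries) discrete token ids.

    Each query attends over ALL patches (toy cross-attention), and the
    pooled vector is snapped to its nearest codebook entry. This is an
    illustrative assumption, not the LARP implementation.
    """
    dim = len(queries[0])
    tokens = []
    for q in queries:
        # Attention weights of this query over every patch.
        weights = softmax([dot(q, p) / math.sqrt(dim) for p in patches])
        # Weighted average of patches: a global summary for this query.
        pooled = [sum(w * p[d] for w, p in zip(weights, patches))
                  for d in range(dim)]
        # Vector quantization: index of the closest codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(pooled, c))
                 for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def random_vectors(rng, n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8                                   # hypothetical embedding size
patches = random_vectors(rng, 64, DIM)    # stand-in for 64 video patches
codebook = random_vectors(rng, 16, DIM)   # stand-in for a 16-entry codebook
queries = random_vectors(rng, 12, DIM)    # stand-in for learned queries

tokens_12 = holistic_tokenize(patches, queries, codebook)
tokens_4 = holistic_tokenize(patches, queries[:4], codebook)
print(len(tokens_12), len(tokens_4))
```

In this toy, using fewer queries simply yields a shorter token sequence for the same 64 patches, which is the sense in which a query-based tokenizer can support different token budgets. A patchwise tokenizer would instead emit one token per patch.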
Peer Reviews
Decision: ICLR 2025 Oral
1. This paper introduces a novel method for video tokenization that moves beyond traditional patchwise encoding, making it both interesting and innovative.
2. The paper provides a comprehensive set of experiments, including video reconstruction, class-conditional video generation, and video frame prediction, effectively demonstrating LARP's capabilities across different tasks.
3. LARP achieves state-of-the-art FVD scores on benchmarks like UCF101, indicating that it is a competitive approach i

1. The overall computational cost and complexity of training LARP, especially with the AR prior model, may be a concern.
2. It remains unclear how stable the training process is over longer periods or under different training regimes.
3. The paper may not sufficiently discuss scenarios in which LARP underperforms or fails, which is essential for understanding the model’s limitations.
* The proposed method addresses an interesting and important topic with great potential for multimodal LLMs - how to tokenize videos into a sequence of tokens more suitable for LLM learning.
* The introduced AR prior model is a simple yet effective method to produce tokens more friendly for autoregressive generation.
* Experiment results demonstrate the effectiveness of the proposed method in reducing the gap between generation quality and reconstruction quality, highlighting better learnability

* While results have been presented on class-conditional generation and frame prediction tasks, the benefit or penalty from using the proposed method on other tasks such as video editing and stylization remains unclear.
* While not being applied to videos before, the holistic token approach appears similar to [1] on images. The differences other than the AR prior part need to be clarified.

[1] Yu et al. An Image is Worth 32 Tokens for Reconstruction and Generation. arXiv:2406.07550
- This paper presented an inspiring message that properly considering the downstream video generation tasks when training video tokenizers can result in much better generation performance.
- Extensive experiments and ablations demonstrated the effectiveness of incorporating the autoregressive prior to tokenizer training.

- There are several typos in the paper:
  - line 270: “back to the continues” → “back to the continuous”
  - line 459: “hilighting” → “highlighting”
- More ablation studies can be conducted to analyze the effectiveness of holistic video tokenization and autoregressive generation further, as detailed in the Questions section.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques · Human Motion and Animation
Methods: ALIGN · Sparse Evolutionary Training
