Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space
Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang

TL;DR
AdapTok is a novel adaptive video tokenizer that dynamically allocates tokens based on content and temporal cues, improving efficiency and quality in video reconstruction and generation.
Contribution
We introduce AdapTok, a flexible, content-aware, and temporally adaptive video tokenization method with a novel block-wise masking and token allocation strategy.
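As a rough illustration of the block-wise masking named here, which the abstract describes as randomly dropping the tail tokens of each block during training, the following Python sketch builds one keep-mask per block. The function name `tail_drop_mask`, the `min_keep` parameter, and the uniform sampling of the kept length are assumptions made for illustration; the source does not specify the sampling distribution.

```python
import random


def tail_drop_mask(num_blocks, tokens_per_block, min_keep=1, rng=None):
    """Build a keep-mask per block: 1 = token kept, 0 = token dropped.

    Each block keeps a random-length prefix of its tokens and drops the tail.
    Uniform sampling of the prefix length is an assumption, not the paper's spec.
    """
    rng = rng or random.Random()
    mask = []
    for _ in range(num_blocks):
        # randint is inclusive on both ends, so at least min_keep tokens survive.
        keep = rng.randint(min_keep, tokens_per_block)
        mask.append([1] * keep + [0] * (tokens_per_block - keep))
    return mask


if __name__ == "__main__":
    for row in tail_drop_mask(num_blocks=4, tokens_per_block=8, rng=random.Random(0)):
        print(row)
```

Because only tails are dropped, every kept set is a prefix, so a model trained this way can later be truncated to any prefix length per block.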
Findings
Improves video reconstruction quality across different token budgets.
Enhances video generation performance without extra image data.
Demonstrates effectiveness on UCF-101 and Kinetics-600 datasets.
Abstract
We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation…
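The inference-time allocation described in the abstract can be illustrated with a toy version of the budgeted selection problem. The sketch below assumes a formulation in which each block picks exactly one token count from a small candidate set, the objective is the sum of predicted scores, and the total token count must not exceed the budget. This formulation, the name `allocate_tokens`, and the example scores are hypothetical. The toy problem is solved exactly with dynamic programming rather than with an ILP solver, so this is not the paper's implementation.

```python
def allocate_tokens(scores, options, budget):
    """Pick one token count per block to maximize the summed predicted score.

    scores[b][k] is a (hypothetical) predicted quality of block b when it is
    given options[k] tokens. The total number of tokens must be <= budget.
    """
    neg_inf = float("-inf")
    # best[t]: highest total score over processed blocks using exactly t tokens.
    best = [neg_inf] * (budget + 1)
    best[0] = 0.0
    picks = []  # picks[b][t]: option index chosen for block b to reach total t
    for row in scores:
        new_best = [neg_inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == neg_inf:
                continue
            for k, n_tokens in enumerate(options):
                total = used + n_tokens
                if total <= budget and best[used] + row[k] > new_best[total]:
                    new_best[total] = best[used] + row[k]
                    pick[total] = k
        best = new_best
        picks.append(pick)

    end = max(range(budget + 1), key=lambda t: best[t])
    if best[end] == neg_inf:
        raise ValueError("budget is too small to give every block its smallest option")

    # Walk backwards through the recorded choices to recover the allocation.
    allocation = []
    for pick in reversed(picks):
        k = pick[end]
        allocation.append(options[k])
        end -= options[k]
    return allocation[::-1]


if __name__ == "__main__":
    options = [2, 4, 8]  # candidate token counts per block
    scores = [
        [0.50, 0.60, 0.62],  # near-static block: extra tokens help little
        [0.20, 0.50, 0.90],  # high-motion block: extra tokens help a lot
        [0.40, 0.45, 0.50],
    ]
    print(allocate_tokens(scores, options, budget=12))  # [2, 8, 2]
```

In this example the budget is spent on the block whose predicted score improves most with more tokens, which is the content-aware behaviour the abstract describes.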
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- This work is well motivated: it aims to model videos with a variable number of tokens depending on content complexity.
- The proposed tokenizer is well designed to dynamically drop tokens at the block level. The generator can also produce variable-length latents with <EOB> tags.
- Table 1 doesn't include the model size (params) of the tokenizers. Table 2 doesn't indicate which token number configuration from Table 1 is used (I'd assume it's the 2048-token one). Some of the most recent methods are not included, e.g. [1]. It's OK if the final performance doesn't surpass the previous SOTA, as long as it reaches comparable performance with more efficient models/tokens etc., as highlighted by the motivation of this paper (e.g. average generation tokens across all categories, minim
1. The paper is well organized. 2. Encoding different frames within the same video using varying numbers of tokens is a topic worth exploring in the research community, and this paper targets this important problem. 3. The paper provides many figures that help reviewers better understand the proposed method.
There are some concerns and questions about this paper: 1. The authors mention in the introduction that VAEs need to possess the characteristic of temporal causality, but a significant drawback of this characteristic is error accumulation. How do the authors address this problem? 2. I also have some questions about the 1D latent token space characteristic. It's well known that a 1D latent space is very suitable for sequences like speech, but for non-1D input signals such as images and videos,
1. Clear and practically relevant problem formulation: Video data exhibits substantial spatiotemporal redundancy, and fixed-length tokenization is inefficient. AdapTok is the first to achieve globally optimal adaptive token allocation within a causal, 1D latent space, aligning well with real-world demands for efficient video modeling. 2. Novel and cohesive technical design: The integration of block-wise tail dropping during training, a block-causal scorer, and ILP-based inference (IPAL) forms a
1. Lack of fair comparison with non-causal adaptive methods: Without comparing against high-performing non-causal tokenizers (e.g., MAGVIT-v2) under the same token budget, it remains unclear whether the causal constraint incurs a performance penalty. 2. Limited novelty in sampling strategy: The block-wise tail-dropping mechanism appears conceptually similar to prior works such as DC-AE 1.5 and FlexTok, which somewhat weakens the claimed technical novelty. 3. Scalability concerns regarding ILP:
1. The integration of a block causal scorer and ILP-based IPAL strategy addresses the limitation of fixed token budgets in prior work (e.g., ElasticTok), enabling global optimization of token usage across samples and temporal dynamics. This is supported by strong empirical results showing Pareto optimality between performance and token count. 2. By enforcing causal attention across blocks and decoupling token allocation from spatial structure, AdapTok supports online streaming processing and av
The manuscript exhibits several critical limitations in its methodology, theoretical justification, and experimental rigor, which undermine the validity of its claims and the interpretability of its contributions. Below is a detailed expansion of these weaknesses: 1. Oversight of MLLM Video Tokenization Literature. The authors position AdapTok as addressing a "gap" in efficient video tokenization, yet they overlook a rich body of work on video tokenization in multimodal large language models (M
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Advanced Image Processing Techniques
