InfoTok: Adaptive Discrete Video Tokenizer via Information-Theoretic Compression
Haotian Ye, Qiyuan He, Jiaqi Han, Puheng Li, Jiaojiao Fan, Zekun Hao, Fitsum Reda, Yogesh Balaji, Huayu Chen, Sheng Liu, Angela Yao, James Zou, Stefano Ermon, Haoxiang Wang, Ming-Yu Liu

TL;DR
InfoTok is an adaptive, information-theoretic video tokenizer that allocates tokens according to content complexity, outperforming fixed-rate methods and saving tokens without sacrificing accuracy.
Contribution
The paper presents a novel ELBO-based algorithm for adaptive video tokenization grounded in Shannon's information theory, improving compression efficiency and accuracy.
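For readers unfamiliar with the term, the generic evidence lower bound from variational inference is sketched below. The notation is an assumption made here for illustration ($x$ for a video, $z$ for its latent tokens, $q_\phi$ for the encoder, $p_\theta$ for the decoder, $p(z)$ for the prior). The paper's exact objective is not reproduced on this page and may differ.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
\;-\; \mathrm{KL}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The first term rewards accurate reconstruction, and the KL term penalizes the information carried by the latent code, which is why an ELBO is a natural starting point for reasoning about representation length.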
Findings
Achieves 2.3x compression rates over prior methods.
Saves 20% tokens without performance loss.
Provides a theoretical framework for adaptive tokenization.
Abstract
Accurate and efficient discrete video tokenization is essential for processing long video sequences. Yet the inherent complexity and variable information density of videos present a significant bottleneck for current tokenizers, which rigidly compress all content at a fixed rate, leading to redundancy or information loss. Drawing inspiration from Shannon's information theory, this paper introduces InfoTok, a principled framework for adaptive video tokenization. We rigorously prove that existing data-agnostic training methods are suboptimal in representation length, and present a novel evidence lower bound (ELBO)-based algorithm that approaches theoretical optimality. Leveraging this framework, we develop a transformer-based adaptive compressor that enables adaptive tokenization. Empirical results demonstrate state-of-the-art compression performance, saving 20% tokens without influence…
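The information-theoretic intuition behind the fixed-rate criticism can be illustrated with a toy source-coding example. This is a minimal sketch, not InfoTok itself: the four-symbol distribution and all function names are hypothetical, and the example only shows the classical fact that a fixed-length code spends more bits on average than a code whose lengths follow $-\log_2 p$.

```python
import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def fixed_length_bits(num_symbols):
    """Bits per symbol when every symbol gets the same code length."""
    return math.ceil(math.log2(num_symbols))

def shannon_code_lengths(probs):
    """Per-symbol lengths ceil(-log2 p); assumes every p is strictly positive."""
    return [math.ceil(-math.log2(p)) for p in probs]

def expected_length(probs, lengths):
    """Average code length under the given distribution."""
    return sum(p * n for p, n in zip(probs, lengths))

# Hypothetical skewed source: frequent "simple" content, rare "complex" content.
probs = [0.5, 0.25, 0.125, 0.125]

fixed = fixed_length_bits(len(probs))
adaptive = expected_length(probs, shannon_code_lengths(probs))
h = entropy_bits(probs)

print(f"fixed-length: {fixed} bits, adaptive: {adaptive} bits, entropy: {h} bits")
```

For this dyadic distribution the adaptive code meets the entropy exactly, while the fixed-length code pays a constant overhead regardless of content. The analogy to video is loose: a tokenizer that assigns the same number of tokens to every clip plays the role of the fixed-length code.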
Peer Reviews
Decision: ICLR 2026 Oral
- This paper clearly describes the proposed algorithm and is easy to follow.
- This paper provides the proofs for the theorems.
- The proposed algorithm achieves good performance on various benchmark tests.
- It would be helpful if the paper included a discussion of the limitations of the proposed approach.
- Since the method seems general enough to be applied to images, it would be helpful to explain the rationale for limiting the experiments to videos.
- Typo, L184: "an more accurate" -> "a more accurate"
* The paper is well structured, and the mathematical formulation is precise.
* It achieves competitive or superior performance using a lightweight and interpretable mechanism.
* The authors provide rigorous justifications connecting ELBO with optimal token length, and the derivation is insightful.
* The ELBO-based routing and token selection reuse existing encoder-decoder structures and introduce minimal inference cost, which is appealing in practical deployments.
* Is the per-token ELBO computed purely based on the encoder-decoder's end-to-end reconstruction path? Must this be explicitly introduced during training, or can the method be plugged into any VAE-style model without changes? Specifically, can non-VAE tokenizers be adapted to this framework?
* If N_max is small or the compression ratio is extremely low (e.g., β < 0.1), what are the observed effects on stability and convergence? Would the KL term dominate or explode in such settings, and does it i
1) This paper provides some much-needed theoretical grounding to the adaptive tokenizer space, and attempts to pin down what "optimal" adaptive compression should look like.
2) Experiments are very comprehensive and clearly demonstrate that InfoTok achieves better reconstruction at the same compression rate compared to other adaptive tokenizers.
3) While the results are incremental on their own, the entire paper is well structured and motivated, and properly places all the contributions in con
1) Key definitions are not well highlighted in the text, and the mathematical sections are poorly written and hard to follow. Since definitions are missing from the theorems, the explanation is rather hard to follow. For example, the entropy H is not clearly defined, and \mathbb{D} is not defined. Similarly, r(N | x) is not defined in the input of Alg. 1. While I may have missed these definitions, I would suggest not burying them in the text and making key components clear and easier for readers to refer to.
2) I
Taxonomy
Topics: Video Coding and Compression Technologies · Video Analysis and Summarization · Digital Media Forensic Detection
