Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice
Xiusheng Huang, Xin Jiang, Jun Zhao, Kang Liu, Yequan Wang

TL;DR
TaTok is a theoretically grounded adaptive image tokenization framework. It uses global tokens to model mutual information across patch tokens and a dynamic filtering algorithm to remove redundant patch tokens.
Contribution
The paper introduces global tokens and a Dynamic Token Filtering (DTF) algorithm to address two issues it identifies in existing tokenizers: information insufficiency when images are reconstructed from patch tokens alone, and information redundancy among patch tokens.
Findings
Achieves 1.3x gFID improvement
Provides 8.7x inference speedup
Enables more compressed yet accurate image tokenization
Abstract
Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference…
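The abstract names cumulative conditional entropy as the basis for DTF but does not spell out the procedure. As a rough illustration of the general idea only, and not the paper's algorithm, the sketch below keeps the shortest prefix of tokens whose cumulative conditional entropy reaches a chosen fraction of the total. The function name, the `keep_fraction` parameter, and the assumption that per-token conditional entropy estimates are already available are all hypothetical.

```python
from typing import Sequence


def count_tokens_to_keep(cond_entropies: Sequence[float],
                         keep_fraction: float = 0.75) -> int:
    """Return how many leading tokens to keep.

    cond_entropies[i] is a (hypothetical) estimate of H(x_i | x_<i), the
    entropy of token i given the tokens before it. By the chain rule, the
    sum of these values is the joint entropy of the whole sequence.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropies)
    if total <= 0.0:
        # No measurable information: nothing needs to be kept.
        return 0

    target = keep_fraction * total
    running = 0.0
    for kept, h in enumerate(cond_entropies, start=1):
        running += h
        # Stop as soon as the kept prefix carries enough information.
        if running >= target:
            return kept
    return len(cond_entropies)


if __name__ == "__main__":
    # Illustrative values only; not taken from the paper.
    entropies = [4.0, 3.0, 2.0, 1.0]
    print(count_tokens_to_keep(entropies, keep_fraction=0.75))
```

Under this assumption, an image whose information is concentrated in a few tokens is cut short early, while a denser image keeps more tokens, which is the adaptive behavior the abstract describes at a high level.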
