Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

Xiusheng Huang; Xin Jiang; Jun Zhao; Kang Liu; Yequan Wang

arXiv:2605.16384 · cs.CV · May 19, 2026

TL;DR

TaTok is a theoretically grounded adaptive image tokenization framework that adds global tokens to model mutual information across patch tokens and uses dynamic filtering to remove redundant patch tokens.

Contribution

The paper introduces global tokens and a Dynamic Token Filtering (DTF) algorithm, addressing the information insufficiency and information redundancy of existing image tokenization methods.

Findings

01. Achieves a 1.3x gFID improvement

02. Provides an 8.7x inference speedup

03. Enables more compressed yet accurate image tokenization

Abstract

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference…
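The abstract names cumulative conditional entropy as the basis of DTF but does not give the algorithm here. The sketch below is therefore only an illustration of one plausible reading, not the authors' method. It relies on the chain rule of entropy, H(x_1, ..., x_n) = sum_i H(x_i | x_1, ..., x_{i-1}), and keeps the shortest prefix of patch tokens whose cumulative conditional entropy reaches a chosen fraction of the total. The function name `count_tokens_to_keep`, the `keep_fraction` parameter, and the assumption that per-token conditional entropy estimates are already available are all hypothetical.

```python
from itertools import accumulate


def count_tokens_to_keep(cond_entropies, keep_fraction=0.9):
    """Return how many leading tokens to keep.

    cond_entropies: hypothetical per-token estimates of H(x_i | x_<i),
        in sequence order (assumed to be given; not from the paper).
    keep_fraction: hypothetical threshold, the share of total entropy
        that the kept prefix must cover.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    total = sum(cond_entropies)
    if total <= 0:
        # No measurable information: nothing needs to be kept.
        return 0

    target = keep_fraction * total
    # Walk the running sum and stop at the first prefix that reaches the target.
    for kept, cumulative in enumerate(accumulate(cond_entropies), start=1):
        if cumulative >= target:
            return kept
    return len(cond_entropies)


if __name__ == "__main__":
    # Toy values: most information sits in the first two tokens.
    entropies = [4.0, 2.0, 1.0, 0.5, 0.5]
    print(count_tokens_to_keep(entropies, keep_fraction=0.75))
```

Under this reading, an image with low information density reaches the threshold after few tokens and is compressed more aggressively, while a dense image keeps more tokens, which matches the adaptive behaviour the abstract describes.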
