What Makes for Good Tokenizers in Vision Transformer?
Shengju Qian, Yi Zhu, Wenbo Li, Mu Li, Jiaya Jia

TL;DR
This paper investigates the design of tokenizers in vision transformers from an information trade-off perspective, proposing new strategies that improve performance with minimal overhead.
Contribution
The paper introduces Modulation across Tokens (MoTo) and the TokenProp regularization objective, and offers a unified understanding of existing tokenizer modifications along with improved design strategies for vision tokenizers.
Findings
Enhanced transformer performance with MoTo and TokenProp
Inter-token modeling improves information extraction
Design choices in tokenizers significantly impact vision transformer effectiveness
Abstract
The architecture of transformers, which have recently seen booming applications in vision tasks, has pivoted away from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers are capable of extracting the pairwise relationships among those tokens using self-attention. Although the tokenizer is the foundational building block of transformers, what makes for a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and understanding existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is embraced in the standard…
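The tokenization and pairwise-attention steps mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a generic patch tokenizer, not the paper's MoTo or TokenProp; the function names, the 8-pixel patch size, the 64-dimensional embedding, and the use of identity query/key projections are all assumptions chosen for illustration.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def self_attention_weights(tokens):
    """Pairwise attention weights between tokens.

    Uses the tokens themselves as queries and keys (an assumption made
    for brevity; real transformers learn separate projections).
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))

# Hypothetical linear embedding standing in for a learned tokenizer.
projection = rng.standard_normal((8 * 8 * 3, 64)) * 0.02

tokens = patchify(image, 8) @ projection   # one token per 8x8 patch
attn = self_attention_weights(tokens)      # token-to-token weights
```

Here each token is computed from its own patch alone, so any relationship between tokens appears only later, in the attention weights. The abstract's point about inter-token modeling concerns adding such interaction at the tokenizer stage itself.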
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
