What Makes for Good Tokenizers in Vision Transformer?

Shengju Qian; Yi Zhu; Wenbo Li; Mu Li; Jiaya Jia

arXiv:2212.11115·cs.CV·December 22, 2022·1 cites

Open Access

TL;DR

This paper investigates the design of tokenizers in vision transformers from an information trade-off perspective, proposing new strategies that improve performance with minimal overhead.

Contribution

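The paper introduces Modulation across Tokens (MoTo), which adds inter-token modeling through normalization, and the TokenProp regularization objective. It also offers a unified, information trade-off view of existing tokenizer designs and derives improved design strategies for vision tokenizers.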

Findings

01. Enhanced transformer performance with MoTo and TokenProp
02. Inter-token modeling improves information extraction
03. Design choices in tokenizers significantly impact vision transformer effectiveness

Abstract

Transformers, which have recently seen booming applications in vision tasks, depart from the widespread convolutional paradigm. Relying on a tokenization process that splits inputs into multiple tokens, transformers extract pairwise relationships among those tokens using self-attention. Although the tokenizer is the stem building block of transformers, what makes a good tokenizer has not been well understood in computer vision. In this work, we investigate this uncharted problem from an information trade-off perspective. In addition to unifying and explaining existing structural modifications, our derivation leads to better design strategies for vision tokenizers. The proposed Modulation across Tokens (MoTo) incorporates inter-token modeling capability through normalization. Furthermore, a regularization objective, TokenProp, is embraced in the standard…
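
As a rough illustration of the baseline process the abstract refers to (splitting an input into tokens and relating them pairwise with self-attention), here is a minimal sketch. It is not the paper's method: MoTo and TokenProp are not implemented, the names `patchify` and `attention_weights` are hypothetical, and the learned query/key projections of a real transformer are omitted, so each token is compared with the others directly.

```python
import math


def patchify(image, patch):
    """Split an H x W grid (list of rows) into non-overlapping patch x patch tokens."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Flatten one patch row by row into a single token vector.
            tokens.append([image[top + i][left + j]
                           for i in range(patch) for j in range(patch)])
    return tokens


def attention_weights(tokens):
    """Softmax over scaled dot products between every pair of tokens.

    Assumption: no learned projections, so queries and keys are the raw tokens.
    """
    dim = len(tokens[0])
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy 4 x 4 single-channel "image" split into four 2 x 2 patches.
image = [[(r * 4 + c) / 16 for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
weights = attention_weights(tokens)
```

Each row of `weights` is a probability distribution describing how strongly one token attends to every token, itself included. Any tokenizer design changes what goes into `tokens` before this step.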

Taxonomy

Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection