On the Role of Discrete Tokenization in Visual Representation Learning
Tianqi Du, Yifei Wang, Yisen Wang

TL;DR
This paper investigates the impact of discrete tokenization in masked image modeling (MIM). It provides theoretical insights, introduces a new metric, token-class alignment similarity (TCAS), and proposes a novel tokenizer and MIM method called ClusterMIM that outperforms existing approaches.
Contribution
It offers a theoretical analysis of discrete tokenization in MIM, introduces the TCAS metric, and presents ClusterMIM, a new method with improved performance on benchmarks.
Findings
Discrete tokenization enhances generalization in MIM.
The TCAS metric effectively evaluates tokenization quality.
ClusterMIM achieves superior results on benchmark datasets.
Abstract
In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding on how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an…
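As a rough illustration of the setup the abstract describes (discrete tokens as the reconstruction target for masked patches), here is a minimal sketch in Python with NumPy. It is not the authors' ClusterMIM implementation: the k-means tokenizer over raw pixel patches, the function names (patchify, tokenize, kmeans, masked_token_loss), and all sizes are assumptions chosen for illustration only.

```python
import numpy as np


def patchify(images, patch):
    """Split (N, H, W) images into (N, num_patches, patch*patch) flat patches."""
    n, h, w = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch].reshape(n, gh, patch, gw, patch)
    # Bring the two grid axes together, then flatten each patch.
    return x.transpose(0, 1, 3, 2, 4).reshape(n, gh * gw, patch * patch)


def tokenize(x, centers):
    """Map each row of x (M, D) to the index of its nearest center (K, D)."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def kmeans(x, k, iters=10, seed=0):
    """Plain k-means; the K centers act as a hypothetical discrete codebook."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = tokenize(x, centers)
        for j in range(k):
            members = x[ids == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def masked_token_loss(logits, token_ids, mask):
    """Mean cross-entropy over masked positions only.

    logits: (P, K) predicted scores, token_ids: (P,) targets, mask: (P,) bool.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return nll[mask].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.random((8, 16, 16))              # toy grayscale images
    patches = patchify(images, patch=4)           # (8, 16, 16)
    codebook = kmeans(patches.reshape(-1, 16), k=5)
    targets = tokenize(patches[0], codebook)      # token id per patch
    mask = rng.random(len(targets)) < 0.75        # positions to predict
    mask[0] = True                                # ensure at least one masked patch
    logits = np.zeros((len(targets), 5))          # stand-in for an encoder's output
    print(masked_token_loss(logits, targets, mask))
```

In this sketch the number of clusters k plays the role of the vocabulary size, and the loss is computed only at masked positions. With all-zero logits the prediction is uniform, so the loss equals log(k) regardless of the targets, which is a convenient sanity check before plugging in a real model.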
Peer Reviews
Decision: ICLR 2024 spotlight
- The main problem, "What is the role of tokenization in MIM? How does it affect downstream performance?", is interesting and necessary.
- The proposed token-class alignment similarity (TCAS) is a cheap but effective metric to measure the performance of tokenization roles.
- This paper theoretically and empirically demonstrates the superiority of class-wise tokenization.
- The paper is well-organized and easy to follow.
- It would be better to show more ablation results about clustering numbers in Table 4.
- Theoretical analysis on a discrete tokenization method looks novel and interesting.
- A metric for tokenization (TCAS) would help a lot of researchers to investigate MIM.
- I think using discrete tokenization is not mainstream in MIM. Representative methods, such as MAE, MaskFeat, and data2vec, demonstrate impressive performance without discrete tokens. Thus, the contribution of the paper hardly covers the diverse variants of MIM.
- According to Theorem 1, image classification training could be the best way to reduce the downstream error bound. But in practice, MIM works better than classification training in a lot of cases. Thus, I doubt the general applicability of
- The authors present an intriguing tokenization approach using graph representation, and the paper is both technically robust and clearly articulated.
- The theoretical analysis only considers two classes. I wonder if this can be extended to multiple classes.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques
Methods: Mutual Information Machine/Mask Image Modeling · Contrastive Learning
