On the Role of Discrete Tokenization in Visual Representation Learning

Tianqi Du; Yifei Wang; Yisen Wang

arXiv:2407.09087·cs.LG·July 15, 2024·1 cites

TL;DR

This paper investigates how discrete tokenization affects masked image modeling (MIM). It provides a theoretical analysis, introduces a metric called token-class alignment similarity (TCAS), and proposes ClusterMIM, a tokenizer and MIM method reported to outperform existing approaches.

Contribution

It offers a theoretical analysis of discrete tokenization in MIM, introduces the TCAS metric, and presents ClusterMIM, a new method with improved performance on benchmarks.

Findings

01. Discrete tokenization enhances generalization in MIM.

02. The TCAS metric effectively evaluates tokenization quality.

03. ClusterMIM achieves superior results on benchmark datasets.

Abstract

In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding of how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an…
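To make the setup in the abstract concrete, here is a minimal sketch of how discrete tokens can serve as MIM reconstruction targets. It is not the paper's implementation: the helper names (`patchify`, `kmeans`, `tokenize`, `mim_targets`), the use of plain k-means on raw pixel patches as the tokenizer, the patch size, the number of clusters, and the 0.75 mask ratio are all assumptions chosen for illustration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def tokenize(x, centroids):
    """Map each patch to the index of its nearest centroid (its discrete token)."""
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; stands in for the tokenizer (an assumption)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = tokenize(x, centroids)
        for j in range(k):
            members = x[ids == j]
            if len(members):  # leave empty clusters where they are
                centroids[j] = members.mean(axis=0)
    return centroids

def mim_targets(token_ids, mask_ratio=0.75, seed=0):
    """Pick random patches to mask; their token ids are the prediction targets."""
    rng = np.random.default_rng(seed)
    n = len(token_ids)
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=int(round(n * mask_ratio)), replace=False)] = True
    return mask, token_ids[mask]

# Toy usage on a random 32x32 RGB "image" (hypothetical data).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = patchify(image, patch=8)      # 16 patches, 192 values each
centroids = kmeans(patches, k=4)        # codebook of 4 discrete tokens
ids = tokenize(patches, centroids)      # one token id per patch
mask, targets = mim_targets(ids, 0.75)  # 12 masked positions and their ids
```

In a real MIM pipeline, an encoder would see only the patches where `mask` is false and would be trained with a classification loss to predict `targets` at the masked positions. The codebook would also be fitted on a whole dataset rather than on a single image.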

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 8 (accept, good paper) · Confidence 3

Strengths

- The main problem, "What is the role of tokenization in MIM? How does it affect downstream performance?", is interesting and necessary.
- The proposed token-class alignment similarity (TCAS) is a cheap but effective metric to measure the performance of tokenization roles.
- This paper theoretically and empirically demonstrates the superiority of class-wise tokenization.
- The paper is well-organized and easy to follow.

Weaknesses

- It would be better to show more ablation results about clustering numbers in Table 4.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

- Theoretical analysis on a discrete tokenization method looks novel and interesting.
- A metric for tokenization (TCAS) would help a lot of researchers to investigate MIM.

Weaknesses

- I think discrete tokenization is not mainstream in MIM. Representative methods, such as MAE, MaskFeat, and data2vec, demonstrate impressive performance without discrete tokens. Thus, the paper's contribution does not readily cover the diverse variants of MIM.
- According to Theorem 1, image classification training could be the best way to minimize the downstream error bound. But, in practice, MIM works better than classification training in a lot of cases. Thus, I doubt the general applicability of

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The authors present an intriguing tokenization approach using graph representation, and the paper is both technically robust and clearly articulated.

Weaknesses

The theoretical analysis considers only two classes. I wonder if it can be extended to multiple classes.

Code & Models

Repositories

pku-ml/clustermim (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection · Image Retrieval and Classification Techniques

Methods: Mutual Information Machine/Mask Image Modeling · Contrastive Learning