Multi-Token Enhancing for Vision Representation Learning

Zhong-Yu Li; Yu-Song Hu; Bo-Wen Yin; Ming-Ming Cheng

arXiv:2411.15787 · cs.CV · November 26, 2024

TL;DR

This paper introduces Multi-Token Enhancing (MTE), a method that extracts multiple auxiliary tokens during training to improve vision representation learning without increasing inference costs, by distilling their knowledge into a global token.

Contribution

MTE enhances self-supervised vision models by extracting auxiliary tokens during training and distilling their knowledge into the global token, so inference cost does not increase.

Findings

01 Consistently improves performance across downstream tasks.

02 Compatible with various self-supervised loss functions and architectures.

03 Adds no inference cost, because the auxiliary tokens' knowledge is distilled into the global token.

Abstract

Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning, which requires large-scale datasets and long schedules, because an ensemble of k models requires k times the training and inference computation. In contrast, we introduce Multi-Token Enhancing (MTE), which extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to…
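
As a rough illustration of the idea described in the abstract, the NumPy sketch below pools several auxiliary tokens from one set of patch features and computes a distillation loss that pulls the global token's prediction toward the averaged auxiliary prediction. Everything specific here is an assumption for illustration and is not taken from the paper: the names `adaptive_pool`, `distill_loss`, `queries`, and `head` are hypothetical, attention-style pooling stands in for the "adaptively pooled tokens", mean pooling stands in for the global token, and cross-entropy stands in for whatever distillation objective the authors use.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_pool(patches, queries):
    """Hypothetical attention-style pooling: one pooled token per query.

    patches: (N, D) patch features, queries: (K, D) -> returns (K, D).
    """
    scores = queries @ patches.T / np.sqrt(patches.shape[1])  # (K, N)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ patches

def distill_loss(global_token, aux_tokens, head, temperature=1.0):
    """Cross-entropy from the averaged auxiliary prediction to the global one.

    In training, the target would be treated as a constant (an assumption).
    """
    target = softmax(aux_tokens @ head / temperature, axis=-1).mean(axis=0)
    logits = global_token @ head / temperature
    shifted = logits - logits.max()
    log_pred = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-(target * log_pred).sum())

rng = np.random.default_rng(0)
N, D, K, C = 16, 8, 3, 5  # patches, feature dim, auxiliary tokens, outputs
patches = rng.normal(size=(N, D))
global_token = patches.mean(axis=0)  # stand-in for the global token
queries = rng.normal(size=(K, D))    # hypothetical pooling queries
head = rng.normal(size=(D, C))       # hypothetical shared projection head

aux_tokens = adaptive_pool(patches, queries)
loss = distill_loss(global_token, aux_tokens, head)
print(aux_tokens.shape, loss)
```

Under these assumptions, the auxiliary path (`adaptive_pool` and the distillation term) is used only during training. At inference, only `global_token` would be computed, which is how the approach can avoid the k-fold inference cost of a k-model ensemble.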

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques