Multi-Token Enhancing for Vision Representation Learning
Zhong-Yu Li, Yu-Song Hu, Bo-Wen Yin, Ming-Ming Cheng

TL;DR
This paper introduces Multi-Token Enhancing (MTE), a method that extracts multiple auxiliary tokens during training to improve vision representation learning and distills their knowledge into a global token, so inference costs do not increase.
Contribution
MTE is a novel approach that enhances self-supervised vision models by using auxiliary tokens during training and distilling their knowledge, avoiding additional inference costs.
Findings
Consistently improves performance across downstream tasks.
Compatible with various self-supervised loss functions and architectures.
Avoids additional inference costs by distilling the auxiliary tokens' knowledge into the global token.
Abstract
Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of the vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning that requires large-scale datasets and long schedules. This is because they require k times more training and inference computation costs for an ensemble of k models. Differently, we introduce Multi-Token Enhancing (MTE) that extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training costs and no additional inference costs. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to…
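The idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The attention-style pooling, the prototype-based soft assignments, the temperature, and every name in it (`attention_pool`, `distill_loss`, `prototypes`, `queries`) are assumptions made for illustration. Random vectors stand in for the outputs of a real backbone.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so the logits below stay bounded.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(patches, query):
    # Hypothetical adaptive pooling: weight N patch tokens (N, D) by their
    # similarity to a query vector (D,) and return one pooled token (D,).
    weights = softmax(patches @ query / np.sqrt(patches.shape[1]))
    return weights @ patches

def distill_loss(global_token, aux_tokens, prototypes, temperature=0.1):
    # Assumed distillation objective: cross-entropy between the mean
    # auxiliary-token distribution (teacher) and the global token's
    # distribution (student) over a set of prototypes.
    protos = l2_normalize(prototypes)
    teacher = softmax(l2_normalize(aux_tokens) @ protos.T / temperature).mean(axis=0)
    student = softmax(l2_normalize(global_token) @ protos.T / temperature)
    return float(-(teacher * np.log(student)).sum())

rng = np.random.default_rng(0)
D, N, K = 16, 49, 8                      # feature dim, patch count, prototype count
patches = rng.normal(size=(N, D))        # stand-in for patch tokens
global_cls = rng.normal(size=D)          # the only token kept at inference
aux_cls = rng.normal(size=(2, D))        # stand-in for auxiliary CLS tokens
queries = rng.normal(size=(2, D))        # hypothetical pooling queries

pooled = np.stack([attention_pool(patches, q) for q in queries])
aux_tokens = np.concatenate([aux_cls, pooled])   # 4 auxiliary tokens in total
prototypes = rng.normal(size=(K, D))

loss = distill_loss(global_cls, aux_tokens, prototypes)
print(f"distillation loss: {loss:.4f}")
```

In this sketch only `global_cls` would be needed after training, which is how distillation keeps the inference cost unchanged: the auxiliary tokens and pooling queries are used only when computing the training loss.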
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
