The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning
Shentong Mo

TL;DR
This paper introduces CMT-MAE, a self-supervised learning framework that enhances masked autoencoders by deriving both the masking and the reconstruction targets collaboratively, through feedback between teacher and student models, achieving state-of-the-art linear probing and fine-tuning results on ImageNet-1K.
Contribution
It proposes a collaborative masking and target mechanism leveraging teacher-student feedback, improving masked autoencoder performance on vision tasks.
Findings
Achieves state-of-the-art linear probing and fine-tuning performance on ImageNet-1K.
Improves ViT-base fine-tuning results from 83.6% to 85.7%.
Demonstrates the effectiveness of collaborative feedback in self-supervised learning.
Abstract
Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher-guided (e.g., CLIP) masking and targets. However, these approaches ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we integrate Collaborative Masking and Targets to boost Masked AutoEncoders, in a framework named CMT-MAE. Specifically, CMT-MAE leverages a simple collaborative masking mechanism through linear aggregation across attentions from both teacher and student models. We further propose using the output features from those two models as the collaborative target of the decoder. Our simple and effective framework pre-trained on ImageNet-1K achieves state-of-the-art linear probing and fine-tuning performance. In…
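The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the weights `alpha` and `beta`, the 0.75 mask ratio, the choice to mask the highest-scoring patches, and the weighted-sum form of the target are all assumptions made for illustration. The abstract only states that attentions are linearly aggregated and that output features from both models serve as the target.

```python
import numpy as np

def collaborative_mask(teacher_attn, student_attn, alpha=0.5, mask_ratio=0.75):
    """Return a boolean patch mask (True = masked).

    teacher_attn, student_attn: 1-D per-patch attention scores.
    alpha and mask_ratio are hypothetical hyperparameters.
    """
    teacher_attn = np.asarray(teacher_attn, dtype=float)
    student_attn = np.asarray(student_attn, dtype=float)
    # Linear aggregation of the teacher and student attention maps.
    combined = alpha * teacher_attn + (1.0 - alpha) * student_attn
    num_patches = combined.shape[0]
    num_masked = int(round(mask_ratio * num_patches))
    # Assumption: the patches with the highest combined score are masked.
    order = np.argsort(-combined, kind="stable")
    mask = np.zeros(num_patches, dtype=bool)
    mask[order[:num_masked]] = True
    return mask

def collaborative_target(teacher_feat, student_feat, beta=0.5):
    """Blend teacher and student output features into one decoder target.

    Assumption: a weighted sum; the abstract does not say how the two
    feature sets are combined.
    """
    teacher_feat = np.asarray(teacher_feat, dtype=float)
    student_feat = np.asarray(student_feat, dtype=float)
    return beta * teacher_feat + (1.0 - beta) * student_feat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 196 patches corresponds to a 14 x 14 grid, used here only as an example.
    mask = collaborative_mask(rng.random(196), rng.random(196))
    print(mask.sum(), "of", mask.size, "patches masked")
```

Setting `alpha=1.0` reduces the masking to a purely teacher-guided scheme, which makes the student's contribution easy to ablate in this sketch.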
Taxonomy
Topics: Neural Networks and Applications · Image Processing and 3D Reconstruction
Methods: Masked autoencoder
