The Dynamic Duo of Collaborative Masking and Target for Advanced Masked Autoencoder Learning

Shentong Mo

arXiv:2412.17566·cs.CV·December 24, 2024

TL;DR

This paper introduces CMT-MAE, a self-supervised learning framework that enhances masked autoencoders by combining collaborative masking and collaborative targets, both driven by feedback between teacher and student models. It achieves state-of-the-art linear probing and fine-tuning results on ImageNet-1K.

Contribution

It proposes a collaborative masking and target mechanism leveraging teacher-student feedback, improving masked autoencoder performance on vision tasks.

Findings

01. Achieves state-of-the-art linear probing and fine-tuning performance on ImageNet-1K.

02. Improves ViT-base fine-tuning results from 83.6% to 85.7%.

03. Demonstrates the effectiveness of collaborative feedback in self-supervised learning.

Abstract

Masked autoencoders (MAE) have recently succeeded in self-supervised vision representation learning. Previous work mainly applied custom-designed (e.g., random, block-wise) masking or teacher (e.g., CLIP)-guided masking and targets. However, these approaches ignore the potential role of the self-training (student) model in giving feedback to the teacher for masking and targets. In this work, we propose CMT-MAE, which integrates Collaborative Masking and Targets to boost Masked AutoEncoders. Specifically, CMT-MAE uses a simple collaborative masking mechanism that linearly aggregates attention from both the teacher and student models. We further propose using the output features from those two models as the collaborative target of the decoder. Our simple and effective framework, pre-trained on ImageNet-1K, achieves state-of-the-art linear probing and fine-tuning performance. In…
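
To make the two mechanisms in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The abstract states only that attention from the teacher and student is linearly aggregated and that the output features of both models form the decoder target. Everything else is an assumption made for illustration: the function names, the mixing weights `alpha` and `beta`, the rule of masking the lowest-scoring patches, and the weighted sum used to combine the features.

```python
from typing import List


def collaborative_scores(teacher_attn: List[float],
                         student_attn: List[float],
                         alpha: float = 0.5) -> List[float]:
    """Linearly aggregate per-patch attention from teacher and student.

    alpha is a hypothetical mixing weight (1.0 = teacher only).
    """
    if len(teacher_attn) != len(student_attn):
        raise ValueError("attention vectors must have the same length")
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(teacher_attn, student_attn)]


def select_mask(scores: List[float], mask_ratio: float = 0.75) -> List[bool]:
    """Return True for each patch that is masked.

    Assumption: the lowest-scoring patches are masked and the
    highest-scoring ones stay visible. The paper may use another rule.
    """
    num_masked = int(round(mask_ratio * len(scores)))
    # Patch indices ordered from lowest to highest aggregated score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    masked = set(order[:num_masked])
    return [i in masked for i in range(len(scores))]


def collaborative_target(teacher_feat: List[float],
                         student_feat: List[float],
                         beta: float = 0.5) -> List[float]:
    """Combine teacher and student output features into one target.

    Assumption: a weighted sum with hypothetical weight beta.
    """
    if len(teacher_feat) != len(student_feat):
        raise ValueError("feature vectors must have the same length")
    return [beta * t + (1.0 - beta) * s
            for t, s in zip(teacher_feat, student_feat)]


if __name__ == "__main__":
    teacher = [0.7, 0.1, 0.1, 0.1]   # toy attention over 4 patches
    student = [0.5, 0.3, 0.1, 0.1]
    scores = collaborative_scores(teacher, student, alpha=0.5)
    print(select_mask(scores, mask_ratio=0.5))
```

In this toy setup, setting `alpha` to 1.0 reduces the masking to purely teacher-guided masking, which is the prior approach the abstract contrasts against. Intermediate values let the student's attention influence which patches are hidden.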

Taxonomy

Topics: Neural Networks and Applications · Image Processing and 3D Reconstruction

Methods: Masked autoencoder