MOMA: Distill from Self-Supervised Teachers
Yuchong Yao, Nandakishor Desai, Marimuthu Palaniswami

TL;DR
MOMA is a self-supervised distillation framework that combines knowledge from contrastive learning and masked image modeling to produce compact, high-performing models efficiently.
Contribution
It introduces a novel method to distill knowledge from pre-trained MoCo and MAE models into a single student, enhancing performance while reducing training costs.
Findings
MOMA achieves competitive results on various benchmarks.
The method reduces training epochs and computational costs.
It effectively combines two self-supervised paradigms for improved representations.
Abstract
Contrastive Learning and Masked Image Modelling have demonstrated exceptional performance on self-supervised representation learning, where Momentum Contrast (i.e., MoCo) and Masked AutoEncoder (i.e., MAE) are the state-of-the-art, respectively. In this work, we propose MOMA to distill from pre-trained MoCo and MAE in a self-supervised manner to combine the knowledge from both paradigms. We introduce three different mechanisms of knowledge transfer in the proposed MOMA framework: (1) Distill pre-trained MoCo to MAE. (2) Distill pre-trained MAE to MoCo. (3) Distill pre-trained MoCo and MAE to a randomly initialized student. During the distillation, the teacher and the student are fed with original inputs and masked inputs, respectively. The learning is enabled by aligning the normalized representations from the teacher and the projected representations from the student. This simple…
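The alignment step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the mean-pooling "encoder", the linear projection head, the mean-squared-error objective, the 0.75 mask ratio, and every function name below are assumptions chosen for illustration. Only the overall shape comes from the abstract: the teacher sees the original input, the student sees a masked input, and the loss compares the normalized teacher representation with the projected student representation.

```python
import math
import random


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def random_mask(tokens, mask_ratio, rng):
    """Drop a random subset of tokens and return the kept ones in order."""
    n_keep = max(1, round(len(tokens) * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx]


def mean_pool(tokens):
    """Average token vectors; a stand-in for a real encoder (assumption)."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def project(vec, weight):
    """Linear projection head; `weight` is a list of output rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def alignment_loss(teacher_repr, student_proj):
    """MSE between the normalized teacher output and the student projection.

    MSE is an assumed choice; the abstract does not name the loss.
    """
    target = l2_normalize(teacher_repr)
    return sum((t - s) ** 2 for t, s in zip(target, student_proj)) / len(target)


def distillation_step(tokens, proj_weight, mask_ratio=0.75, seed=0):
    """One forward pass: teacher on the full input, student on the masked input."""
    rng = random.Random(seed)
    teacher_repr = mean_pool(tokens)
    student_repr = mean_pool(random_mask(tokens, mask_ratio, rng))
    return alignment_loss(teacher_repr, project(student_repr, proj_weight))


if __name__ == "__main__":
    data_rng = random.Random(42)
    tokens = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    print("alignment loss:", distillation_step(tokens, identity))
```

In a real setup the teacher would be a frozen pre-trained network and only the student and its projection head would receive gradients; the sketch computes the loss value only.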
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Masked autoencoder · InfoNCE · Batch Normalization · Momentum Contrast
