CoMAD: A Multiple-Teacher Self-Supervised Distillation Framework
Sriram Mandalika, Lalitha V

TL;DR
CoMAD is a novel self-supervised distillation framework that unifies multiple vision transformer teachers into a compact student, improving performance on image classification and dense prediction tasks.
Contribution
Introduces a parameter-free, multi-teacher distillation method with asymmetric masking and consensus gating for efficient self-supervised learning.
Findings
Achieves 75.4% Top-1 accuracy on ImageNet-1K with ViT-Tiny.
Sets a new state of the art on dense prediction tasks among compact SSL models.
Improves on previous methods by effectively integrating priors from multiple teachers.
Abstract
Numerous self-supervised learning paradigms, such as contrastive learning and masked image modeling, learn powerful representations from unlabeled data but are typically pretrained in isolation, overlooking complementary insights and yielding large models that are impractical for resource-constrained deployment. To overcome these challenges, we introduce Consensus-oriented Masked Distillation (CoMAD), a lightweight, parameter-free framework that unifies knowledge from multiple current state-of-the-art self-supervised Vision Transformers into a compact student network. CoMAD distills from three pretrained ViT-Base teachers, MAE, MoCo v3, and iBOT, each offering distinct semantic and contextual priors. Rather than naively averaging teacher outputs, we apply asymmetric masking: the student sees only 25 percent of patches while each teacher receives a progressively lighter, unique mask,…
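The asymmetric masking described in the abstract can be illustrated with a minimal sketch. Only the student's 25 percent visible ratio comes from the text. The teacher visible ratios (50, 60, and 70 percent), the 196-patch grid, and the helper names `random_keep_mask` and `asymmetric_masks` are assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def random_keep_mask(num_patches, keep_ratio, rng):
    """Return a boolean mask of length num_patches; True marks a visible patch."""
    num_keep = int(round(num_patches * keep_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    # Pick num_keep distinct patch indices uniformly at random.
    mask[rng.permutation(num_patches)[:num_keep]] = True
    return mask


def asymmetric_masks(num_patches=196, student_keep=0.25,
                     teacher_keeps=(0.50, 0.60, 0.70), seed=0):
    """Sample one heavy mask for the student and a lighter, independently
    drawn mask for each teacher.

    student_keep=0.25 follows the abstract; teacher_keeps are hypothetical
    values standing in for the "progressively lighter" teacher masks.
    """
    rng = np.random.default_rng(seed)
    student = random_keep_mask(num_patches, student_keep, rng)
    teachers = [random_keep_mask(num_patches, r, rng) for r in teacher_keeps]
    return student, teachers


student, teachers = asymmetric_masks()
```

Each teacher mask is drawn independently, so the three teachers see different subsets of the image, and all of them see more patches than the student does.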
