MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments
Spyros Gidaris, Andrei Bursuc, Oriane Simeoni, Antonin Vobecky, Nikos, Komodakis, Matthieu Cord, Patrick P\'erez

TL;DR
MOCA is a self-supervised learning method that combines masked image modeling and contrastive learning using high-level features, achieving state-of-the-art results efficiently.
Contribution
It introduces a unified mask-and-predict framework that leverages high-level features, combining two paradigms for improved efficiency and performance.
Findings
State-of-the-art results in low-shot settings
Training at least 3 times faster than prior methods
Effective combination of masked modeling and contrastive learning
Abstract
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks for very large fully-annotated datasets. Different classes of self-supervised learning offer representations with either good contextual reasoning properties, e.g., using masked image modeling strategies, or invariance to image perturbations, e.g., with contrastive methods. In this work, we propose a single-stage and standalone method, MOCA, which unifies both desired properties using novel mask-and-predict objectives defined with high-level features (instead of pixel-level details). Moreover, we show how to effectively employ both learning paradigms in a synergistic and computation-efficient way. Doing so, we achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols with a training that is at least 3 times faster than prior…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsDomain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization · Label Smoothing
