Cluster and Predict Latent Patches for Improved Masked Image Modeling
Timothée Darcet, Federico Baldassarre, Maxime Oquab, Julien Mairal, Piotr Bojanowski

TL;DR
This paper introduces CAPI, a novel masked image modeling framework that predicts latent clusterings, leading to significant improvements in image classification and segmentation tasks over previous MIM methods.
Contribution
CAPI is a new pure-MIM framework that uses a clustering-based loss for stable training and better scaling, outperforming previous MIM methods and approaching the performance of DINOv2.
Findings
Achieves 83.8% accuracy on ImageNet with a ViT-L backbone and a simple linear probe
Obtains 32.1% mIoU on ADE20K with simple linear probes
Substantially outperforms previous MIM methods
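For readers unfamiliar with the segmentation metric quoted above, here is a minimal sketch of mean intersection-over-union (mIoU) over flattened label maps. The function name `mean_iou` and the `ignore_index` parameter are illustrative assumptions, not taken from the paper's code, and the exact ADE20K evaluation protocol (class set, ignored labels) may differ.

```python
def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes that appear in the prediction or the target.

    pred and target are flat sequences of integer class labels of equal length.
    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip pixels excluded from evaluation
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A mismatched pixel counts toward the union of both classes.
            union[t] += 1
            union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: four pixels, two classes.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.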
Abstract
Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. We release all our code and models.
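The abstract describes the loss only at a high level, so the following is a rough sketch of what predicting latent clusterings could look like, not the authors' implementation. It assumes a teacher that produces one latent vector per patch, a set of cluster centroids, and a student that outputs cluster logits for each masked patch; the names `cluster_prediction_loss`, `centroids`, and `temperature` are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_prediction_loss(teacher_patches, student_logits, centroids,
                            masked, temperature=0.1):
    """Mean cross-entropy on masked patches (illustrative assumption).

    Targets are soft cluster assignments of the teacher's patch latents;
    predictions are the student's softmax over its cluster logits.
    """
    losses = []
    for i in masked:
        # Soft assignment of the teacher latent to each centroid.
        similarities = [dot(teacher_patches[i], c) for c in centroids]
        target = softmax(similarities, temperature)
        pred = softmax(student_logits[i])
        losses.append(-sum(t * math.log(p) for t, p in zip(target, pred)))
    return sum(losses) / len(losses)


# Toy example: three patches, two clusters, patches 0 and 2 are masked.
centroids = [[1.0, 0.0], [0.0, 1.0]]
teacher_patches = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
student_logits = [[4.0, 0.0], [0.0, 4.0], [3.0, 0.0]]
loss = cluster_prediction_loss(teacher_patches, student_logits,
                               centroids, masked=[0, 2])
```

The loss is small when the student's predicted cluster for a masked patch matches the cluster the teacher's latent falls into, and grows when they disagree. How the centroids are learned and how the teacher is updated are left out here, since the text above does not specify them.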
Code & Models
- 🤗 birder-project/rope_vit_reg4_b14_capi · model · 74 dl
- 🤗 birder-project/rope_vit_reg4_b14_capi-imagenet21k · model · 70 dl
- 🤗 birder-project/rope_vit_reg4_b14_capi-places365 · model · 20 dl · ♡ 1
- 🤗 birder-project/rope_vit_reg4_b14_capi-inat21 · model · 117 dl
- 🤗 birder-project/rope_vit_reg8_so150m_p14_swiglu_rms_avg_capi · model · 8 dl
- 🤗 birder-project/rope_vit_reg8_so150m_p14_swiglu_rms_ap_rotnet-capi · model · 10 dl
Taxonomy
Topics: 3D Shape Modeling and Analysis
Methods: Mutual Information Machine/Mask Image Modeling
