Cluster and Predict Latent Patches for Improved Masked Image Modeling

Timothée Darcet; Federico Baldassarre; Maxime Oquab; Julien Mairal; Piotr Bojanowski

arXiv:2502.08769 · cs.CV · July 1, 2025

TL;DR

This paper introduces CAPI, a pure masked image modeling (MIM) framework that predicts latent clusterings. With a ViT-L backbone and simple linear probes, it reaches 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K, substantially outperforming previous MIM methods.

Contribution

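CAPI is a new pure-MIM framework that uses a clustering-based loss, which is stable to train and shows promising scaling properties. It substantially outperforms previous MIM methods and approaches the performance of the current state of the art, DINOv2.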

Findings

01. Achieves 83.8% accuracy on ImageNet with ViT-L

02. Obtains 32.1% mIoU on ADE20K with simple linear probes

03. Substantially outperforms previous MIM methods
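
For readers unfamiliar with the segmentation metric quoted above, the following is a minimal sketch of mean intersection-over-union (mIoU) on a single pair of label maps. The function name `mean_iou` and the toy arrays are illustrative assumptions, not taken from the paper's code; benchmark evaluations typically accumulate intersections and unions over the whole dataset rather than per image, and may handle ignored labels differently.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Class absent from both maps: skip it instead of counting it as 0 or 1.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)  # class 0: 1/2, class 1: 2/3
```

In this toy case the per-class IoUs are 1/2 and 2/3, so the mean is 7/12.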

Abstract

Masked Image Modeling (MIM) offers a promising approach to self-supervised representation learning; however, existing MIM models still lag behind the state of the art. In this paper, we systematically analyze target representations, loss functions, and architectures to introduce CAPI, a novel pure-MIM framework that relies on the prediction of latent clusterings. Our approach leverages a clustering-based loss, which is stable to train and exhibits promising scaling properties. Our ViT-L backbone, CAPI, achieves 83.8% accuracy on ImageNet and 32.1% mIoU on ADE20K with simple linear probes, substantially outperforming previous MIM methods and approaching the performance of the current state of the art, DINOv2. We release all our code and models.
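
The abstract does not spell out the loss, so the following is only a rough sketch of what "predicting latent clusterings" could look like, under stated assumptions: a teacher produces patch latents, each latent is softly assigned to a set of prototypes (cluster centers), and a student is trained with cross-entropy to predict those assignments at masked positions. The function names, the cosine-similarity assignment, the temperature, and the random data are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def cluster_targets(teacher_latents, prototypes, temperature=0.1):
    """Softly assign each patch latent to clusters (assumed scheme: cosine similarity + softmax)."""
    z = teacher_latents / np.linalg.norm(teacher_latents, axis=-1, keepdims=True)
    c = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    return np.exp(log_softmax(z @ c.T / temperature))

def masked_cluster_loss(student_logits, targets, mask):
    """Cross-entropy between target and predicted cluster distributions, averaged over masked patches."""
    per_patch = -(targets * log_softmax(student_logits)).sum(axis=-1)
    return float((per_patch * mask).sum() / mask.sum())

# Toy setup: 16 patches, 8-dimensional latents, 4 clusters (all sizes are arbitrary).
rng = np.random.default_rng(0)
teacher_latents = rng.normal(size=(16, 8))
prototypes = rng.normal(size=(4, 8))
targets = cluster_targets(teacher_latents, prototypes)

mask = np.zeros(16)
mask[::2] = 1.0  # pretend every other patch was masked out for the student

student_logits = rng.normal(size=(16, 4))  # stand-in for the student's predictions
loss = masked_cluster_loss(student_logits, targets, mask)

# A student that reproduces the targets exactly attains the entropy of the targets,
# which is the lowest value this cross-entropy can take.
best = masked_cluster_loss(np.log(targets + 1e-12), targets, mask)
```

Because the targets are distributions over a fixed number of clusters rather than raw pixels or unbounded latent vectors, the loss is a bounded-below cross-entropy, which is one plausible reading of why a clustering-based objective might be stable to train.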

Taxonomy

Topics: 3D Shape Modeling and Analysis

Methods: Mutual Information Machine/Mask Image Modeling