Beyond [cls]: Exploring the true potential of Masked Image Modeling representations
Marcin Przewięźlikowski, Randall Balestriero, Wojciech Jasiński, Marek Śmieja, Bartosz Zieliński

TL;DR
This paper investigates why Masked Image Modeling (MIM) performs poorly out-of-the-box and introduces Selective Aggregation to enhance its effectiveness without fine-tuning.
Contribution
The paper traces MIM's poor out-of-the-box performance to ineffective aggregation of patch tokens by the [cls] token and proposes Selective Aggregation to improve the resulting representations.
Findings
Attention in MIM is spread almost uniformly over many patches
Selective Aggregation improves out-of-the-box performance
Enhanced semantic capture from patch tokens
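One common way to quantify how uniformly attention is spread is the normalized entropy of an attention distribution. The sketch below is illustrative only and is not taken from the paper: the function names, the 196-patch count, and the example weights are assumptions chosen for the demonstration.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights]  # renormalize defensively
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(weights):
    """Entropy divided by log(N); 1.0 means perfectly uniform attention."""
    n = len(weights)
    if n < 2:
        return 0.0
    return attention_entropy(weights) / math.log(n)

# Hypothetical example: 196 patches (a 14 x 14 grid is assumed here).
uniform = [1.0 / 196] * 196           # attention spread evenly
peaked = [0.9] + [0.1 / 195] * 195    # attention focused on one patch

print(normalized_entropy(uniform))
print(normalized_entropy(peaked))
```

A value near 1 indicates attention that barely distinguishes between patches, while a value well below 1 indicates attention concentrated on a few patches.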
Abstract
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective Aggregation to better capture the rich semantic information retained in patch tokens, which…
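To make the aggregation problem concrete, the sketch below contrasts near-uniform weighting of patch tokens with a more selective weighting. It is a toy illustration under stated assumptions, not the paper's Selective Aggregation method: the `aggregate` helper, the two-dimensional tokens, and the scores are all hypothetical, and the abstract above does not specify how the scores would be obtained.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def aggregate(patch_tokens, scores):
    """Weighted sum of patch tokens, with weights = softmax(scores)."""
    weights = softmax(scores)
    dim = len(patch_tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, patch_tokens))
        for d in range(dim)
    ]

# Hypothetical tokens: one "informative" patch and three background patches.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]

# Equal scores reduce to mean pooling: the informative patch is diluted.
mean_pooled = aggregate(patches, [0.0, 0.0, 0.0, 0.0])

# A higher score on the first patch lets it dominate the summary vector.
selective = aggregate(patches, [5.0, 0.0, 0.0, 0.0])

print(mean_pooled)
print(selective)
```

With equal scores the informative patch contributes only a quarter of the summary vector, which mirrors the dilution that uniform attention causes; raising its score lets it dominate.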
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Data Visualization and Analytics · 3D Modeling in Geospatial Applications
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · Mutual Information Machine/Mask Image Modeling
