TL;DR
This paper reveals that Masked Autoencoders (MAE) inherently learn patch-level clustering early in training and introduces a self-guided masking strategy that improves learning efficiency and performance across vision tasks.
Contribution
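The paper shows that MAE intrinsically learns pattern-based patch-level clustering from the early stages of pretraining, and proposes a self-guided masking method that replaces random masking with informed masks generated from the model's own clustering progress, without external models or supplementary information.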
Findings
Self-guided masking improves downstream task performance.
MAE learns patch clustering early in training.
The method enhances learning efficiency without external models.
Abstract
Masked Autoencoder (MAE) is a self-supervised approach for representation learning, widely applicable to a variety of downstream tasks in computer vision. In spite of its success, it is still not fully uncovered what and how MAE exactly learns. In this paper, with an in-depth analysis, we discover that MAE intrinsically learns pattern-based patch-level clustering from surprisingly early stages of pretraining. Upon this understanding, we propose self-guided masked autoencoder, which internally generates informed mask by utilizing its progress in patch clustering, substituting the naive random masking of the vanilla MAE. Our approach significantly boosts its learning process without relying on any external models or supplementary information, keeping the benefit of self-supervised nature of MAE intact. Comprehensive experiments on various downstream tasks verify the effectiveness of the…
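To make the contrast in the abstract concrete, here is a minimal sketch of uniform random masking next to one possible "informed" mask that hides a group of mutually similar patches. The abstract does not say how the paper builds its mask, so the grouping rule below (rank patches by cosine similarity to the most "central" patch) is an illustrative assumption, not the authors' algorithm. The function names `random_mask` and `informed_mask` and the toy features are hypothetical.

```python
import math
import random


def random_mask(num_patches, mask_ratio, rng):
    """Random masking: choose the masked patch indices uniformly at random."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def informed_mask(patch_features, mask_ratio):
    """Hypothetical informed masking (an assumption, not the paper's rule).

    Picks the patch most similar to all others as an anchor, then masks the
    patches whose embeddings are closest to that anchor, so the mask covers
    one cluster of similar patches instead of scattered random positions.
    """
    n = len(patch_features)
    num_masked = int(n * mask_ratio)
    sim = [[cosine(patch_features[i], patch_features[j]) for j in range(n)]
           for i in range(n)]
    # Anchor: the patch with the highest total similarity to every patch.
    anchor = max(range(n), key=lambda i: sum(sim[i]))
    # Rank all patches by similarity to the anchor, most similar first.
    ranked = sorted(range(n), key=lambda j: sim[anchor][j], reverse=True)
    return sorted(ranked[:num_masked])


if __name__ == "__main__":
    # Toy embeddings: five patches share pattern A, three share pattern B.
    pattern_a, pattern_b = [1.0, 0.0], [0.0, 1.0]
    feats = [pattern_a] * 3 + [pattern_b] * 2 + [pattern_a] * 2 + [pattern_b]
    print("random  :", random_mask(len(feats), 0.625, random.Random(0)))
    print("informed:", informed_mask(feats, 0.625))
```

In this toy setup the informed mask covers exactly the five pattern-A patches, whereas the random mask mixes both patterns. In a real pipeline the features would come from the encoder being pretrained, which is what keeps the approach free of external models.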