Robust Representation Learning in Masked Autoencoders
Anika Shrivastava, Renu Rameshan, Samar Agnihotri

TL;DR
This paper investigates why Masked Autoencoders (MAEs) perform well in image classification, revealing their robust representations, class-aware latent space construction, and persistent global attention, supported by new analysis methods.
Contribution
It introduces a detailed layer-wise analysis of MAE representations, demonstrating their robustness and class separation, and proposes new metrics for feature sensitivity under degradations.
Findings
MAE representations are robust to image degradations.
Pretrained MAE embeddings become increasingly class-separable across layers.
MAE exhibits persistent global attention unlike standard ViTs.
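As an illustration of the robustness finding, the sketch below measures how far an embedding moves when its input is blurred. It is not the paper's metric: the score (1 minus cosine similarity), the `box_blur` degradation, and the random linear map `toy_embed` standing in for an encoder are all assumptions made for this example. With a real model, `toy_embed` would be replaced by a function returning that model's embedding for an image.

```python
import numpy as np

def box_blur(image, k=3):
    """Blur a 2-D grayscale image with a k x k mean filter (k odd, edge-padded)."""
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image, dtype=float)
    h, w = image.shape
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def cosine_sensitivity(embed, clean, degraded):
    """Return 1 - cosine similarity between the two embeddings (0 means unchanged)."""
    a, b = embed(clean), embed(degraded)
    cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - cos

rng = np.random.default_rng(0)
# Hypothetical stand-in for an encoder: a fixed random linear projection.
W = rng.standard_normal((16, 32 * 32))

def toy_embed(img):
    return W @ img.ravel()

image = rng.random((32, 32))
same = cosine_sensitivity(toy_embed, image, image)
blurred = cosine_sensitivity(toy_embed, image, box_blur(image, k=5))
print(f"identical input: {same:.3g}, blurred input: {blurred:.3g}")
```

A lower score under a given degradation would indicate a representation that is less sensitive to it. Comparing scores across layers or across models is one way such a measure could be used.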
Abstract
Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less well understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In the process, we discover that the representations learned through pretraining and fine-tuning are quite robust, demonstrating good classification performance in the presence of degradations such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify…
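One common way to compare class subspaces of the kind the abstract describes is through principal angles. The sketch below is a minimal illustration under stated assumptions, not the paper's procedure: it fits an uncentered, rank-`dim` subspace to each class's embeddings with an SVD and reports the angles between two such subspaces. The function names, the choice of `dim`, and the synthetic 8-dimensional data are all hypothetical.

```python
import numpy as np

def class_subspace(embeddings, dim):
    """Orthonormal basis (d x dim) for the top-`dim` right singular directions
    of an (n_samples x d) embedding matrix. Uncentered, so it is a linear subspace."""
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    return vt[:dim].T

def principal_angles(basis_a, basis_b):
    """Principal angles in radians between subspaces given by orthonormal bases."""
    s = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

rng = np.random.default_rng(0)
d, n = 8, 50
# Synthetic "classes": class A lives mostly in axes 0-1, class B in axes 2-3.
class_a = 0.01 * rng.standard_normal((n, d))
class_a[:, 0:2] += rng.standard_normal((n, 2))
class_b = 0.01 * rng.standard_normal((n, d))
class_b[:, 2:4] += rng.standard_normal((n, 2))

basis_a = class_subspace(class_a, dim=2)
basis_b = class_subspace(class_b, dim=2)
angles_ab = principal_angles(basis_a, basis_b)   # near pi/2: well separated
angles_aa = principal_angles(basis_a, basis_a)   # near 0: same subspace
print(angles_ab, angles_aa)
```

Applied layer by layer to embeddings from a real encoder, growing principal angles between class subspaces would be consistent with the increasing separability the abstract reports.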
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
