Robust Representation Learning in Masked Autoencoders

Anika Shrivastava; Renu Rameshan; Samar Agnihotri

arXiv:2602.03531·cs.LG·February 4, 2026

Open Access

TL;DR

This paper investigates why Masked Autoencoders (MAEs) perform well in image classification, revealing their robust representations, class-aware latent space construction, and persistent global attention, supported by new analysis methods.

Contribution

It introduces a detailed layer-wise analysis of MAE representations, demonstrating their robustness and class separation, and proposes new metrics for feature sensitivity under degradations.

Findings

01. MAE representations are robust to image degradations.

02. Pretrained MAE embeddings become increasingly class-separable across layers.

03. MAE exhibits persistent global attention, unlike standard ViTs.

Abstract

Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAE. In the process, we discover that the representations learned through pretraining and fine-tuning are quite robust, maintaining good classification performance in the presence of degradations such as blur and occlusions. Through layer-wise analysis of token embeddings, we show that a pretrained MAE progressively constructs its latent space in a class-aware manner across network depth: embeddings from different classes lie in subspaces that become increasingly separable. We further observe that MAE exhibits early and persistent global attention across encoder layers, in contrast to standard Vision Transformers (ViTs). To quantify…
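
The claim that class embeddings lie in increasingly separable subspaces can be made concrete with a small sketch. The code below is an illustration only and is not necessarily the measure used in the paper: it assumes each class's embeddings at a given layer are available as a NumPy array of shape (n_samples, dim), fits a low-rank subspace per class with an SVD, and reports the mean principal angle between two class subspaces. The function names `class_subspace`, `principal_angles`, and `subspace_separation`, and the `rank` parameter, are hypothetical.

```python
import numpy as np


def class_subspace(embeddings: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (dim x rank) for the dominant subspace of one class.

    Assumption: `embeddings` has shape (n_samples, dim), one row per token
    or image embedding taken from a single layer.
    """
    # Rows of vt are orthonormal directions in embedding space, ordered by
    # singular value; the first `rank` rows span the dominant subspace.
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    return vt[:rank].T


def principal_angles(basis_a: np.ndarray, basis_b: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between two subspaces with orthonormal bases."""
    # The singular values of A^T B are the cosines of the principal angles.
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return np.arccos(np.clip(cosines, 0.0, 1.0))


def subspace_separation(emb_a: np.ndarray, emb_b: np.ndarray, rank: int = 2) -> float:
    """Mean principal angle between two class subspaces (0 = same, pi/2 = orthogonal)."""
    angles = principal_angles(class_subspace(emb_a, rank), class_subspace(emb_b, rank))
    return float(np.mean(angles))
```

Under these assumptions, computing `subspace_separation` for the same pair of classes at each encoder layer would give one number per layer; a value that grows with depth would be consistent with the trend described in the abstract. The choice of `rank` is a free parameter here and would need to be tuned for real embeddings.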

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning