From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

Anthony Bisulco; Rahul Ramesh; Randall Balestriero; Pratik Chaudhari

arXiv:2508.15404 · cs.CV · August 25, 2025

TL;DR

This paper analyzes how Masked Autoencoders (MAEs) learn spatial correlations in images, shows how hyperparameters such as masking ratio and patch size determine which features are learned, and offers guidance for choosing these hyperparameters in practice.

Contribution

It provides an analytical account of how MAE hyperparameters shape the spatial features the model learns, extending the analysis from linear to non-linear models.

Findings

01

Masking ratio and patch size influence the type of spatial correlations captured.

02

Non-linear MAEs adapt to dataset-specific spatial correlations beyond second-order statistics.

03

The analysis yields practical guidance for tuning MAE hyperparameters.
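To make finding 01 concrete, the sketch below hides a random subset of non-overlapping patches, in the style of the uniform random patch masking commonly used for MAE pretraining, and measures how far each masked pixel is from the nearest visible pixel. It is an illustration only, not the paper's analysis: the helper names (random_patch_mask, mean_distance_to_visible), the 32x32 image size, the fixed seed, and the use of Chebyshev distance are all assumptions chosen for readability. At a fixed masking ratio, a larger patch size leaves fewer visible pixels close to each masked pixel, so reconstruction has to lean on longer-range context.

```python
import random


def random_patch_mask(height, width, patch_size, mask_ratio, seed=0):
    """Return the set of masked pixel coordinates (row, col).

    Hypothetical helper: the image is split into non-overlapping
    patch_size x patch_size patches and round(mask_ratio * num_patches)
    of them are hidden, chosen uniformly at random without replacement.
    """
    rows, cols = height // patch_size, width // patch_size
    num_patches = rows * cols
    num_masked = round(mask_ratio * num_patches)
    rng = random.Random(seed)  # fixed seed so the mask is reproducible
    masked = set()
    for patch in rng.sample(range(num_patches), num_masked):
        pr, pc = divmod(patch, cols)  # patch row / column in the patch grid
        for r in range(pr * patch_size, (pr + 1) * patch_size):
            for c in range(pc * patch_size, (pc + 1) * patch_size):
                masked.add((r, c))
    return masked


def mean_distance_to_visible(height, width, masked):
    """Average Chebyshev distance from each masked pixel to its nearest visible pixel."""
    visible = [(r, c) for r in range(height) for c in range(width)
               if (r, c) not in masked]
    total = 0
    for r, c in masked:
        total += min(max(abs(r - vr), abs(c - vc)) for vr, vc in visible)
    return total / len(masked)


# Same masking ratio, different patch sizes: each configuration hides
# 768 of the 1,024 pixels; only the geometry of the hidden region changes.
for patch_size in (1, 4, 8):
    m = random_patch_mask(32, 32, patch_size, mask_ratio=0.75)
    print(patch_size, len(m), round(mean_distance_to_visible(32, 32, m), 2))
```

Because every pixel of a masked patch is hidden, a masked pixel can never be closer to visible context than the edge of its own patch; with 8x8 patches this alone guarantees an average distance of at least 1.875 pixels, however the patches are placed.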

Abstract

Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond…
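The abstract's linear-MAE result can be explored numerically with a deliberately simplified stand-in. This is a minimal sketch under stated assumptions, not the paper's derivation: it uses a 1D stationary Gaussian signal with covariance rho^|i-j|, replaces the MAE encoder/decoder with a single unconstrained linear map W (no bottleneck), applies the loss only to masked pixels, and computes expectations exactly by enumerating every patch mask. All function names (exp_covariance, all_visibility_masks, optimal_linear_mae, mean_weight_distance) are hypothetical. Setting the gradient of the expected loss to zero gives one linear system per output pixel, solved here with a pseudo-inverse.

```python
import itertools
import numpy as np


def exp_covariance(n, rho):
    # Assumed toy data model: stationary covariance Sigma_ij = rho ** |i - j|.
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def all_visibility_masks(n, patch_size, mask_ratio):
    # Every way to hide round(mask_ratio * num_patches) of the n // patch_size
    # patches; entry 1.0 means the pixel is visible, 0.0 means it is masked.
    num_patches = n // patch_size
    num_masked = round(mask_ratio * num_patches)
    masks = []
    for hidden in itertools.combinations(range(num_patches), num_masked):
        v = np.ones(n)
        for patch in hidden:
            v[patch * patch_size:(patch + 1) * patch_size] = 0.0
        masks.append(v)
    return np.array(masks)


def optimal_linear_mae(sigma, masks):
    # Row i of W minimizes E[(1 - v_i) * (W_i . (v * x) - x_i) ** 2] for
    # x ~ N(0, sigma) and v uniform over `masks`. Zeroing the gradient gives
    # A_i W_i = b_i with A_i = sigma * E[(1 - v_i) v v^T] (elementwise product)
    # and b_i = sigma[:, i] * E[(1 - v_i) v].
    n = sigma.shape[0]
    W = np.zeros((n, n))
    loss = 0.0
    for i in range(n):
        w_masked = 1.0 - masks[:, i]  # 1.0 for masks that hide pixel i
        second = (masks * w_masked[:, None]).T @ masks / len(masks)
        first = w_masked @ masks / len(masks)
        A = sigma * second
        b = sigma[:, i] * first
        # pinv gives zero weight to pixels never visible while i is masked
        # (e.g. the rest of i's own patch), where A and b are exactly zero.
        W[i] = np.linalg.pinv(A) @ b
        # At the optimum the expected loss is P(i masked) * sigma_ii - b . W_i.
        loss += w_masked.mean() * sigma[i, i] - b @ W[i]
    return W, loss


def mean_weight_distance(W):
    # Average |i - j| under the normalized weight magnitudes |W_ij| of each row.
    n = W.shape[0]
    dist = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    mag = np.abs(W)
    return float(np.mean((mag * dist).sum(axis=1) / mag.sum(axis=1)))


sigma = exp_covariance(16, rho=0.9)
for patch_size in (1, 2, 4):
    masks = all_visibility_masks(16, patch_size, mask_ratio=0.5)
    W, loss = optimal_linear_mae(sigma, masks)
    print(patch_size, round(loss, 3), round(mean_weight_distance(W), 2))
```

In this toy setting the patch size directly constrains what W can use: pixels that share a patch with the target are never visible when it is masked, so they receive zero weight and the predictor must draw on more distant pixels as patches grow.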