Visualizing the loss landscape of Self-supervised Vision Transformer
Youngwan Lee, Jeffrey Ryan Willette, Jonghee Kim, Sung Ju Hwang

TL;DR
This paper visualizes the loss landscapes of self-supervised vision transformers trained with MAE and RC-MAE and compares them against a supervised ViT, revealing smoother loss curvature and wider convex regions that help explain their better generalization.
Contribution
The paper introduces the first visualization of loss landscapes for self-supervised ViT, highlighting how MAE and RC-MAE pre-training improve optimization and generalization compared to supervised training.
Findings
MAE-ViT exhibits smoother, wider loss landscapes than supervised ViT.
EMA-teacher in RC-MAE widens convexity regions, aiding convergence.
Loss landscape visualization explains better generalization of self-supervised ViT.
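A common way to draw a loss landscape is to evaluate the loss on a two-dimensional slice through weight space around the trained weights. The sketch below illustrates that general idea only; it is not taken from the paper, whose exact procedure is not given in this summary. The names `loss_surface` and `toy_loss`, the random Gaussian directions, and the quadratic toy loss are all assumptions made for illustration.

```python
import random


def loss_surface(loss_fn, theta, d1, d2, span=1.0, steps=5):
    """Evaluate loss_fn on the plane theta + a*d1 + b*d2, with a and b in [-span, span]."""
    coords = [-span + 2.0 * span * i / (steps - 1) for i in range(steps)]
    grid = []
    for a in coords:
        row = []
        for b in coords:
            # Move away from the reference weights along the two directions.
            point = [t + a * u + b * v for t, u, v in zip(theta, d1, d2)]
            row.append(loss_fn(point))
        grid.append(row)
    return coords, grid


def toy_loss(weights):
    """Hypothetical stand-in for a network loss: a convex bowl with its minimum at the origin."""
    return sum(w * w for w in weights)


rng = random.Random(0)                 # fixed seed so the slice is reproducible
theta = [0.0, 0.0, 0.0]                # stand-in for trained weights
d1 = [rng.gauss(0.0, 1.0) for _ in theta]
d2 = [rng.gauss(0.0, 1.0) for _ in theta]

coords, grid = loss_surface(toy_loss, theta, d1, d2)
```

Plotting `grid` as a contour or surface over `coords` gives the kind of picture from which smoothness and the width of the convex region are read off. With a real network, `theta` would be the flattened trained parameters and `loss_fn` would run a forward pass on held-out data.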
Abstract
The masked autoencoder (MAE) has drawn attention as a representative self-supervised approach for masked image modeling with vision transformers. However, even though MAE generalizes better than fully supervised training from scratch, the reason has not been explored. In another line of work, the Reconstruction Consistent Masked Auto Encoder (RC-MAE) has been proposed, which adds a self-distillation scheme to MAE in the form of an exponential moving average (EMA) teacher, and it has been shown that the EMA teacher performs a conditional gradient correction during optimization. To further investigate the reason for the better generalization of the self-supervised ViT when trained by MAE (MAE-ViT) and the effect of the gradient correction of RC-MAE from the perspective of optimization, we visualize the loss landscapes of the self-supervised vision transformer by…
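The EMA teacher mentioned in the abstract keeps a slowly moving copy of the student's weights. The sketch below shows the standard EMA update rule on plain Python lists. It is a minimal illustration, not RC-MAE's implementation: the function name `ema_update` and the momentum value are assumptions, and the conditional gradient correction analyzed in the paper is not modeled here.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return the teacher weights after one EMA step toward the student weights."""
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    # Each teacher weight keeps `momentum` of its old value and takes
    # the remaining (1 - momentum) from the current student weight.
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]                  # held fixed here; in training it changes every step
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.5)
```

With a fixed student, the teacher closes a fraction `1 - momentum` of the remaining gap on every step, so a momentum close to 1 yields a teacher that changes slowly and averages over many recent student states.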
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Attention Is All You Need · Dense Connections · Softmax · Layer Normalization · Linear Layer · Masked autoencoder · Multi-Head Attention · Residual Connection · Vision Transformer
