SRMAE: Masked Image Modeling for Scale-Invariant Deep Representations
Zhiming Wang, Lin Gu, Feng Lu

TL;DR
SRMAE introduces a self-supervised masked image modeling approach using scale as a signal, leveraging super-resolution techniques to learn scale-invariant representations and achieve state-of-the-art results on low-resolution recognition tasks.
Contribution
The paper proposes a novel scale-aware masked autoencoder framework that incorporates super-resolution for improved scale-invariant visual representations.
Findings
Achieves 82.1% accuracy on ImageNet-1K after 400 epochs of pre-training.
Achieves the best performance on very low resolution (VLR) recognition, surpassing DeriveNet by 1.3%.
Reaches 74.84% accuracy on low-resolution facial expression recognition, outperforming the prior state of the art by 9.48%.
Abstract
Due to the prevalence of scale variance in natural images, we propose to use image scale as a self-supervised signal for Masked Image Modeling (MIM). Our method selects random patches from the input image and downsamples them to a low-resolution format. Our framework uses the latest advances in super-resolution (SR) to design the prediction head, which reconstructs the input from the low-resolution clues and the other patches. After 400 epochs of pre-training, our Super Resolution Masked Autoencoders (SRMAE) achieve an accuracy of 82.1% on the ImageNet-1K task. The image-scale signal also allows SRMAE to capture scale-invariant representations. For the very low resolution (VLR) recognition task, our model achieves the best performance, surpassing DeriveNet by 1.3%. Our method also achieves an accuracy of 74.84% on the task of recognizing low-resolution facial expressions, surpassing…
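The patch-degradation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `downsample_random_patches`, the patch size, the selection ratio, the downsampling factor, average pooling as the downsampling operator, and the nearest-neighbour upsampling used to keep the image shape are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def downsample_random_patches(image, patch_size=16, ratio=0.75, factor=4, seed=0):
    """Replace a random subset of patches with low-resolution versions.

    Hypothetical helper: all parameter values are illustrative assumptions.
    image: float array of shape (H, W, C), with H and W divisible by
    patch_size, and patch_size divisible by factor.
    Returns the degraded image and a boolean mask over the patch grid.
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    rng = np.random.default_rng(seed)

    # Pick which patches are degraded (sampling without replacement).
    num_selected = int(round(ratio * gh * gw))
    chosen = rng.choice(gh * gw, size=num_selected, replace=False)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True
    mask = mask.reshape(gh, gw)

    out = image.copy()
    low = patch_size // factor
    for i, j in zip(*np.nonzero(mask)):
        ys, xs = i * patch_size, j * patch_size
        patch = image[ys:ys + patch_size, xs:xs + patch_size]
        # Assumed operator: average pooling down to (low, low, C).
        small = patch.reshape(low, factor, low, factor, c).mean(axis=(1, 3))
        # Nearest-neighbour upsampling so the patch keeps its original size.
        restored = small.repeat(factor, axis=0).repeat(factor, axis=1)
        out[ys:ys + patch_size, xs:xs + patch_size] = restored
    return out, mask
```

In a pre-training loop under these assumptions, `out` would be the encoder input and the original `image` the reconstruction target for the SR-style prediction head, with `mask` marking which patches carry only low-resolution clues.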
Taxonomy
Topics: Advanced Image Processing Techniques · Face recognition and analysis · Generative Adversarial Networks and Image Synthesis
