Masked Frequency Modeling for Self-Supervised Visual Pre-Training
Jiahao Xie, Wei Li, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy

TL;DR
This paper introduces Masked Frequency Modeling (MFM), a novel self-supervised pre-training method that masks and predicts frequency components of images, leading to improved visual representations without extra data or model complexity.
Contribution
MFM is the first method to apply frequency-domain masking to self-supervised visual pre-training, and it proves effective across various models and tasks.
Findings
MFM achieves competitive image classification and segmentation results.
MFM enhances robustness against various image corruptions.
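Predicting masked frequency components reveals underlying image patterns more effectively than predicting masked spatial patches, because images are heavily redundant in the spatial domain.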
Abstract
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens to the input embeddings in the spatial domain, in this paper, we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that predicting masked components in the frequency domain is more ideal to reveal underlying image patterns rather than predicting masked patches in the spatial domain, due to the heavy spatial redundancy. Our findings suggest that with the right configuration of mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among low-frequency counterparts are useful in…
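The mask-and-predict step described in the abstract can be illustrated with a minimal NumPy sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the circular low-/high-pass mask, the `radius` parameter, the function names, and the plain squared-error loss on the masked frequencies. The paper's actual mask configuration and objective may differ.

```python
import numpy as np


def circular_low_pass_mask(height, width, radius):
    """Boolean mask, True inside a centered circle (the low frequencies)."""
    ys, xs = np.ogrid[:height, :width]
    cy, cx = height // 2, width // 2
    return (ys - cy) ** 2 + (xs - cx) ** 2 <= radius ** 2


def mask_frequencies(image, radius, keep="low"):
    """Mask part of the spectrum of a 2-D grayscale image.

    Returns the corrupted image (model input), the full shifted spectrum
    (prediction target), and the boolean mask of frequencies that were kept.
    """
    # Shift so that the zero frequency sits at the center of the array.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    low = circular_low_pass_mask(image.shape[0], image.shape[1], radius)
    kept = low if keep == "low" else ~low
    # Zero out the masked frequencies and return to the spatial domain.
    corrupted = np.fft.ifft2(np.fft.ifftshift(spectrum * kept)).real
    return corrupted, spectrum, kept


def masked_frequency_loss(predicted_image, spectrum, kept):
    """Mean squared error on the masked (missing) frequencies only.

    Assumption: a simple squared error stands in for the training loss.
    """
    predicted_spectrum = np.fft.fftshift(np.fft.fft2(predicted_image))
    missing = ~kept
    diff = predicted_spectrum[missing] - spectrum[missing]
    return float(np.mean(np.abs(diff) ** 2))
```

In this sketch, `keep="low"` removes the high frequencies, so the model would have to recover fine structure, while `keep="high"` removes the low frequencies, so it would have to recover low-level statistics. A network would take `corrupted` as input, and its output would be scored with `masked_frequency_loss`.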
Taxonomy
Topics: Image Processing Techniques and Applications · Optical measurement and interference techniques · Domain Adaptation and Few-Shot Learning
