The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training
Hao Liu, Xinghua Jiang, Xin Li, Antai Guo, Deqiang Jiang, Bo Ren

TL;DR
This paper introduces Ge$^2$-AE, a novel self-supervised visual pre-training method that reconstructs images in both pixel and frequency domains using dual decoders, leading to more robust representations.
Contribution
It proposes the first MIM approach utilizing frequency domain reconstruction with geminated decoders for improved visual representation learning.
Findings
Enhanced downstream recognition performance.
Robustness of learned representations confirmed.
Effective in both quantitative and qualitative evaluations.
Abstract
The self-supervised Masked Image Modeling (MIM) schema, following the "mask-and-reconstruct" pipeline of recovering content from a masked image, has recently attracted increasing interest in the multimedia community, owing to its excellent ability to learn visual representations from unlabeled data. Aiming to learn representations with highly abstracted semantics, one group of works attempts to reconstruct non-semantic pixels with a large-ratio masking strategy, which may suffer from the "over-smoothing" problem, while others directly infuse semantics into the targets in an offline way that requires extra data. In contrast, we shift the perspective to the Fourier domain, which naturally has a global perspective, and present a new Masked Image Modeling (MIM) method, termed Geminated Gestalt Autoencoder (Ge$^2$-AE), for visual pre-training. Specifically, we equip our model with geminated decoders in charge of…
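The "global perspective" of the Fourier domain mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names `to_frequency` and `dual_domain_loss`, the `freq_weight` parameter, and the use of plain mean-squared error in both domains are assumptions made purely for illustration. The sketch shows that masking a small patch changes only a few pixels but perturbs coefficients across the whole spectrum, and how a pixel-domain error and a frequency-domain error might be summed into one objective.

```python
import numpy as np


def to_frequency(image):
    """2D discrete Fourier transform of an (H, W) image, orthonormal scaling."""
    return np.fft.fft2(image, norm="ortho")


def dual_domain_loss(pred_pixels, pred_spectrum, target, freq_weight=1.0):
    """Hypothetical combined loss: pixel MSE plus frequency-domain MSE.

    pred_pixels   : (H, W) real array, output of an assumed pixel decoder
    pred_spectrum : (H, W) complex array, output of an assumed frequency decoder
    target        : (H, W) real array, the original unmasked image
    freq_weight   : illustrative weighting, not a value taken from the paper
    """
    pixel_loss = np.mean((pred_pixels - target) ** 2)
    freq_loss = np.mean(np.abs(pred_spectrum - to_frequency(target)) ** 2)
    return pixel_loss + freq_weight * freq_loss


rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Zero out one 4x4 patch, mimicking a masked region.
masked = image.copy()
masked[:4, :4] = 0.0

# A local change in pixel space...
changed_pixels = np.count_nonzero(masked != image)

# ...shows up as a change spread over the frequency coefficients.
changed_coeffs = np.count_nonzero(
    ~np.isclose(to_frequency(masked), to_frequency(image))
)

# Loss of a perfect prediction versus a prediction that is just the masked image.
perfect_loss = dual_domain_loss(image, to_frequency(image), image)
masked_loss = dual_domain_loss(masked, to_frequency(masked), image)

print(changed_pixels, changed_coeffs)
print(perfect_loss, masked_loss)
```

With the orthonormal scaling used here, Parseval's theorem makes the unweighted frequency MSE equal to the pixel MSE for the same prediction, so in this toy setup the two terms differ only when the two decoders produce different outputs or when frequencies are weighted unevenly.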
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques
