Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders
Alexandre Eymaël, Renaud Vandeghen, Anthony Cioppa, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck

TL;DR
CropMAE is a self-supervised image pre-training method that trains on pairs of crops taken from the same image. It supports very high masking ratios and learns object-centric representations without video data or explicit motion cues.
Contribution
It proposes CropMAE, a new pre-training approach that reduces reliance on video datasets and explicit motion, while maintaining competitive performance and enabling higher masking ratios.
Findings
CropMAE achieves the highest masking ratio to date (98.5%).
It learns object-centric representations without explicit motion.
It reduces pre-training and learning time significantly.
Abstract
Self-supervised pre-training of image encoders is omnipresent in the literature, particularly following the introduction of Masked Autoencoders (MAE). Current efforts attempt to learn object-centric representations from motion in videos. In particular, SiamMAE recently introduced a Siamese network, training a shared-weight encoder from two frames of a video with a high asymmetric masking ratio (95%). In this work, we propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE. Our method differs specifically by considering only pairs of crops taken from the same image but cropped differently, rather than the conventional pairs of frames extracted from a video. CropMAE therefore alleviates the need for video datasets, while maintaining competitive performance and drastically reducing pre-training and learning time. Furthermore, we…
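The pairing and masking described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the fixed crop size of 224, the patch size of 16, and the choice to leave the first crop fully visible while masking the second are all assumptions made for the example. Real pipelines typically also rescale crops and apply further augmentations, which are omitted here.

```python
import numpy as np

def random_crop(image, crop_size, rng):
    """Cut a square crop of side crop_size at a random location (no resizing)."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    return image[top:top + crop_size, left:left + crop_size]

def patchify(image, patch_size):
    """Split an (H, W, C) image into rows of flattened patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group pixels by patch grid position
    return x.reshape(gh * gw, patch_size * patch_size * c)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches, True = hidden; at least one patch stays visible."""
    num_keep = max(1, int(num_patches * (1.0 - mask_ratio)))
    keep = rng.permutation(num_patches)[:num_keep]
    mask = np.ones(num_patches, dtype=bool)
    mask[keep] = False
    return mask

def make_pair(image, crop_size=224, patch_size=16, mask_ratio=0.985, seed=0):
    """Hypothetical helper: two crops of one image, the second heavily masked."""
    rng = np.random.default_rng(seed)
    # Assumption: the first crop is left unmasked (the "asymmetric" part).
    reference = patchify(random_crop(image, crop_size, rng), patch_size)
    target = patchify(random_crop(image, crop_size, rng), patch_size)
    mask = random_mask(target.shape[0], mask_ratio, rng)
    # Only the visible target patches would be passed to the encoder.
    return reference, target[~mask], mask

image = np.zeros((256, 256, 3), dtype=np.float32)  # placeholder image
reference, visible_target, mask = make_pair(image)
print(reference.shape, visible_target.shape, int(mask.sum()))
```

Under these illustrative settings each crop yields 196 patches, so a masking ratio of 0.985 with floor rounding leaves only 2 target patches visible. A shared-weight encoder would then process both crops, and a decoder would reconstruct the hidden target patches using the reference crop as context.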
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
