R-MAE: Regions Meet Masked Autoencoders
Duy-Kien Nguyen, Vaibhav Aggarwal, Yanghao Li, Martin R. Oswald, Alexander Kirillov, Cees G. M. Snoek, Xinlei Chen

TL;DR
R-MAE introduces a masked region autoencoding approach that enhances self-supervised image representation learning by focusing on regions instead of individual pixels, leading to improved performance in detection and segmentation tasks.
Contribution
The paper proposes a novel masked region autoencoding method integrated with MAE, effectively leveraging regions for self-supervised learning with minimal computational overhead.
Findings
Consistent improvements across multiple datasets and benchmarks.
Enhanced capabilities for interactive segmentation.
Negligible additional computational costs.
Abstract
In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at…
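The abstract describes masked region autoencoding only at a high level, so the following is a minimal illustrative sketch of the data preparation such a scheme implies, not the authors' implementation. It assumes regions are given as binary maps (one image, many regions), that the image and all of its regions share one random patch mask, and that the mask ratio is 0.75 as in MAE. The function names (`patchify`, `random_patch_mask`, `masked_region_inputs`) and the patch size of 16 are hypothetical choices made for this example.

```python
import numpy as np


def patchify(x, patch):
    """Split an (H, W) or (H, W, C) array into flattened, non-overlapping patches."""
    if x.ndim == 2:
        x = x[..., None]  # treat a binary region map as a 1-channel image
    h, w, c = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = x.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True means the patch is hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def masked_region_inputs(image, regions, patch=16, mask_ratio=0.75, seed=0):
    """Prepare visible inputs and reconstruction targets for one image and K regions.

    Assumption (not confirmed by the source): one mask is shared by the image
    and all K binary region maps, and the targets are the masked region patches.
    """
    rng = np.random.default_rng(seed)
    img_patches = patchify(image, patch)                            # (N, patch*patch*C)
    reg_patches = np.stack([patchify(r, patch) for r in regions])   # (K, N, patch*patch)
    mask = random_patch_mask(img_patches.shape[0], mask_ratio, rng)
    visible_img = img_patches[~mask]       # what a pixel encoder would see
    visible_reg = reg_patches[:, ~mask]    # what a region encoder would see
    target_reg = reg_patches[:, mask]      # region patches to be reconstructed
    return visible_img, visible_reg, target_reg, mask


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.random((64, 64, 3))
    regions = (rng.random((2, 64, 64)) > 0.5).astype(np.float32)  # two toy binary regions
    vis_img, vis_reg, tgt_reg, mask = masked_region_inputs(image, regions)
    print(vis_img.shape, vis_reg.shape, tgt_reg.shape, int(mask.sum()))
```

Stacking the K region maps along a leading axis is one simple way to represent the one-to-many mapping between an image and its regions; the paper's architecture for handling that mapping efficiently is not reproduced here.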
Peer Reviews
Decision: ICLR 2024 poster
1. The paper is highly clear in its presentation, effectively conveying the proposed methodology with its motivation.
2. The paper extends the traditional Masked Autoencoding (MAE) approach by considering regions as visual analogs of words. The concept of using regions for interactive segmentation is also original.
3. The proposed method can consistently help downstream performance on localization-related tasks (e.g., detection and segmentation).
1. The paper could benefit from a more extensive comparison with existing methods in the field. While it highlights the strengths of R-MAE, a more in-depth quantitative comparison with other state-of-the-art self-supervised learning techniques (based on MAE) would strengthen the paper.
- The paper is generally well-written with solid experimental results.
- The paper has a clear explanation of the proposed method with valid qualitative demonstration.
- Besides common transfer learning experiments, the authors also explore the usage of R-MAE for interactive segmentation.
- About the importance of regions:
  - As discussed in Sec. 2, the authors claim that there are many different sources from which to obtain regions. In other words, regions have no specific definition here, especially in the context of unsupervised learning. Put differently, the best definition of regions might differ across downstream tasks, while for object detection and semantic segmentation alone, SAM proposals might be the best.
  - One following question to explain is t
1. The overall paper is clear and easy to follow.
2. The analysis of different designs of regions is comprehensive.
3. This paper gives some suggestions for training with regions in MIM.
4. The authors will open-source the code and models.
1. Self-supervised contrastive learning needs to introduce the concept of regions to focus on local information because of its a priori assumption of image semantic consistency, but MAE does not have this problem. Moreover, I agree that the reconstruction of raw pixel values lacks a higher level of semantic information for image understanding compared to word reconstruction in NLP. However, I do not agree that the introduction of binary regions adds high-level semantics. Therefore, I argue this paper is the
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications
Methods: Masked autoencoder
