R-MAE: Regions Meet Masked Autoencoders

Duy-Kien Nguyen; Vaibhav Aggarwal; Yanghao Li; Martin R. Oswald; Alexander Kirillov; Cees G. M. Snoek; Xinlei Chen

arXiv:2306.05411 · cs.CV · January 8, 2024 · 1 citation

TL;DR

R-MAE introduces masked region autoencoding, which enhances self-supervised image representation learning by learning from regions (groups of pixels) in addition to MAE's pixel reconstruction, improving performance on detection and segmentation tasks.

Contribution

The paper proposes a novel masked region autoencoding method integrated with MAE, effectively leveraging regions for self-supervised learning with minimal computational overhead.

Findings

1. Consistent improvements across multiple datasets and benchmarks.
2. Enhanced capabilities for interactive segmentation.
3. Negligible additional computational costs.

Abstract

In this work, we explore regions as a potential visual analogue of words for self-supervised image representation learning. Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions. Specifically, we design an architecture which efficiently addresses the one-to-many mapping between images and regions, while being highly effective especially with high-quality regions. When integrated with MAE, our approach (R-MAE) demonstrates consistent improvements across various pre-training datasets and downstream detection and segmentation benchmarks, with negligible computational overheads. Beyond the quantitative evaluation, our analysis indicates the models pre-trained with masked region autoencoding unlock the potential for interactive segmentation. The code is provided at…
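
To make the idea in the abstract concrete, the sketch below illustrates masked region autoencoding on a single binary region map: the map is split into patches, a random subset of patches is hidden, and a reconstruction loss is computed only on the hidden patches. This is a minimal NumPy illustration, not the authors' implementation. The function names, the patch size, the 75% mask ratio, and the binary cross-entropy loss are all assumptions made for the example, and the constant prediction stands in for a real encoder-decoder.

```python
import numpy as np


def patchify(region_map, patch):
    """Split an (H, W) binary region map into (N, patch*patch) flattened patches."""
    h, w = region_map.shape
    assert h % patch == 0 and w % patch == 0, "map must tile evenly into patches"
    gh, gw = h // patch, w // patch
    x = region_map.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)


def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean vector that is True where a patch is hidden."""
    num_hidden = int(round(num_patches * mask_ratio))
    hidden = np.zeros(num_patches, dtype=bool)
    hidden[rng.permutation(num_patches)[:num_hidden]] = True
    return hidden


def masked_region_loss(pred, target, hidden):
    """Binary cross-entropy averaged over hidden patches only (assumed loss)."""
    eps = 1e-7
    p = np.clip(pred, eps, 1 - eps)
    bce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return bce[hidden].mean()


rng = np.random.default_rng(0)

# Hypothetical region: a 4x4 square in the middle of an 8x8 image.
region = np.zeros((8, 8), dtype=np.float32)
region[2:6, 2:6] = 1.0

patches = patchify(region, patch=4)                           # shape (4, 16)
hidden = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # 3 of 4 patches hidden

# Stand-in for a decoder output: an uninformed prediction of 0.5 everywhere.
pred = np.full_like(patches, 0.5)
loss = masked_region_loss(pred, patches, hidden)
```

In practice an image comes with many regions (the one-to-many mapping the abstract mentions), so a real model would handle a set of region maps per image; that part is omitted here.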

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The paper is highly clear in its presentation, effectively conveying the proposed methodology with its motivation.
2. The paper extends the traditional Masked Autoencoding (MAE) approach by considering regions as visual analogs of words. The concept of using regions for interactive segmentation is also original.
3. The proposed method can consistently help downstream performance on localization-related tasks (e.g., detection and segmentation).

Weaknesses

1. The paper could benefit from a more extensive comparison with existing methods in the field. While it highlights the strengths of R-MAE, a more in-depth quantitative comparison with other state-of-the-art self-supervised learning techniques (based on MAE) would strengthen the paper.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

- The paper is generally well-written with solid experimental results.
- The paper has a clear explanation of the proposed method with valid qualitative demonstration.
- Besides common transfer learning experiments, the authors also explore the usage of R-MAE for interactive segmentation.

Weaknesses

- About the importance of regions:
  - As discussed in Sec. 2, the authors claim that there are many different sources to obtain regions. In other words, regions do not have a specific definition here, especially in the context of unsupervised learning. Put differently, the best definition of regions might differ across downstream tasks, while for object detection and semantic segmentation alone, SAM proposals might be the best.
  - One following question to explain is t

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. The overall paper is clear and easy to follow.
2. The analysis for different designs of regions is comprehensive.
3. This paper gives some suggestions when training with regions in MIM.
4. The authors will open the source code and models.

Weaknesses

1. Self-supervised contrastive learning needs to introduce the concept of regions to focus on local information because of its a priori assumption of image semantic consistency, but MAE does not have this problem. Moreover, I agree that the reconstruction of raw pixel values lacks a higher level of semantic information for image understanding compared to word reconstruction in NLP. However, I do not agree that the introduction of binary regions adds high-level semantics. Therefore, I argue this paper is the

Code & Models

Repositories

facebookresearch/r-mae · PyTorch · Official

Videos

R-MAE: Regions Meet Masked Autoencoders · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Masked autoencoder