Refine and Represent: Region-to-Object Representation Learning
Akash Gokul, Konstantinos Kallidromitis, Shufan Li, Yusuke Kato, Kazuki Kozuka, Trevor Darrell, and Colorado J Reed

TL;DR
This paper introduces R2O, a unified self-supervised pretraining method that refines regions into object-centric masks, leading to state-of-the-art results in various dense prediction tasks and unsupervised object segmentation.
Contribution
R2O unifies region-based and object-centric pretraining through a dynamic refinement process and a curriculum, improving dense prediction and segmentation performance.
Findings
State-of-the-art semantic segmentation on PASCAL VOC and Cityscapes.
Improved instance segmentation on MS COCO.
Superior unsupervised object segmentation on the Caltech-UCSD Birds dataset.
Abstract
Recent works in self-supervised learning have demonstrated strong performance on scene-level dense prediction tasks by pretraining with object-centric or region-based correspondence objectives. In this paper, we present Region-to-Object Representation Learning (R2O) which unifies region-based and object-centric pretraining. R2O operates by training an encoder to dynamically refine region-based segments into object-centric masks and then jointly learns representations of the contents within the mask. R2O uses a "region refinement module" to group small image regions, generated using a region-level prior, into larger regions which tend to correspond to objects by clustering region-level features. As pretraining progresses, R2O follows a region-to-object curriculum which encourages learning region-level features early on and gradually progresses to train object-centric representations.…
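The refinement step and curriculum described in the abstract can be illustrated with a minimal sketch. It assumes the region-level features are clustered with plain k-means and that the curriculum is a linear decrease in the number of clusters; the abstract confirms neither detail, and the function names (`pool_region_features`, `refine_regions`, `clusters_at_step`) and default values are hypothetical, chosen only for illustration.

```python
import numpy as np

def pool_region_features(features, regions):
    """Average per-pixel features inside each region.
    features: (H, W, D) floats; regions: (H, W) ints with ids 0..R-1."""
    num_regions = int(regions.max()) + 1
    flat_feats = features.reshape(-1, features.shape[-1])
    flat_ids = regions.reshape(-1)
    sums = np.zeros((num_regions, flat_feats.shape[1]))
    np.add.at(sums, flat_ids, flat_feats)  # accumulate features per region id
    counts = np.bincount(flat_ids, minlength=num_regions)
    return sums / np.maximum(counts, 1)[:, None]

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means (an assumption); returns one cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def refine_regions(features, regions, k):
    """Group small regions into at most k larger masks by clustering
    their pooled features; returns an (H, W) map of cluster ids."""
    region_feats = pool_region_features(features, regions)
    k = min(k, len(region_feats))
    region_to_cluster = kmeans(region_feats, k)
    return region_to_cluster[regions]  # broadcast cluster id back to pixels

def clusters_at_step(step, total_steps, k_start=16, k_end=2):
    """Hypothetical linear curriculum: many small regions early in
    pretraining, few object-sized masks late."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(round(k_start + frac * (k_end - k_start)))

# Usage: a 4x4 grid of 2x2 regions stands in for the region-level prior.
features = np.random.default_rng(1).normal(size=(8, 8, 4))
regions = (np.arange(8)[:, None] // 2) * 4 + (np.arange(8)[None, :] // 2)
refined = refine_regions(features, regions, clusters_at_step(50, 100))
```

In this sketch every pixel of an input region lands in the same refined mask, so refinement only merges regions and never splits them, which mirrors the abstract's description of grouping small regions into larger ones.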
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
