Self-supervised Learning of Contextualized Local Visual Embeddings
Thalles Santos Silva, Helio Pedrini, Adín Ramírez Rivera

TL;DR
This paper introduces CLoVE, a self-supervised method that learns contextualized local visual embeddings with a normalized multi-head self-attention layer and reports state-of-the-art results for CNN-based architectures on four dense prediction tasks.
Contribution
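The paper proposes a self-supervised learning approach that optimizes a single loss over contextualized local embeddings. These embeddings are produced by a normalized multi-head self-attention layer applied to CNN output feature maps, and the resulting representations are reported to outperform existing CNN-based methods on dense prediction tasks.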
Findings
State-of-the-art performance in object detection
Superior results in instance segmentation
Effective in keypoint detection and dense pose estimation
Abstract
We present Contextualized Local Visual Embeddings (CLoVE), a self-supervised convolutional-based method that learns representations suited for dense prediction tasks. CLoVE deviates from current methods and optimizes a single loss function that operates at the level of contextualized local embeddings learned from output feature maps of convolutional neural network (CNN) encoders. To learn contextualized embeddings, CLoVE proposes a normalized multi-head self-attention layer that combines local features from different parts of an image based on similarity. We extensively benchmark CLoVE's pre-trained representations on multiple datasets. CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks, including object detection, instance segmentation, keypoint detection, and dense pose estimation.
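As a rough illustration of the idea of combining local features by similarity, the following sketch applies a multi-head self-attention step to a CNN feature map. It is not the authors' implementation. It assumes that "normalized" means L2-normalizing each local feature so that attention weights depend on cosine similarity, and it omits learned projections entirely. The function name `contextualize` and the `num_heads` and `temperature` parameters are hypothetical and chosen only for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def contextualize(feature_map, num_heads=4, temperature=0.1):
    """Mix local features of a (C, H, W) feature map by cosine similarity.

    Returns contextualized embeddings of shape (H*W, C) and the attention
    weights of shape (num_heads, H*W, H*W).
    Assumption: no learned query/key/value projections are used here.
    """
    c, h, w = feature_map.shape
    assert c % num_heads == 0, "channels must divide evenly across heads"

    # Flatten the spatial grid into N = H*W local feature vectors: (N, C).
    tokens = feature_map.reshape(c, h * w).T
    # Split channels into heads: (num_heads, N, C // num_heads).
    heads = tokens.reshape(h * w, num_heads, c // num_heads).transpose(1, 0, 2)

    # Cosine similarity between every pair of locations, per head.
    unit = l2_normalize(heads)
    similarity = unit @ unit.transpose(0, 2, 1) / temperature
    attention = softmax(similarity, axis=-1)

    # Each location becomes a similarity-weighted average of all locations.
    mixed = attention @ heads
    embeddings = mixed.transpose(1, 0, 2).reshape(h * w, c)
    return embeddings, attention


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.standard_normal((8, 3, 3))  # toy stand-in for encoder output
    emb, attn = contextualize(fmap)
    print(emb.shape)   # (9, 8)
    print(attn.shape)  # (4, 9, 9)
```

In this sketch each output embedding is a weighted average of the local features at all spatial positions, with larger weights for positions whose features point in a similar direction. How CLoVE parameterizes the layer and defines its loss is described in the paper itself, not here.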
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition · Digital Imaging for Blood Diseases
Methods: Convolution
