Representation Learning by Detecting Incorrect Location Embeddings
Sepehr Sameni, Simon Jenni, Paolo Favaro

TL;DR
This paper introduces DILEMMA, a self-supervised learning method that detects artificially misplaced object parts to improve image representation learning, leading to better performance on shape-dependent tasks.
Contribution
DILEMMA adds a self-supervised loss that trains a ViT to detect which image tokens have been paired with an incorrect positional embedding. Applied on top of existing SSL methods, it improves their performance and their robustness to occlusions, especially on shape-reliant tasks.
Findings
Improves MoCoV3, DINO, and SimCLR performance by 4.41%, 3.97%, and 0.5%, respectively, under the same training time.
Enhances fine-tuning results on ImageNet-100.
Significantly benefits shape-dependent downstream tasks.
Abstract
In this paper, we introduce a novel self-supervised learning (SSL) loss for image representation learning. There is a growing belief that generalization in deep neural networks is linked to their ability to discriminate object shapes. Since object shape is related to the location of its parts, we propose to detect those that have been artificially misplaced. We represent object parts with image tokens and train a ViT to detect which token has been combined with an incorrect positional embedding. We then introduce sparsity in the inputs to make the model more robust to occlusions and to speed up the training. We call our method DILEMMA, which stands for Detection of Incorrect Location EMbeddings with MAsked inputs. We apply DILEMMA to MoCoV3, DINO and SimCLR and show an improvement in their performance of respectively 4.41%, 3.97%, and 0.5% under the same training time and with a linear…
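The input construction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `make_dilemma_inputs`, the parameters `keep_ratio` and `corrupt_ratio`, their default values, and the uniform sampling of wrong positions are all assumptions made for this example. It shows how a sparse subset of tokens could be selected, how some of them could be paired with an incorrect position index, and how the per-token detection labels follow from that.

```python
import random


def make_dilemma_inputs(num_tokens, keep_ratio=0.5, corrupt_ratio=0.2, seed=None):
    """Build one training example for mispositioned-token detection.

    Hypothetical helper (not from the paper). Returns three lists:
      token_ids[i]    : index of the patch whose content sits in slot i
      position_ids[i] : index of the positional embedding paired with it
      labels[i]       : 1 if the position is incorrect, else 0
    """
    rng = random.Random(seed)

    # Sparsity: keep only a random subset of the patch tokens.
    num_keep = max(1, round(num_tokens * keep_ratio))
    token_ids = sorted(rng.sample(range(num_tokens), num_keep))

    # Start from the correct pairing of each token with its own position.
    position_ids = list(token_ids)

    # Assumption: a fixed fraction of the kept tokens is corrupted, and each
    # wrong position is drawn uniformly from all other positions.
    num_corrupt = round(num_keep * corrupt_ratio)
    for slot in rng.sample(range(num_keep), num_corrupt):
        wrong = [p for p in range(num_tokens) if p != token_ids[slot]]
        position_ids[slot] = rng.choice(wrong)

    # Per-token binary target for the detection loss.
    labels = [int(p != t) for p, t in zip(position_ids, token_ids)]
    return token_ids, position_ids, labels


if __name__ == "__main__":
    tokens, positions, labels = make_dilemma_inputs(196, seed=0)
    print(len(tokens), "tokens kept,", sum(labels), "with a wrong position")
```

In a full pipeline, `token_ids` would select patch embeddings, `position_ids` would select the positional embeddings added to them, and a per-token classifier on the ViT outputs would be trained against `labels`. Dropping tokens before the encoder is also what shortens the input sequence and hence the training time.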
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Vision Transformer · Masked autoencoder · 1x1 Convolution · Dense Connections
