An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers
Gokul Karthik Kumar, Sahal Shaji Mullappilly, Abhishek Singh Gehlot

TL;DR
This paper investigates self-supervised learning methods for training object detection transformers. It reports faster convergence for DETR in early training epochs, finds no similar improvement for Deformable DETR in multi-task learning, and explores approaches such as image reconstruction and jigsaw puzzles.
Contribution
It introduces self-supervised training approaches tailored for object detection transformers using CNN feature maps, an area not extensively studied before.
Findings
Faster convergence of DETR in early training epochs
No similar improvement observed with Deformable DETR in multi-task learning
Exploration of multiple self-supervised methods for object detection transformers
Abstract
Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. Follow-up research adapted similar methods, such as masked image modeling, to vision transformers and demonstrated improvements in image classification. Such simple self-supervised methods have not been exhaustively studied for object detection transformers (DETR, Deformable DETR), because their transformer encoder modules take input in the convolutional neural network (CNN) extracted feature space rather than in the image space, as general vision transformers do. However, the CNN feature maps still preserve spatial relationships, and we utilize this property to design self-supervised learning approaches to train the encoder of object detection transformers in pretraining and multi-task learning…
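The idea of a reconstruction task in CNN feature space can be illustrated with a small sketch. This is not the authors' implementation: the function names `mask_feature_map` and `reconstruction_loss`, the 25% mask ratio, the zero fill value, and the mean-squared-error objective are all assumptions chosen for illustration. The sketch hides random spatial positions of a C x H x W feature map and scores a prediction only at the hidden positions, which is only meaningful because the feature map keeps its spatial layout.

```python
import random


def mask_feature_map(features, mask_ratio=0.25, seed=0):
    """Zero out random spatial positions of a C x H x W feature map.

    Hypothetical helper for illustration. Returns (masked, mask), where
    mask[y][x] is True for every hidden spatial position.
    """
    channels = len(features)
    height = len(features[0])
    width = len(features[0][0])
    rng = random.Random(seed)
    positions = [(y, x) for y in range(height) for x in range(width)]
    num_hidden = int(round(mask_ratio * len(positions)))
    hidden = set(rng.sample(positions, num_hidden))
    mask = [[(y, x) in hidden for x in range(width)] for y in range(height)]
    # The same spatial mask is applied to every channel.
    masked = [
        [[0.0 if mask[y][x] else features[c][y][x] for x in range(width)]
         for y in range(height)]
        for c in range(channels)
    ]
    return masked, mask


def reconstruction_loss(predicted, target, mask):
    """Mean squared error computed over hidden positions only (assumed objective)."""
    total, count = 0.0, 0
    for c in range(len(target)):
        for y in range(len(mask)):
            for x in range(len(mask[0])):
                if mask[y][x]:
                    total += (predicted[c][y][x] - target[c][y][x]) ** 2
                    count += 1
    return total / count if count else 0.0


# Toy 2 x 4 x 4 "feature map" standing in for a CNN backbone output.
features = [
    [[float(c * 100 + y * 10 + x) for x in range(4)] for y in range(4)]
    for c in range(2)
]
masked, mask = mask_feature_map(features, mask_ratio=0.25, seed=0)
```

In a real setup the masked map would be fed to the transformer encoder and its output compared against the original features; here the loss is zero for a perfect reconstruction and positive when the masked map itself is used as the prediction.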
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Adam · Byte Pair Encoding · Absolute Position Encodings · Label Smoothing · Convolution · Dropout · Layer Normalization · Softmax
