Towards Data-Efficient Detection Transformers
Wen Wang, Jing Zhang, Yang Cao, Yongliang Shen, Dacheng Tao

TL;DR
This paper shows that detection transformers are data-inefficient on small datasets and improves them with two simple changes: sparse feature sampling from local image regions in the cross-attention layer, and a label augmentation method.
Contribution
It introduces a minimal modification to detection transformers' cross-attention mechanism and a label augmentation method to enhance data efficiency on small datasets.
Findings
Improved detection transformer performance on small datasets.
Simple, effective modifications that apply to various detection transformer models.
Enhanced data efficiency with minimal changes.
Abstract
Detection Transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. Besides, we introduce a simple yet effective label augmentation method to provide richer…
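The abstract's central idea, changing how key and value sequences are built for cross-attention, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the box format, the grid size, and the nearest-neighbour sampling are all assumptions chosen to keep the example dependency-free. It contrasts a DETR-style dense key/value sequence (every spatial location) with a hypothetical sparse one sampled from inside a local box.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def dense_keys(feature_map):
    """Dense construction: every spatial location becomes a key/value token."""
    return [feat for row in feature_map for feat in row]


def local_sparse_keys(feature_map, box, grid=2):
    """Hypothetical local construction: sample grid x grid points inside `box`.

    `box` is (x0, y0, x1, y1) in feature-map coordinates. Nearest-neighbour
    sampling is an assumption made for simplicity, not the paper's method.
    """
    height, width = len(feature_map), len(feature_map[0])
    x0, y0, x1, y1 = box
    tokens = []
    for i in range(grid):
        for j in range(grid):
            # Sample at the centre of each grid cell inside the box.
            y = y0 + (i + 0.5) * (y1 - y0) / grid
            x = x0 + (j + 0.5) * (x1 - x0) / grid
            row = min(max(int(y), 0), height - 1)
            col = min(max(int(x), 0), width - 1)
            tokens.append(feature_map[row][col])
    return tokens


# Toy 8 x 8 feature map with 4 channels per location.
feature_map = [[[float(r), float(c), 1.0, 0.0] for c in range(8)] for r in range(8)]
query = [0.1, 0.2, 0.3, 0.4]

dense = dense_keys(feature_map)
local = local_sparse_keys(feature_map, box=(2, 2, 6, 6), grid=2)

out_dense = attend(query, dense, dense)  # attends over all 64 locations
out_local = attend(query, local, local)  # attends over 4 local samples
print(len(dense), len(local))
```

In this sketch the attention function is unchanged between the two calls; only the key/value sequence differs, which mirrors the abstract's claim that the change requires few modifications to the original model.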
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Remote-Sensing Image Classification
