DETR Doesn't Need Multi-Scale or Locality Design
Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu

TL;DR
This paper introduces a simplified DETR object detector that forgoes multi-scale features and locality biases, yet achieves competitive accuracy through a box-to-pixel relative position bias (BoxRPB) and masked image modeling (MIM)-based backbone pre-training.
Contribution
It demonstrates that a plain single-scale DETR can be competitive with state-of-the-art detectors when BoxRPB is added to the cross-attention and the backbone is pre-trained with MIM.
Findings
Achieved 63.9 mAP with a Swin-L backbone after pre-training on Object365.
Two simple techniques, BoxRPB and MIM-based backbone pre-training, compensate for the lack of multi-scale feature maps and locality constraints.
Plain DETR rivals complex multi-scale detectors in accuracy.
Abstract
This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for…
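The abstract describes BoxRPB only as a bias term added to the cross-attention formulation. The following is a minimal sketch of that idea, not the authors' implementation: a single-head global cross-attention over a flattened single-scale feature map, where each logit receives an extra term that depends on the pixel's position relative to the query's box. The function `toy_box_bias`, its `scale` parameter, and the hand-coded "zero inside the box, negative outside" rule are hypothetical stand-ins for whatever learned bias the paper uses.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_box_bias(box, px, py, scale=4.0):
    # Hypothetical stand-in for a learned bias (an assumption, not from the paper):
    # 0 for pixels inside the box, increasingly negative with distance outside it.
    x1, y1, x2, y2 = box
    dx = max(x1 - px, 0.0, px - x2)  # horizontal distance outside the box
    dy = max(y1 - py, 0.0, py - y2)  # vertical distance outside the box
    return -scale * (dx + dy)

def cross_attention_with_box_bias(query, keys, values, box, width):
    # Global attention: every pixel of the single-scale map is a key.
    d = len(query)
    scores = []
    for idx, key in enumerate(keys):
        px, py = idx % width, idx // width
        content = sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        scores.append(content + toy_box_bias(box, px, py))
    weights = softmax(scores)
    channels = len(values[0])
    out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(channels)]
    return out, weights

# 4x4 feature map with identical keys, so only the bias shapes the attention.
width = height = 4
keys = [[1.0, 0.0] for _ in range(width * height)]
values = [[float(i)] for i in range(width * height)]
query = [0.5, 0.5]
box = (1, 1, 2, 2)  # x1, y1, x2, y2 in pixel coordinates
output, weights = cross_attention_with_box_bias(query, keys, values, box, width)
inside = sum(weights[y * width + x] for y in (1, 2) for x in (1, 2))
```

Because the keys are identical in this toy setup, the content term is the same for every pixel and the attention mass concentrates on the four pixels inside the box purely through the bias. The attention itself stays global: no pixel is masked out, which is the sense in which such a bias guides a query toward its object region without imposing a hard locality constraint.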
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Label Smoothing · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Feedforward Network
