DETR Doesn't Need Multi-Scale or Locality Design

Yutong Lin; Yuhui Yuan; Zheng Zhang; Chen Li; Nanning Zheng; Han Hu

arXiv:2308.01904 · cs.CV · August 4, 2023 · 1 citation

TL;DR

This paper introduces a simplified DETR object detector that forgoes multi-scale feature maps and locality constraints, yet achieves competitive accuracy by adding a box-to-pixel relative position bias (BoxRPB) to cross-attention and pre-training the backbone with masked image modeling (MIM).

Contribution

It demonstrates that a plain DETR, using a single-scale feature map and global cross-attention, can be competitive with state-of-the-art multi-scale detectors when combined with BoxRPB and MIM-based backbone pre-training.

Findings

01. Achieved 63.9 mAP with a Swin-L backbone when using Object365 pre-training.

02. BoxRPB and MIM-based pre-training effectively compensate for the lack of multi-scale feature maps and locality constraints.

03. Plain DETR rivals complex multi-scale detectors in accuracy.

Abstract

This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which well guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training which helps learn representation with fine-grained localization ability and proves crucial for…
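
The idea of adding a box-dependent bias to the cross-attention logits can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract only states that a bias term is added to the cross-attention formulation, so everything else here is an assumption made for illustration. In particular, `toy_bias` is a hypothetical hand-written penalty standing in for whatever learned function the paper uses, and the helper names, the single attention head, the logit scaling, and the `(x1, y1, x2, y2)` box format are all choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def box_to_pixel_offsets(boxes, height, width):
    """Offsets from every pixel center to the two corners of every box.

    boxes: (K, 4) array of (x1, y1, x2, y2) in feature-map pixel units.
    Returns an array of shape (K, H, W, 4).
    """
    ys, xs = np.meshgrid(np.arange(height) + 0.5,
                         np.arange(width) + 0.5, indexing="ij")
    x1, y1, x2, y2 = (boxes[:, i][:, None, None] for i in range(4))
    return np.stack([xs - x1, ys - y1, xs - x2, ys - y2], axis=-1)

def toy_bias(offsets, penalty=4.0):
    # Hypothetical stand-in for a learned bias: zero for pixels inside
    # the box, a constant negative penalty for pixels outside it.
    dx1, dy1, dx2, dy2 = np.moveaxis(offsets, -1, 0)
    inside = (dx1 >= 0) & (dy1 >= 0) & (dx2 <= 0) & (dy2 <= 0)
    return np.where(inside, 0.0, -penalty)

def cross_attention(queries, keys, values, bias):
    # queries: (K, C); keys, values: (H*W, C); bias: (K, H*W).
    # Attention stays global: every query still sees every pixel,
    # the bias only shifts the logits.
    logits = queries @ keys.T / np.sqrt(queries.shape[-1]) + bias
    weights = softmax(logits, axis=-1)
    return weights @ values, weights

# Usage: one query with one box on an 8x8 single-scale feature map.
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 1
queries = rng.normal(size=(K, C))
keys = rng.normal(size=(H * W, C))
values = rng.normal(size=(H * W, C))
boxes = np.array([[2.0, 2.0, 6.0, 6.0]])

offsets = box_to_pixel_offsets(boxes, H, W)
bias = toy_bias(offsets).reshape(K, H * W)
inside = bias == 0.0

_, w_plain = cross_attention(queries, keys, values, np.zeros_like(bias))
_, w_bias = cross_attention(queries, keys, values, bias)
mass_plain = w_plain[inside].sum()
mass_bias = w_bias[inside].sum()
print(f"attention mass inside box: {mass_plain:.3f} -> {mass_bias:.3f}")
```

Because the bias is additive rather than a hard mask, pixels outside the box keep a nonzero attention weight. That matches the abstract's framing of guiding each query toward its object region without imposing a strict locality constraint.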

Code & Models

Repositories

impiga/plain-detr (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Adam · Dense Connections · Label Smoothing · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Feedforward Network