Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection
Manyuan Zhang, Guanglu Song, Yu Liu, Hongsheng Li

TL;DR
This paper introduces SD-DETR (spatially decoupled DETR), which separates localization from classification in the DETR decoder to address the spatial misalignment between the two tasks, improving Conditional DETR by 4.5 AP on MSCOCO.
Contribution
The paper proposes a task-aware query generation module and a disentangled feature learning scheme that decouple localization and classification in DETR, improving detection accuracy.
Findings
Improves Conditional DETR by 4.5 AP on MSCOCO
Effectively reduces task misalignment between classification and localization
Demonstrates significant performance gains over previous DETR variants
Abstract
The introduction of DETR represents a new paradigm for object detection. However, its decoder conducts classification and box localization using shared queries and cross-attention layers, leading to suboptimal results. We observe that different regions of interest in the visual feature map are suitable for performing query classification and box localization tasks, even for the same object. Salient regions provide vital information for classification, while the boundaries around them are more favorable for box regression. Unfortunately, such spatial misalignment between these two tasks greatly hinders DETR's training. Therefore, in this work, we focus on decoupling localization and classification tasks in DETR. To achieve this, we introduce a new design scheme called spatially decoupled DETR (SD-DETR), which includes a task-aware query generation module and a disentangled feature…
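The spatial misalignment described in the abstract can be illustrated with a toy example. The sketch below is not the paper's task-aware query generation module or its disentangled feature learning scheme, neither of which is specified in the text above. It only shows, under simplifying assumptions, how two separate queries can attend to different regions of the same feature map. The 1-D feature map, the two-channel encoding (channel 0 for saliency, channel 1 for boundaries), and the names `cls_query` and `loc_query` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one query over all locations."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return attended, weights


# Hypothetical 1-D feature map with five locations for a single object.
# Channel 0 is high at the salient center, channel 1 is high at the boundaries.
features = [
    [0.0, 4.0],  # left boundary
    [0.0, 0.0],
    [4.0, 0.0],  # salient center
    [0.0, 0.0],
    [0.0, 4.0],  # right boundary
]

# Assumption: one query per task instead of a single shared query.
cls_query = [1.0, 0.0]  # responds to the saliency channel
loc_query = [0.0, 1.0]  # responds to the boundary channel

cls_feat, cls_attn = cross_attention(cls_query, features, features)
loc_feat, loc_attn = cross_attention(loc_query, features, features)

print("classification attention:", [round(w, 3) for w in cls_attn])
print("localization attention:  ", [round(w, 3) for w in loc_attn])
```

In this toy setting the classification query concentrates its attention on the center location, while the localization query splits its attention between the two boundary locations. A single shared query would have to compromise between the two patterns.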
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Byte Pair Encoding · Dropout · Layer Normalization
