Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang

TL;DR
Align-DETR introduces a novel aligned loss function to address misalignments in DETR, significantly improving convergence and detection accuracy in end-to-end object detection tasks.
Contribution
The paper proposes Align Loss to resolve classification-regression and cross-layer misalignments, enhancing DETR's performance and robustness with a joint quality metric and intermediate layer supervision.
Findings
Achieves 50.5% AP in the 1x setting, surpassing previous methods.
Sets a new state of the art in object detection.
Improves convergence stability and detection accuracy.
Abstract
DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. Despite its notable advancements, this paper identifies two key forms of misalignment within the model: classification-regression misalignment and cross-layer target misalignment. Both issues impede DETR's convergence and degrade its overall performance. To tackle both issues simultaneously, we introduce a novel loss function, termed as Align Loss, designed to resolve the discrepancy between the two tasks. Align Loss guides the optimization of DETR through a joint quality metric, strengthening the connection between classification and regression. Furthermore, it incorporates an exponential down-weighting term to facilitate a smooth transition from positive to negative samples. Align-DETR also employs many-to-one matching for supervision of…
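The abstract describes two ingredients: a joint quality metric that ties classification to regression, and an exponential down-weighting term that smooths the transition from positive to negative samples. The sketch below is one plausible reading of those two ideas, not the paper's confirmed formulation. The geometric blend of score and IoU, the rank-based decay, and the names `alpha`, `tau`, `joint_quality`, `aligned_targets`, and `aligned_loss` are all assumptions made for illustration.

```python
import math

def joint_quality(score, iou, alpha=0.25):
    """Assumed joint metric: geometric blend of class score and box IoU."""
    return (score ** alpha) * (iou ** (1.0 - alpha))

def soft_bce(score, target, eps=1e-7):
    """Binary cross-entropy against a soft target in [0, 1]."""
    s = min(max(score, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(target * math.log(s) + (1.0 - target) * math.log(1.0 - s))

def aligned_targets(scores, ious, alpha=0.25, tau=1.5):
    """Soft classification targets for candidates matched to one ground-truth box.

    Candidates are ranked by joint quality (best first). The rank-r candidate
    has its quality scaled by exp(-r / tau), so lower-ranked candidates fade
    toward a negative (zero) target instead of being cut off abruptly.
    This decay form is an assumption, not taken from the source text.
    """
    qualities = [joint_quality(s, u, alpha) for s, u in zip(scores, ious)]
    order = sorted(range(len(qualities)), key=lambda i: qualities[i], reverse=True)
    targets = [0.0] * len(qualities)
    for rank, i in enumerate(order):
        targets[i] = math.exp(-rank / tau) * qualities[i]
    return targets

def aligned_loss(scores, ious, alpha=0.25, tau=1.5):
    """Sum of soft-target BCE terms over the matched candidates."""
    targets = aligned_targets(scores, ious, alpha, tau)
    return sum(soft_bce(s, t) for s, t in zip(scores, targets))

# Three hypothetical queries matched to the same ground-truth box.
scores = [0.9, 0.6, 0.2]   # predicted class probabilities
ious = [0.8, 0.7, 0.3]     # IoU of each predicted box with the ground truth
targets = aligned_targets(scores, ious)
loss = aligned_loss(scores, ious)
```

In this reading, the top-ranked candidate keeps its full joint quality as the classification target, so a confident score is only rewarded when the box is also well localized. A larger `tau` keeps more of the lower-ranked candidates as near-positives, while a smaller `tau` approaches one-to-one supervision.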
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Vision Transformer · Linear Layer · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Absolute Position Encodings · Residual Connection
