Detection Transformer with Stable Matching
Shilong Liu, Tianhe Ren, Jiayu Chen, Zhaoyang Zeng, Hao Zhang, Feng Li, Hongyang Li, Jun Huang, Hang Su, Jun Zhu, Lei Zhang

TL;DR
This paper identifies the cause of unstable matching in DETR models and proposes simple modifications using positional metrics, leading to significant performance improvements on COCO detection benchmarks.
Contribution
The paper introduces position-supervised loss and position-modulated cost to stabilize matching in DETR, enhancing detection accuracy across various models.
Findings
Achieves 50.4 AP on COCO with ResNet-50 in 12 epochs
Achieves 51.5 AP on COCO with ResNet-50 in 24 epochs, reported as a new record for this setting
Attains 63.8 AP on COCO test-dev with Swin-Large backbone
Abstract
This paper is concerned with the matching stability problem across different decoder layers in DEtection TRansformers (DETR). We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR. To address this problem, we show that the most important design is to use and only use positional metrics (like IOU) to supervise classification scores of positive examples. Under the principle, we propose two simple yet effective modifications by integrating positional metrics to DETR's classification loss and matching cost, named position-supervised loss and position-modulated cost. We verify our methods on several DETR variants. Our methods show consistent improvements over baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50…
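The principle stated in the abstract, supervising the classification scores of positive examples with a positional metric such as IoU, can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names (`box_iou`, `position_supervised_loss`, `position_modulated_cost`), the use of plain binary cross-entropy with the IoU as a soft target, and the `1 - prob * iou` form of the cost are hypothetical and should not be read as the paper's definitions.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bce(p, target, eps=1e-7):
    """Binary cross-entropy for one probability and a (possibly soft) target."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def position_supervised_loss(probs, pred_boxes, gt_boxes, matches):
    """Illustrative loss (assumed form, not the paper's exact definition).

    probs:      foreground probability of each query
    pred_boxes: predicted box of each query
    gt_boxes:   ground-truth boxes
    matches:    dict mapping a matched query index to a ground-truth index
    """
    total = 0.0
    for q, p in enumerate(probs):
        if q in matches:
            # Positive query: the target is the IoU with its matched box,
            # so the score is pushed toward localization quality.
            target = box_iou(pred_boxes[q], gt_boxes[matches[q]])
        else:
            # Unmatched query: ordinary negative target.
            target = 0.0
        total += bce(p, target)
    return total / max(len(matches), 1)


def position_modulated_cost(prob, iou):
    """Hypothetical matching cost: small only when score and IoU are both high."""
    return 1.0 - prob * iou


if __name__ == "__main__":
    gt = [(0.0, 0.0, 2.0, 2.0)]
    preds = [(0.0, 0.0, 2.0, 2.0), (5.0, 5.0, 6.0, 6.0)]
    print(position_supervised_loss([0.9, 0.1], preds, gt, {0: 0}))
    print(position_modulated_cost(0.9, box_iou(preds[0], gt[0])))
```

With a soft IoU target, a confident score on a poorly localized box is penalized, which is one way to read the abstract's claim that positional metrics should drive the classification scores of positives. The same signal enters the matching cost in this sketch, so a query is preferred only when its score and its box agree.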
Taxonomy
Topics: Digital Media Forensic Detection · Wireless Signal Modulation Classification · Anomaly Detection Techniques and Applications
Methods: Multi-Head Attention · Attention Is All You Need · Vision Transformer · Dropout · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Softmax · Linear Layer · Byte Pair Encoding · Layer Normalization
