Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion
Gongjie Zhang, Zhipeng Luo, Jiaxing Huang, Shijian Lu, Eric P. Xing

TL;DR
This paper introduces SAM-DETR++, a plug-and-play module that aligns semantics between object queries and image features, significantly accelerating DETR's convergence and enhancing multi-scale feature fusion for improved object detection performance.
Contribution
SAM-DETR++ is a novel semantic-aligned matching module that improves DETR's convergence speed and detection accuracy by aligning feature semantics and effectively fusing multi-scale features.
Findings
Achieves 44.8% AP on COCO with only 12 training epochs.
Attains 49.1% AP after 50 epochs on COCO.
Outperforms existing DETR variants in convergence speed and accuracy.
Abstract
The recently proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection. However, DETR suffers from slow training convergence, which hinders its applicability to various detection tasks. We observe that DETR's slow convergence is largely attributed to the difficulty in matching object queries to relevant regions due to the unaligned semantics between object queries and encoded image features. With this observation, we design Semantic-Aligned-Matching DETR++ (SAM-DETR++) to accelerate DETR's convergence and improve detection performance. The core of SAM-DETR++ is a plug-and-play module that projects object queries and encoded image features into the same feature embedding space, where each object query can be easily matched to relevant regions with similar semantics. Besides, SAM-DETR++ searches for multiple representative keypoints and…
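The matching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the projection matrices `W_QUERY` and `W_FEATURE`, the helper names, and the 2-D toy vectors are all hypothetical, and the sketch only shows how dot-product attention picks out the right region once queries and features share one embedding space.

```python
import math

def project(vectors, weight):
    """Apply a linear map (d_in x d_out matrix) to each row vector."""
    d_out = len(weight[0])
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(d_out)]
        for v in vectors
    ]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, features):
    """Scaled dot-product attention of every query over every feature location."""
    scale = math.sqrt(len(queries[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, f)) / scale for f in features])
        for q in queries
    ]

def best_match(weights):
    """Index of the feature location each query attends to most."""
    return [row.index(max(row)) for row in weights]

# Hypothetical toy data: three image regions and two object queries.
raw_features = [[2.0, 0.0], [0.0, 2.0], [0.1, 0.1]]
raw_queries = [[0.0, 2.0], [2.0, 0.0]]  # same semantics, but in a swapped basis

# Assumed projections into a shared space (illustrative only).
W_FEATURE = [[1.0, 0.0], [0.0, 1.0]]  # identity
W_QUERY = [[0.0, 1.0], [1.0, 0.0]]    # swaps the two coordinates

unaligned = best_match(attention_weights(raw_queries, raw_features))
aligned = best_match(
    attention_weights(project(raw_queries, W_QUERY), project(raw_features, W_FEATURE))
)
print("unaligned matches:", unaligned)
print("aligned matches:", aligned)
```

In this toy setup the unaligned queries attend to the wrong regions, while the projected queries attend to the regions that carry the same semantics, which is the intuition behind matching in a common embedding space.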
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Label Smoothing
