Efficient Decoder-free Object Detection with Transformers
Peixian Chen, Mengdan Zhang, Yunhang Shen, Kekai Sheng, Yuting Gao, Xing Sun, Ke Li, Chunhua Shen

TL;DR
This paper introduces a decoder-free, transformer-based object detector that simplifies the detection pipeline and outperforms DETR and RetinaNet in both accuracy and computational cost on the MS COCO benchmark.
Contribution
The paper proposes a decoder-free, encoder-only transformer architecture for object detection, reducing training time and computational cost while maintaining high accuracy.
Findings
Outperforms DETR by 2.5% AP with 28% less computation and over 10x fewer training epochs.
Achieves a gain of over 5.5% AP over RetinaNet while using 70% less computation.
Demonstrates high efficiency and accuracy on the MS COCO benchmark.
Abstract
Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural usage of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective, at the price of a considerable computational burden at inference. A more subtle usage is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that takes an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, achieving high efficiency in both training and inference stages for the first time. We simplify object detection into an encoder-only single-level anchor-based dense prediction problem by centering…
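To make "single-level anchor-based dense prediction" concrete, here is a minimal sketch of how anchors can be laid out over one feature map, so that every cell predicts boxes relative to a fixed set of anchors. It is a generic illustration, not the paper's implementation: the function name `single_level_anchors`, the stride of 32, and the anchor sizes and aspect ratios are all assumptions chosen for the example.

```python
from itertools import product


def single_level_anchors(feat_h, feat_w, stride, sizes=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors for every cell of ONE feature map.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Centre of this feature-map cell in input-image pixels.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for size, ratio in product(sizes, ratios):
            # ratio = height / width; the anchor area stays at size ** 2.
            w = size * ratio ** -0.5
            h = size * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = single_level_anchors(feat_h=4, feat_w=6, stride=32)
print(len(anchors))  # 4 * 6 cells * 6 anchors per cell = 144
```

Because only one feature level is used, the anchor count grows with a single grid rather than with a pyramid of grids, which is one reason a single-level design can be cheaper than a multi-level one.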
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Label Smoothing · Softmax · Absolute Position Encodings · Dropout · Adam · Residual Connection
