Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
Byungseok Roh, JaeWoong Shin, Wuhyun Shin, Saehoon Kim

TL;DR
Sparse DETR introduces a selective token update mechanism in transformer-based object detection, significantly reducing computation and increasing speed while maintaining or improving detection performance.
Contribution
It proposes a novel sparse token updating strategy in transformer encoders for object detection, improving efficiency without sacrificing accuracy.
Findings
Achieves better performance than Deformable DETR while using only 10% of the encoder tokens.
Reduces total computation cost by 38% relative to Deformable DETR.
Increases FPS by 42% relative to Deformable DETR.
Abstract
DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses multi-scale features to ameliorate performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiment, we observe that the detection performance hardly deteriorates even if only a part of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR that selectively updates only the tokens expected to be referenced by the decoder, thus help…
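The selective update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `select_top_rho`, `sparse_update`, `rho`, and `update_fn` are hypothetical, the per-token saliency scores are assumed to be given (how they are obtained is outside this sketch), and `update_fn` is a stand-in for whatever an encoder layer would compute for one token.

```python
from typing import Callable, List, Sequence

Token = List[float]


def select_top_rho(scores: Sequence[float], rho: float) -> List[int]:
    """Return the indices of the highest-scoring tokens, in ascending index order.

    `rho` is the fraction of tokens to keep (assumed name); at least one
    token is always kept.
    """
    n = len(scores)
    k = max(1, int(rho * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_update(
    tokens: Sequence[Token],
    scores: Sequence[float],
    rho: float,
    update_fn: Callable[[Token, Sequence[Token]], Token],
) -> List[Token]:
    """Update only the selected tokens; copy all other tokens through unchanged.

    `update_fn(token, all_tokens)` is a placeholder for an encoder layer:
    the selected token acts as the query and may read from every token.
    """
    keep = select_top_rho(scores, rho)
    out = [list(t) for t in tokens]  # unselected tokens pass through as-is
    for i in keep:
        out[i] = update_fn(tokens[i], tokens)
    return out
```

With four tokens, scores `[0.1, 0.9, 0.3, 0.8]`, and `rho = 0.5`, only the tokens at indices 1 and 3 are passed to `update_fn`; the other two are returned unchanged. The cost of the update step then scales with the number of selected tokens rather than with the full token count, which is the intuition behind the reported savings.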
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Deformable Attention Module · Label Smoothing · Softmax · Convolution · Residual Connection · Feedforward Network
