Box-DETR: Understanding and Boxing Conditional Spatial Queries
Wenze Liu, Hao Lu, Yuliang Liu, Zhiguo Cao

TL;DR
This paper introduces Box-DETR, which improves the conditional spatial queries in DETR by replacing the box center with head-specific box agent points as the cross-attention reference, leading to faster convergence and improved detection accuracy.
Contribution
The paper proposes Box Agent, which brings full box information (center, width, and height) into cross-attention and significantly improves DETR's performance with minimal computational overhead.
Findings
Box Agent speeds up the convergence of DETR-style object detectors.
Using the whole box, rather than only its center, as the reference improves detection accuracy.
Box-DETR achieves 44.2 AP with a single-scale model and a ResNet-50 backbone.
Abstract
Conditional spatial queries were recently introduced into the DEtection TRansformer (DETR) to accelerate convergence. In DAB-DETR, such queries are modulated by the so-called conditional linear projection at each decoder stage, aiming to search for positions of interest such as the four extremities of the box. Each decoder stage progressively updates the box by predicting the anchor box offsets, while in cross-attention only the box center is supplied as the reference point. Using only the box center, however, leaves the width and height of the previous box unknown to the current stage, which hinders accurate prediction of offsets. We argue that the explicit use of the entire box information in cross-attention matters. In this work, we propose Box Agent to condense the box into head-specific agent points. By replacing the box center with the agent point as the reference point in each head,…
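The idea of condensing a box into one reference point per attention head can be illustrated with a minimal sketch. It assumes a box in normalized (cx, cy, w, h) form and a hypothetical list of per-head offsets given as fractions of the box width and height. The abstract does not spell out how Box Agent computes its agent points, so the function name box_agent_points, the fixed offsets, and the clamping step are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (cx, cy, w, h), normalized to [0, 1]
Point = Tuple[float, float]              # (x, y), normalized to [0, 1]


def box_agent_points(box: Box, head_offsets: List[Point]) -> List[Point]:
    """Return one reference point per attention head (illustrative only).

    Each offset (dx, dy) is a fraction of the box size, so (0, 0) is the
    box center and (-0.5, 0) is the midpoint of the left edge. In a real
    detector these offsets would be learned; here they are fixed inputs.
    """
    cx, cy, w, h = box
    points = []
    for dx, dy in head_offsets:
        # Scale the offset by the box size so the point encodes width/height.
        x = cx + dx * w
        y = cy + dy * h
        # Keep the point inside the normalized image plane (an assumption).
        points.append((min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)))
    return points


# Hypothetical example: a center-only head plus four heads on the box edges.
box = (0.5, 0.5, 0.2, 0.4)
offsets = [(0.0, 0.0), (-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
agent_points = box_agent_points(box, offsets)
print(agent_points)
```

With offsets of plus or minus 0.5 the points fall on the midpoints of the box edges, so the per-head references jointly carry the width and height that a center-only reference point leaves out.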
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Anomaly Detection Techniques and Applications
