RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
Zijun Liao, Yian Zhao, Xin Shan, Yu Yan, Chang Liu, Lei Lu, Xiangyang Ji, Jie Chen

TL;DR
This paper introduces RT-DETRv4, a novel framework that leverages Vision Foundation Models to enhance lightweight real-time object detectors, achieving state-of-the-art accuracy without additional inference costs.
Contribution
The paper proposes a cost-effective distillation framework that uses Vision Foundation Models (VFMs) to improve lightweight detectors. It has two components: a Deep Semantic Injector (DSI) and Gradient-guided Adaptive Modulation.
Findings
Achieves state-of-the-art COCO AP scores of 49.7/53.5/55.4/57.0 across four model variants.
Maintains inference speeds of 273/169/124/78 FPS for the same four variants, respectively.
Delivers consistent performance gains across DETR-based models.
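To make the speed figures easier to compare, the short sketch below converts each reported FPS value into an approximate per-image latency. It assumes that throughput is the inverse of single-image latency (batch size 1, no pipelining), which this summary does not state. The helper name `latency_ms` is illustrative and not taken from the paper.

```python
# Reported COCO AP and throughput, in the order listed in the findings.
ap_scores = [49.7, 53.5, 55.4, 57.0]
fps_values = [273, 169, 124, 78]


def latency_ms(frames_per_second):
    """Approximate per-image latency in milliseconds.

    Assumption: latency is the inverse of throughput (batch size 1).
    """
    return 1000.0 / frames_per_second


for ap, fps in zip(ap_scores, fps_values):
    print(f"AP {ap:.1f} at {fps} FPS is roughly {latency_ms(fps):.2f} ms per image")
```

Under that assumption, the fastest variant takes roughly 3.7 ms per image and the most accurate one roughly 12.8 ms.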
Abstract
Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs…
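The abstract is cut off before it describes how the DSI works, so the following is not the paper's method. It is a minimal sketch of a generic feature-alignment loss of the kind often used in feature distillation: the student's feature vectors are pulled toward those of a frozen teacher by penalizing low cosine similarity. The function names and toy vectors are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(student_feats, teacher_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    Generic illustration only; not the loss defined in the paper.
    """
    pairs = list(zip(student_feats, teacher_feats))
    return sum(1.0 - cosine_similarity(s, t) for s, t in pairs) / len(pairs)


# Toy features: two vectors per model, two dimensions each.
teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned_student = [[2.0, 0.0], [0.0, 1.0]]     # same directions as the teacher
orthogonal_student = [[0.0, 1.0], [3.0, 0.0]]  # perpendicular to the teacher

print(alignment_loss(aligned_student, teacher))
print(alignment_loss(orthogonal_student, teacher))
```

The loss depends only on direction, not magnitude, so a student whose features point the same way as the teacher's incurs no penalty even when the scales differ. Because a loss like this applies only during training, it fits the summary's claim that the gains come without additional inference cost.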
