Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
Jiannan Huang, Aditya Kane, Fengzhe Zhou, Yunchao Wei, Humphrey Shi

TL;DR
Le-DETR introduces an efficient, high-performance real-time detection transformer that reduces pre-training costs by using modern efficient architectures and local attention, achieving state-of-the-art results with less data and computation.
Contribution
The paper proposes Le-DETR, a novel real-time detection transformer with an efficient encoder design that significantly reduces pre-training overhead while maintaining high accuracy.
Findings
Le-DETR achieves 52.9/54.3/55.1 mAP on COCO with fast inference times.
It surpasses YOLOv12-L/X in mAP while maintaining similar speed.
Le-DETR reduces pre-training data by 80% compared to previous methods.
Abstract
Real-time object detection is crucial for real-world applications as it requires high accuracy with low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are challenging to reproduce from scratch due to excessive pre-training overheads on the backbone, which constrains research by hindering the exploration of novel backbone architectures. In this paper, we aim to show that, with sound general design, it is possible to achieve high performance with low pre-training cost. After a thorough study of the backbone architecture, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these…
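The abstract names local attention as a building block of both the backbone and the re-designed encoder. As a rough illustration of the general idea only, and not the authors' EfficientNAT or encoder implementation, the sketch below restricts single-head self-attention over a 1D token sequence to a fixed window around each token. The function name `local_attention`, the `radius` parameter, the use of the input as query, key, and value without learned projections, and the dense-mask formulation are all assumptions made for brevity; a practical implementation would avoid building the full score matrix.

```python
import numpy as np

def local_attention(x, radius):
    """Hypothetical sketch: token i attends only to tokens j with |i - j| <= radius.

    x is an (n, d) array. Queries, keys, and values are all x itself
    (no learned projections), which is a simplification for illustration.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                       # (n, n) scaled dot products
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= radius
    scores = np.where(mask, scores, -np.inf)            # block tokens outside the window
    scores = scores - scores.max(axis=1, keepdims=True) # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))
out, weights = local_attention(tokens, radius=1)
```

With `radius=1`, each row of `weights` has at most three non-zero entries, so the number of attended pairs grows linearly with sequence length rather than quadratically, which is the usual motivation for local attention in latency-sensitive models.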
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
