DEYOv3: DETR with YOLO for Real-time Object Detection
Haodong Ouyang

TL;DR
DEYOv3 introduces a step-by-step training approach for real-time object detection that removes the need for ImageNet pretraining of the backbone, lowering training cost while improving accuracy and speed over existing real-time detectors.
Contribution
The paper proposes a novel step-by-step training method for DETR-like models, enabling flexible backbone design and higher accuracy without additional datasets.
Findings
DEYOv3-N achieves 41.1% AP at 270 FPS on COCO.
DEYOv3-L achieves 51.3% AP at 102 FPS.
Training can be completed on a single RTX3090 GPU.
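The FPS figures above can be read as per-image latency. The conversion below is a minimal sketch that assumes FPS is measured as single-image throughput (latency = 1000 / FPS); the helper name latency_ms is illustrative and does not come from the paper.

```python
def latency_ms(fps: float) -> float:
    """Convert frames per second to milliseconds per image.

    Assumes one image is processed at a time, so latency is the inverse of throughput.
    """
    return 1000.0 / fps


# FPS values reported in the findings above.
for fps in (270, 102):
    print(f"{fps} FPS -> {latency_ms(fps):.1f} ms per image")
# 270 FPS -> 3.7 ms per image
# 102 FPS -> 9.8 ms per image
```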
Abstract
Recently, end-to-end object detectors have gained significant attention from the research community due to their outstanding performance. However, DETR typically relies on supervised pretraining of the backbone on ImageNet, which limits the practical application of DETR and the design of the backbone, affecting the model's potential generalization ability. In this paper, we propose a new training method called step-by-step training. Specifically, in the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector. In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch. Due to this training method, the object detector does not need the additional dataset (ImageNet) to train the backbone, which makes the design of the backbone more flexible and dramatically…
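The two-stage idea in the abstract can be sketched as a weight-transfer step. This is a toy illustration rather than the authors' implementation: weights are plain Python lists instead of tensors, the parameter names and the "backbone." / "encoder." prefixes are hypothetical, and treating the transferred parameters as frozen in stage two is an assumption, since the abstract only says that the detector is trained from scratch.

```python
from typing import Dict, List, Set, Tuple

Weights = Dict[str, List[float]]

# Hypothetical name prefixes for the parts shared by the YOLO and DETR-like models.
SHARED_PREFIXES: Tuple[str, ...] = ("backbone.", "encoder.")


def init_from_yolo(yolo_ckpt: Weights, detr_model: Weights) -> Tuple[Weights, Set[str]]:
    """Build stage-two weights and the set of parameter names left trainable."""
    merged: Weights = {}
    trainable: Set[str] = set()
    for name, value in detr_model.items():
        if name.startswith(SHARED_PREFIXES) and name in yolo_ckpt:
            # Reuse the stage-one (one-to-many YOLO) weights for shared parts.
            merged[name] = list(yolo_ckpt[name])
        else:
            # Everything else keeps its fresh initialization and is trained from scratch.
            merged[name] = list(value)
            trainable.add(name)
    return merged, trainable


# Toy checkpoints standing in for real model state.
yolo_ckpt = {
    "backbone.conv1": [0.5, -0.2],
    "encoder.fuse": [0.1],
    "yolo_head.cls": [0.9],  # one-to-many head, not carried over
}
detr_model = {
    "backbone.conv1": [0.0, 0.0],
    "encoder.fuse": [0.0],
    "decoder.query": [0.01, 0.02],
}

weights, trainable = init_from_yolo(yolo_ckpt, detr_model)
print(sorted(trainable))  # ['decoder.query']
```

In this sketch the YOLO head is simply dropped, and only the decoder parameters end up in the trainable set, which mirrors the claim that no separate ImageNet-pretrained backbone is needed.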
Taxonomy
Topics: Advanced Neural Network Applications · COVID-19 diagnosis using AI · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Residual Connection · Adam · Feedforward Network · Dropout · Linear Layer · Layer Normalization · Label Smoothing · Multi-Head Attention · Byte Pair Encoding
