Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee

TL;DR
This paper introduces OmDet-Turbo, a real-time transformer-based open-vocabulary object detection model with an efficient fusion head, achieving high speed and competitive accuracy on benchmark datasets.
Contribution
The paper presents OmDet-Turbo, a novel real-time open-vocabulary detection model with an innovative fusion head that significantly improves inference speed and maintains high detection performance.
Findings
OmDet-Turbo-Base achieves 100.2 FPS with TensorRT and language cache techniques applied.
In zero-shot detection on COCO and LVIS, it performs nearly on par with state-of-the-art models.
It sets new state-of-the-art results on the ODinW and OVDEval benchmarks.
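
The "language cache" named in the findings is not described in this summary. A minimal sketch of one plausible reading, assuming it means computing the text embedding for each class label once and reusing it across frames, could look like the following. The `LanguageCache` class and `toy_text_encoder` function are hypothetical names made up for illustration; they are not the authors' implementation, and the toy encoder only stands in for a real text backbone.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Embedding = Tuple[float, ...]


class LanguageCache:
    """Memoize text embeddings so repeated labels skip the text encoder."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._store: Dict[str, Embedding] = {}
        self.encoder_calls = 0  # counts how often the (expensive) encoder runs

    def get(self, label: str) -> Embedding:
        # Only call the encoder the first time a label is seen.
        if label not in self._store:
            self.encoder_calls += 1
            self._store[label] = self._encode(label)
        return self._store[label]

    def get_many(self, labels: List[str]) -> List[Embedding]:
        return [self.get(label) for label in labels]


def toy_text_encoder(label: str) -> Embedding:
    """Hypothetical stand-in for a text backbone: hash bytes -> 4 floats."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:4])


cache = LanguageCache(toy_text_encoder)
labels = ["person", "dog", "bicycle"]
first = cache.get_many(labels)   # encoder runs once per label
second = cache.get_many(labels)  # served entirely from the cache
print(cache.encoder_calls)       # prints 3
```

With a fixed vocabulary, the encoder cost is paid once per label rather than once per frame, which is why a cache of this kind would matter for a throughput figure measured in frames per second.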
Abstract
End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. In zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par…
Taxonomy
Topics: Natural Language Processing Techniques
