Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

Tiancheng Zhao; Peng Liu; Xuan He; Lu Zhang; Kyusong Lee

arXiv:2403.06892·cs.CV·December 3, 2024·3 cites

TL;DR

This paper introduces OmDet-Turbo, a real-time transformer-based open-vocabulary object detection model with an efficient fusion head, achieving high speed and competitive accuracy on benchmark datasets.

Contribution

The paper presents OmDet-Turbo, a real-time open-vocabulary detection model whose Efficient Fusion Head (EFH) improves inference speed while maintaining high detection performance.

Findings

01 OmDet-Turbo-Base achieves 100.2 FPS with TensorRT and language cache techniques applied.

02 In zero-shot detection on COCO and LVIS, it performs nearly on par with state-of-the-art models.

03 It sets new benchmarks on the ODinW and OVDEval datasets.

Abstract

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. In zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par…
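The abstract mentions a "language cache" without describing it. One plausible reading, offered here only as an assumption, is that text embeddings for class prompts are computed once and reused across frames, so the text encoder does not run again for labels it has already seen. The sketch below illustrates that general idea; `LanguageCache` and `toy_encoder` are hypothetical names invented for this example and are not taken from the OmDet-Turbo code.

```python
from typing import Callable, Dict, List, Tuple

Embedding = Tuple[float, ...]


class LanguageCache:
    """Memoize text embeddings by prompt string (illustrative, not the paper's code)."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._store: Dict[str, Embedding] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, prompt: str) -> Embedding:
        # Run the (assumed expensive) encoder only for unseen prompts.
        if prompt not in self._store:
            self.misses += 1
            self._store[prompt] = self._encode(prompt)
        return self._store[prompt]

    def get_many(self, prompts: List[str]) -> List[Embedding]:
        return [self.get(p) for p in prompts]


def toy_encoder(prompt: str) -> Embedding:
    # Hypothetical stand-in for a real text encoder; returns a tiny fake vector.
    return (float(len(prompt)), float(sum(map(ord, prompt)) % 97))


cache = LanguageCache(toy_encoder)
labels = ["person", "dog", "bicycle"]

# Simulate three video frames that all query the same label vocabulary.
for _frame in range(3):
    embeddings = cache.get_many(labels)
```

Under this assumption, the encoder cost is paid once per distinct label rather than once per frame, which is the kind of saving that would matter when the label set is fixed across a video stream.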

Taxonomy

Topics: Natural Language Processing Techniques