HDINO: A Concise and Efficient Open-Vocabulary Detector
Hao Zhang, Yiqun Wang, Qinran Lin, Runze Fan, Yong Li

TL;DR
HDINO introduces a concise, resource-efficient open-vocabulary object detection method that leverages a two-stage training strategy with semantic alignment and lightweight feature fusion, achieving state-of-the-art results without manual data curation.
Contribution
The paper presents HDINO, an open-vocabulary detector that removes the reliance on manually curated fine-grained training datasets and on resource-intensive layer-wise cross-modal feature extraction, using a two-stage training strategy built on the transformer-based DINO model.
Findings
HDINO-T achieves 49.2 mAP on COCO using 2.2M training images.
HDINO outperforms Grounding DINO-T and T-Rex2 in mAP.
Fine-tuned HDINO models reach 56.4 and 59.2 mAP on COCO.
Abstract
Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage,…
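The abstract does not give the DWCL formula, so the following is only a minimal sketch of the general idea of weighting a classification loss by detection difficulty. Everything specific here is an assumption made for illustration and is not taken from the paper: difficulty is defined as 1 minus the IoU between an initial predicted box and its ground-truth box, the weight is `1 + gamma * difficulty`, and the base loss is binary cross-entropy. The names `iou`, `difficulty_weighted_bce`, and `gamma` are hypothetical.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def difficulty_weighted_bce(probs, targets, initial_boxes, gt_boxes, gamma=1.0):
    """Mean binary cross-entropy with a per-sample difficulty weight.

    ASSUMPTION (not from the paper): difficulty = 1 - IoU(initial box, GT box),
    and weight = 1 + gamma * difficulty, so poorly localized samples count more.
    """
    eps = 1e-7
    total = 0.0
    for p, t, box0, gt in zip(probs, targets, initial_boxes, gt_boxes):
        difficulty = 1.0 - iou(box0, gt)          # 0 = easy, 1 = hard
        weight = 1.0 + gamma * difficulty         # hypothetical weighting rule
        p = min(max(p, eps), 1.0 - eps)           # clamp to avoid log(0)
        bce = -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
        total += weight * bce
    return total / len(probs)


# Same predicted probability, but the second sample's initial box misses the
# ground truth entirely, so it contributes more to the loss.
gt = (0.0, 0.0, 2.0, 2.0)
easy = difficulty_weighted_bce([0.5], [1.0], [(0.0, 0.0, 2.0, 2.0)], [gt])
hard = difficulty_weighted_bce([0.5], [1.0], [(5.0, 5.0, 7.0, 7.0)], [gt])
```

Under these assumptions, a sample whose initial box does not overlap the ground truth is weighted twice as heavily as a perfectly localized one when `gamma=1.0`, which is one simple way to emphasize hard examples; the paper's actual loss may differ.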
Taxonomy
Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
