HDINO: A Concise and Efficient Open-Vocabulary Detector

Hao Zhang; Yiqun Wang; Qinran Lin; Runze Fan; Yong Li

arXiv:2603.02924·cs.CV·March 4, 2026

Open Access

TL;DR

HDINO introduces a concise, resource-efficient open-vocabulary object detection method that leverages a two-stage training strategy with semantic alignment and lightweight feature fusion, achieving state-of-the-art results without manual data curation.

Contribution

The paper presents HDINO, an open-vocabulary detector that removes the dependence on manually curated fine-grained training datasets and on resource-intensive layer-wise cross-modal feature extraction, using a two-stage training strategy built on the transformer-based DINO model.

Findings

01. HDINO-T achieves 49.2 mAP on COCO with 2.2M images.

02. HDINO outperforms Grounding DINO-T and T-Rex2 in mAP.

03. Fine-tuned HDINO models reach 56.4 and 59.2 mAP on COCO.

Abstract

Despite the growing interest in open-vocabulary object detection in recent years, most existing methods rely heavily on manually curated fine-grained training datasets as well as resource-intensive layer-wise cross-modal feature extraction. In this paper, we propose HDINO, a concise yet efficient open-vocabulary object detector that eliminates the dependence on these components. Specifically, we propose a two-stage training strategy built upon the transformer-based DINO model. In the first stage, noisy samples are treated as additional positive object instances to construct a One-to-Many Semantic Alignment Mechanism (O2M) between the visual and textual modalities, thereby facilitating semantic alignment. A Difficulty Weighted Classification Loss (DWCL) is also designed based on initial detection difficulty to mine hard examples and further improve model performance. In the second stage,…
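The first-stage ideas in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: the helper names (`noisy_positives`, `difficulty_weighted_bce`), the uniform box jitter, the use of `1 - IoU` as the "initial detection difficulty", and the `1 + difficulty` weight are hypothetical and may differ from what the paper actually does.

```python
import math
import random


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def noisy_positives(gt_box, k, scale, rng):
    """Hypothetical one-to-many step: jitter one ground-truth box k times.

    Every jittered copy is treated as a positive for the same text label,
    so one label is aligned with several visual instances.
    """
    x1, y1, x2, y2 = gt_box
    w, h = x2 - x1, y2 - y1
    boxes = []
    for _ in range(k):
        j = [rng.uniform(-scale, scale) for _ in range(4)]
        boxes.append((x1 + j[0] * w, y1 + j[1] * h,
                      x2 + j[2] * w, y2 + j[3] * h))
    return boxes


def difficulty_weighted_bce(prob, target, difficulty, eps=1e-7):
    """Binary cross-entropy scaled by (1 + difficulty).

    The weighting form is an assumption, not the paper's DWCL definition.
    """
    p = min(max(prob, eps), 1.0 - eps)
    bce = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    return (1.0 + difficulty) * bce


rng = random.Random(0)
gt = (10.0, 10.0, 50.0, 50.0)
positives = noisy_positives(gt, k=3, scale=0.2, rng=rng)
# Assumed difficulty proxy: poorly overlapping initial boxes count as harder.
difficulties = [1.0 - iou(box, gt) for box in positives]
losses = [difficulty_weighted_bce(0.6, 1.0, d) for d in difficulties]
```

Under these assumptions, a noisy box that overlaps the ground truth poorly receives a larger loss weight than one that overlaps it well, which is one simple way to emphasize hard examples.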

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques