OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion
Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, Yaowei Wang, Xiangyuan Lan, Xiaodan Liang

TL;DR
OV-DINO introduces a unified framework with language-aware selective fusion for open-vocabulary detection, effectively reducing pseudo-label noise and enhancing cross-modality alignment, achieving state-of-the-art zero-shot detection results.
Contribution
The paper presents a unified training pipeline and a language-aware selective fusion module, which together improve open-vocabulary detection through better data integration and cross-modality alignment.
Findings
Achieves 50.6% AP on COCO zero-shot detection
Attains 40.1% AP on LVIS zero-shot detection
Outperforms existing methods in fine-tuned detection on COCO
Abstract
Open-vocabulary detection is a challenging task due to the requirement of detecting objects based on class names, including those not encountered during training. Existing methods have shown strong zero-shot detection capabilities through pre-training and pseudo-labeling on diverse large-scale datasets. However, these approaches encounter two main challenges: (i) how to effectively eliminate data noise from pseudo-labeling, and (ii) how to efficiently leverage the language-aware capability for region-level cross-modality fusion and alignment. To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework. Specifically, we introduce a Unified Data Integration (UniDI) pipeline to enable end-to-end training and eliminate noise from…
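The core idea of detecting objects "based on class names" can be illustrated with a minimal sketch: each candidate region is scored against text embeddings of the class names, so new classes can be added at inference time by supplying new names. This is a generic illustration of open-vocabulary scoring, not OV-DINO's implementation. The function names, the toy 3-dimensional embeddings, and the 0.7 threshold are all assumptions made for the example; a real system would obtain embeddings from trained image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_regions(region_embeddings, class_embeddings, threshold=0.7):
    """Assign each region the best-matching class name, or None if no
    class reaches the (hypothetical) similarity threshold."""
    results = []
    for region in region_embeddings:
        scores = {name: cosine(region, emb) for name, emb in class_embeddings.items()}
        best = max(scores, key=scores.get)
        label = best if scores[best] >= threshold else None
        results.append((label, scores[best]))
    return results


# Toy text embeddings keyed by class name (assumed values, for illustration only).
class_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],  # a class name that could be added at inference time
}

# Toy region embeddings: one cat-like, one zebra-like, one ambiguous region.
region_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.1, 0.95],
    [0.5, 0.5, 0.5],
]

predictions = classify_regions(region_embeddings, class_embeddings)
for label, score in predictions:
    print(label, round(score, 3))
```

Because the class list is just a dictionary of name embeddings, extending the vocabulary requires no retraining in this sketch. The ambiguous third region falls below the threshold and is left unlabeled.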
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
