OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
Junjie Wang, Bin Chen, Bin Kang, Yulin Li, YiChi Chen, Weizhi Xian,, Huifeng Chang, Yong Xu

TL;DR
This paper introduces OV-DQUO, a novel open-vocabulary detection method that uses denoising text query training and unknown object supervision to improve detection of novel categories, achieving state-of-the-art results.
Contribution
The paper proposes a wildcard matching method and a denoising text query training strategy to enhance open-vocabulary detection, addressing confidence bias and background confusion issues.
Findings
Achieved 45.6 AP50 on OV-COCO benchmark.
Achieved 39.3 mAP on OV-LVIS benchmark.
Outperformed previous methods without extra training data.
Abstract
Open-vocabulary detection aims to detect objects from novel categories beyond the base categories on which the detector is trained. However, existing open-vocabulary detectors trained on base category data tend to assign higher confidence to trained categories and confuse novel categories with the background. To resolve this, we propose OV-DQUO, an \textbf{O}pen-\textbf{V}ocabulary DETR with \textbf{D}enoising text \textbf{Q}uery training and open-world \textbf{U}nknown \textbf{O}bjects supervision. Specifically, we introduce a wildcard matching method. This method enables the detector to learn from pairs of unknown objects recognized by the open-world detector and text embeddings with general semantics, mitigating the confidence bias between base and novel categories. Additionally, we propose a denoising text query training strategy. It synthesizes foreground and background query-box…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
MethodsAttention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections
