WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
Shenghao Fu, Yukun Su, Fengyun Rao, Jing Lyu, Xiaohua Xie, Wei-Shi Zheng

TL;DR
WeDetect is a fast open-vocabulary object detection framework that treats recognition as retrieval, reporting state-of-the-art performance and supporting applications such as object retrieval and referring expression comprehension.
Contribution
The paper presents a family of non-fusion, retrieval-based detection models that surpass fusion models in speed and accuracy, and introduces new applications such as retrieval over historical data and referring expression comprehension (REC).
Findings
State-of-the-art open-vocabulary detection performance
Real-time detection with a dual-tower architecture
Effective retrieval of objects in historical data
Abstract
Open-vocabulary object detection aims to detect arbitrary classes via text prompts. Methods without cross-modal fusion layers (non-fusion) offer faster inference by treating recognition as a retrieval problem, i.e., matching regions to text queries in a shared embedding space. In this work, we fully explore this retrieval philosophy and demonstrate its unique advantages in efficiency and versatility through a model family named WeDetect: (1) State-of-the-art performance. WeDetect is a real-time detector with a dual-tower architecture. We show that, with well-curated data and full training, the non-fusion WeDetect surpasses other fusion models and establishes a strong open-vocabulary foundation. (2) Fast backtrack of historical data. WeDetect-Uni is a universal proposal generator based on WeDetect. We freeze the entire detector and only finetune an objectness prompt to retrieve generic…
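The retrieval view described in the abstract, matching regions to text queries in a shared embedding space, can be illustrated with a minimal sketch. This is not the WeDetect implementation: the function name `match_regions_to_text`, the use of cosine similarity, and the `score_threshold` parameter are all assumptions made for illustration, and the embeddings are toy arrays rather than outputs of real image and text towers.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def match_regions_to_text(region_emb, text_emb, score_threshold=0.5):
    """Assign each region to its best-matching text query (hypothetical helper).

    region_emb: (R, D) array, one embedding per candidate region.
    text_emb:   (C, D) array, one embedding per class prompt.
    Returns (labels, scores, keep), where keep marks regions whose best
    similarity reaches score_threshold (an assumed filtering rule).
    """
    sim = l2_normalize(region_emb) @ l2_normalize(text_emb).T  # (R, C)
    labels = sim.argmax(axis=1)
    scores = sim.max(axis=1)
    keep = scores >= score_threshold
    return labels, scores, keep


if __name__ == "__main__":
    # Toy data: 3 regions and 3 class prompts in a 4-dimensional space.
    regions = np.array([
        [2.0, 0.0, 0.0, 0.0],   # aligned with prompt 0
        [0.0, 0.1, 3.0, 0.0],   # mostly aligned with prompt 2
        [1.0, 1.0, 1.0, 1.0],   # ambiguous across all prompts
    ])
    prompts = np.eye(3, 4)
    labels, scores, keep = match_regions_to_text(regions, prompts, score_threshold=0.6)
    print(labels, np.round(scores, 3), keep)
```

Because the two towers do not interact before the similarity step in a non-fusion design, either set of embeddings can in principle be computed once and reused across queries, which is the property the abstract relies on for fast lookups over historical data.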
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
