Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Yanxin Long, Jianhua Han, Runhui Huang, Xu Hang, Yi Zhu, Chunjing Xu,, Xiaodan Liang

TL;DR
This paper introduces a fine-grained visual-text prompt-driven self-training method that significantly improves open-vocabulary object detection by enhancing alignment between visual features and text prompts, achieving state-of-the-art results.
Contribution
It proposes a novel fine-grained alignment approach using learnable text prompts and a visual prompt module to adapt pre-trained VLMs for open-vocabulary detection.
Findings
Achieves 31.5% mAP on unseen COCO classes
Enhances fine-grained alignment between visual and textual features
Outperforms previous state-of-the-art methods in open-vocabulary detection
Abstract
Inspired by the success of vision-language methods (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since the current VLMs are usually pre-trained with aligning sentence embedding with global image embedding, the direct use of them lacks fine-grained alignment for object instances, which is the core of detection. In this paper, we propose a simple but effective fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD) that introduces a fine-grained visual-text prompt adapting stage to enhance the current self-training paradigm with a more powerful fine-grained alignment. During the adapting stage, we enable VLM to obtain fine-grained alignment by…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
