TL;DR
PET-DINO introduces a universal object detector that leverages prompt-enriched training and alignment-friendly visual prompts to improve zero-shot detection across diverse scenarios.
Contribution
The paper proposes a unified detection framework with prompt-enriched training strategies, which reduces development complexity and improves zero-shot detection performance.
Findings
PET-DINO achieves competitive zero-shot detection results.
The proposed training strategies enable effective modeling of multiple prompts.
The framework supports both text and visual prompts for versatile detection.
Abstract
Open-Set Object Detection (OSOD) enables recognition of novel categories beyond fixed classes but faces challenges in aligning text representations with complex visual concepts and the scarcity of image-text pairs for rare categories. This results in suboptimal performance in specialized domains or with complex objects. Recent visual-prompted methods partially address these issues but often involve complex multi-modal designs and multi-stage optimizations, prolonging the development cycle. Additionally, effective training strategies for data-driven OSOD models remain largely unexplored. To address these challenges, we propose PET-DINO, a universal detector supporting both text and visual prompts. Our Alignment-Friendly Visual Prompt Generation (AFVPG) module builds upon an advanced text-prompted detector, addressing the limitations of text representation guidance and reducing the…
