TL;DR
DETR-ViP is a robust detection framework that makes visual prompts more class-discriminative, improving open-vocabulary object detection on COCO, LVIS, ODinW, and Roboflow100.
Contribution
The paper proposes two techniques, global prompt integration and visual-textual prompt relation distillation, to improve the discriminability of visual prompts in object detection.
Findings
DETR-ViP outperforms state-of-the-art methods on COCO, LVIS, ODinW, and Roboflow100.
Incorporating global prompt integration improves detection accuracy.
Visual-textual prompt relation distillation enhances class discriminability.
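The summary does not spell out how the relation distillation loss is defined. One common reading of "relation distillation" is that the class-to-class similarity structure among visual prompts is pulled toward the structure among text prompts. The sketch below illustrates only that reading; the function names, the cosine similarity, and the mean-squared-error form are all assumptions, not the paper's formulation.

```python
import math


def cosine_matrix(vectors):
    """Pairwise cosine similarities between the row vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv)
         for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def relation_distillation_loss(visual_prompts, text_prompts):
    """Hypothetical loss: mean squared gap between the two relation matrices.

    Row i of each input is assumed to be the prompt embedding for class i.
    The two inputs may have different embedding widths, because only the
    class-by-class similarity matrices are compared.
    """
    sim_v = cosine_matrix(visual_prompts)
    sim_t = cosine_matrix(text_prompts)
    n = len(sim_v)
    total = sum(
        (sim_v[i][j] - sim_t[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


# Two classes whose text prompts are orthogonal (fully distinguishable).
text = [[1.0, 0.0], [0.0, 1.0]]
separated = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]   # visual prompts keep classes apart
collapsed = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]   # visual prompts are identical

loss_separated = relation_distillation_loss(separated, text)
loss_collapsed = relation_distillation_loss(collapsed, text)
```

Under this assumed loss, visual prompts that collapse onto each other are penalized while prompts that preserve the text-side class separation are not, which is the kind of pressure toward class discriminability the finding describes.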
Abstract
Visual-prompted object detection enables interactive and flexible definition of target categories, thereby facilitating open-vocabulary detection. Since visual prompts are derived directly from image features, they often outperform text prompts in recognizing rare categories. Nevertheless, research on visual-prompted detection has been largely overlooked, and it is typically treated as a byproduct of training text-prompted detectors, which hinders its development. To fully unlock the potential of visual-prompted detection, we investigate the reasons why its performance is suboptimal and reveal that the underlying issue lies in the absence of global discriminability in visual prompts. Motivated by these observations, we propose DETR-ViP, a robust object detection framework that yields class-distinguishable visual prompts. On top of basic image-text contrastive learning, DETR-ViP…