TL;DR
CP-DETR is a universal object detection model that uses concept prompts and a prompt-visual hybrid encoder to improve zero-shot and few-shot detection across diverse scenarios.
Contribution
The paper proposes a novel prompt-guided hybrid encoder and concept prompt generation methods, enhancing universal detection performance with a single pre-trained model.
Findings
Achieves 47.6 zero-shot AP on LVIS with a Swin-T backbone.
Attains 68.4 AP on COCO val with visual prompts.
Reaches 73.1 fully-shot AP on ODinW13 with optimized prompts.
Abstract
Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated…
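The abstract names two fusion stages in the hybrid encoder (scale-by-scale and multi-scale) without giving their internals in the excerpt above. The following is a minimal sketch of how such a two-stage prompt-visual fusion could be wired, assuming plain single-head cross-attention with a residual connection as the fusion operator. The function names `cross_attend` and `hybrid_encode`, the absence of learned projections, and the toy inputs are all hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    """Assumed fusion operator: each query receives a residual update equal to
    the softmax-weighted sum of the key vectors (no learned projections)."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def hybrid_encode(scales, prompts):
    """Hypothetical two-stage fusion of multi-scale visual tokens with prompts."""
    # Stage 1 (scale-by-scale): each feature level attends to the prompts separately.
    fused = [cross_attend(tokens, prompts) for tokens in scales]
    # Stage 2 (multi-scale): tokens from all levels are flattened and fused jointly.
    flat = [token for level in fused for token in level]
    flat = cross_attend(flat, prompts)
    # The prompts are also updated from the fused visual tokens.
    new_prompts = cross_attend(prompts, flat)
    return flat, new_prompts


# Toy input: two feature levels (2 tokens and 1 token), one prompt embedding.
scales = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
prompts = [[1.0, 0.0]]
visual_tokens, updated_prompts = hybrid_encode(scales, prompts)
print(visual_tokens)
```

With a single prompt the attention weight is always 1, so each visual token simply receives the prompt vector once per stage. A real encoder would use learned query, key, and value projections and many prompt embeddings.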