Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm
Jinrong Zhang, Penghui Wang, Chunxiao Liu, Wei Liu, Dian Jin, Qiong Zhang, Erli Meng, Zhengnan Hu

TL;DR
This paper introduces the Image Prompt Paradigm for open-set visual perception tasks, enabling detection and segmentation of specialized categories using only a few image prompts without human interaction, and demonstrates competitive performance.
Contribution
It proposes a novel image prompt paradigm and a framework called MI Grounding that automatically encodes, selects, and fuses image prompts for non-interactive open-set detection and segmentation.
Findings
Achieves competitive results on public OSOD and OSS benchmarks.
Outperforms existing methods on the ADR50K dataset.
Enables fully automated detection and segmentation with minimal prompts.
Abstract
To break through the limitations of pre-training models on fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally utilize text as a prompt, achieving remarkable performance. Following the SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover detection or segmentation targets. Although these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe the characteristics of a specialized category using a textual description. On the other hand, existing visual prompt paradigms rely heavily on multi-round human interaction, which hinders their application to fully automated pipelines. To address the above issues, we…
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications
Methods: Segment Anything Model
