Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm

Jinrong Zhang; Penghui Wang; Chunxiao Liu; Wei Liu; Dian Jin; Qiong Zhang; Erli Meng; Zhengnan Hu

arXiv:2412.10719·cs.CV·December 17, 2024

Open Access

TL;DR

This paper introduces the Image Prompt Paradigm for open-set visual perception: specialized categories are detected and segmented from only a few image prompts, with no human interaction. The approach reports competitive performance on public open-set detection and segmentation benchmarks.

Contribution

It proposes a novel image prompt paradigm and a framework called MI Grounding that automatically encodes, selects, and fuses image prompts for non-interactive open-set detection and segmentation.

Findings

01 Achieves competitive results on public OSOD and OSS benchmarks.

02 Outperforms existing methods on the ADR50K dataset.

03 Enables fully automated detection and segmentation with minimal prompts.

Abstract

To break through the limitations of pre-training models on fixed categories, Open-Set Object Detection (OSOD) and Open-Set Segmentation (OSS) have attracted a surge of interest from researchers. Inspired by large language models, mainstream OSOD and OSS methods generally use text as a prompt, achieving remarkable performance. Following the SAM paradigm, some researchers use visual prompts, such as points, boxes, and masks that cover detection or segmentation targets. Although these two prompt paradigms exhibit excellent performance, they also reveal inherent limitations. On the one hand, it is difficult to accurately describe the characteristics of specialized categories using textual descriptions. On the other hand, existing visual prompt paradigms rely heavily on multi-round human interaction, which hinders their application to fully automated pipelines. To address the above issues, we…

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications

Methods: Segment Anything Model