Prompt-Guided Mask Proposal for Two-Stage Open-Vocabulary Segmentation

Yu-Jhe Li; Xinyang Zhang; Kun Wan; Lantao Yu; Ajinkya Kale; Xin Lu

arXiv:2412.10292·cs.CV·December 16, 2024

TL;DR

This paper introduces a prompt-guided mask proposal method for open-vocabulary segmentation, improving mask alignment with text prompts and enhancing performance across multiple benchmarks.

Contribution

It proposes a novel prompt-guided mask proposal approach using cross-attention, significantly improving existing two-stage open-vocabulary segmentation models.
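
As a rough illustration of the cross-attention mechanism mentioned above, the sketch below lets a set of mask queries attend to text-prompt token embeddings. It is generic scaled dot-product cross-attention without learned projections, not the paper's architecture; the helper name `cross_attend`, the residual update, and the toy vectors are assumptions made for illustration only.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, text_tokens):
    """Hypothetical helper: each query attends over the text tokens.

    The text tokens act as both keys and values (no learned projections),
    and the attended context is added back to the query as a residual.
    """
    dim = len(text_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        context = [sum(w * k[d] for w, k in zip(weights, text_tokens)) for d in range(dim)]
        updated.append([qi + ci for qi, ci in zip(q, context)])
    return updated

# Toy data (assumed): two mask queries and two prompt-token embeddings.
mask_queries = [[1.0, 0.0], [0.0, 1.0]]
prompt_tokens = [[2.0, 0.0], [0.0, 2.0]]
guided_queries = cross_attend(mask_queries, prompt_tokens)
```

Each query is pulled toward the prompt token it is most similar to, which is the intuition behind conditioning mask generation on the text prompt.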

Findings

01 Achieved 1-3% absolute mIoU improvement on five benchmarks.

02 Demonstrated effective generalization of the prompt-guided approach.

03 Enhanced mask alignment with input text prompts.

Abstract

We tackle the challenge of open-vocabulary segmentation, where we need to identify objects from a wide range of categories in different environments, using text prompts as our input. To overcome this challenge, existing methods often use multi-modal models like CLIP, which combine image and text features in a shared embedding space to bridge the gap between limited and extensive vocabulary recognition, resulting in a two-stage approach: in the first stage, a mask generator takes an input image to generate mask proposals, and in the second stage the target mask is picked based on the query. However, the expected target mask may not exist in the generated mask proposals, which leads to an unexpected output mask. In our work, we propose a novel approach named Prompt-guided Mask Proposal (PMP) where the mask generator takes the input text prompts and generates masks guided by these…
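
The second stage of the conventional two-stage pipeline described above can be sketched as a similarity ranking: each mask proposal is represented by an embedding, and the proposal closest to the text-query embedding is picked. This is a minimal sketch under stated assumptions, not the authors' implementation; `select_mask`, the use of cosine similarity, and the toy three-dimensional embeddings are hypothetical stand-ins for CLIP-style features.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_mask(mask_embeddings, text_embedding):
    """Hypothetical second stage: return the index of the best-matching proposal."""
    scores = [cosine(e, text_embedding) for e in mask_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy data (assumed): one embedding per mask proposal from the first stage.
proposal_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]
query_embedding = [0.0, 1.0, 0.0]
best_index, scores = select_mask(proposal_embeddings, query_embedding)
```

The ranking always returns some proposal, even when none of them covers the target. That is the failure mode the abstract points to: if the first stage never proposes the expected mask, the second stage can only pick a wrong one.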

Taxonomy

Methods: Contrastive Language-Image Pre-training