InvSeg: Test-Time Prompt Inversion for Semantic Segmentation
Jiayi Lin, Jiabo Huang, Jian Hu, Shaogang Gong

TL;DR
InvSeg introduces a test-time prompt inversion technique that enhances open-vocabulary semantic segmentation by aligning visual and textual features through structure-aware prompt enrichment, achieving state-of-the-art results.
Contribution
The paper proposes InvSeg, a novel method that inverts image-specific visual context into text prompts, improving semantic segmentation accuracy across diverse datasets.
Findings
Achieves state-of-the-art performance on PASCAL VOC, PASCAL Context, and COCO datasets.
Utilizes Contrastive Soft Clustering to improve mask distinction and internal consistency.
Effectively aligns visual and textual features for open-vocabulary segmentation.
Abstract
Visual-textual correlations in the attention maps derived from text-to-image diffusion models are proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional discrepancy between the context-rich sentences used for image generation and the isolated class names typically used in semantic segmentation. This discrepancy hinders diffusion models from capturing accurate visual-textual correlations. To solve this, we propose InvSeg, a test-time prompt inversion method that tackles open-vocabulary semantic segmentation by inverting image-specific visual context into text prompt embedding space, leveraging structure information derived from the diffusion model's reconstruction process to enrich text prompts so as to associate each class with a structure-consistent mask. Specifically, we introduce…
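As a rough illustration of how attention between pixel features and class-prompt embeddings can be read as a segmentation, the toy sketch below scores each pixel against each class embedding, applies a softmax over classes, and takes the highest-scoring class per pixel. It is not InvSeg's implementation: the function names, the 2-D feature vectors, and the scaled dot-product scoring are all assumptions chosen for illustration, and no diffusion model is involved. Prompt inversion, as the abstract describes it, would adapt the class embeddings to each image at test time; that optimization step is not shown here.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_segmentation(pixel_feats, class_embeds):
    """Hypothetical helper: label each pixel with the class whose
    embedding receives the largest attention weight."""
    d = len(class_embeds[0])
    scale = 1.0 / math.sqrt(d)  # scaled dot-product, as in standard attention
    attn_maps, labels = [], []
    for feat in pixel_feats:
        scores = [scale * sum(f * c for f, c in zip(feat, emb))
                  for emb in class_embeds]
        probs = softmax(scores)  # per-pixel distribution over classes
        attn_maps.append(probs)
        labels.append(max(range(len(probs)), key=probs.__getitem__))
    return attn_maps, labels

# Toy data (assumed): four "pixels" and two class embeddings in 2-D.
pixel_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
class_embeds = [[1.0, 0.0], [0.0, 1.0]]

attn_maps, labels = attention_segmentation(pixel_feats, class_embeds)
print(labels)  # [0, 0, 1, 1]
```

In this toy setting the quality of the mask depends entirely on how well the class embeddings match the pixel features, which is the gap the abstract attributes to using isolated class names as prompts.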
Taxonomy
Topics: Speech Recognition and Synthesis · Advanced Neural Network Applications · Topic Modeling
Methods: Softmax · Attention Is All You Need · Diffusion · ALIGN
