ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts
Kwanyoung Kim, Yujin Oh, Jong Chul Ye

TL;DR
ZegOT introduces a novel zero-shot segmentation method that uses optimal transport to match multiple text prompts with frozen image features, achieving state-of-the-art results without retraining CLIP.
Contribution
The paper presents an optimal transport-based approach to zero-shot segmentation, built around a Multiple Prompt Optimal Transport Solver (MPOT), that needs neither an additional image encoder nor retraining or tuning of CLIP.
Findings
Achieves state-of-the-art zero-shot segmentation performance.
Effectively aligns multiple text prompts with visual features.
Operates without retraining or modifying the CLIP model.
Abstract
Recent success of large-scale Contrastive Language-Image Pre-training (CLIP) has led to great promise in zero-shot semantic segmentation by transferring image-text aligned knowledge to pixel-level classification. However, existing methods usually require an additional image encoder or retraining/tuning the CLIP module. Here, we propose a novel Zero-shot segmentation with Optimal Transport (ZegOT) method that matches multiple text prompts with frozen image embeddings through optimal transport. In particular, we introduce a novel Multiple Prompt Optimal Transport Solver (MPOT), which is designed to learn an optimal mapping between multiple text prompts and visual feature maps of the frozen image encoder hidden layers. This unique mapping method facilitates each of the multiple text prompts to effectively focus on distinct visual semantic attributes. Through extensive experiments on…
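The matching step described in the abstract can be illustrated with a generic entropic-regularized optimal transport (Sinkhorn) computation between a handful of prompt embeddings and a set of pixel embeddings. This is only a minimal sketch and not the authors' MPOT: the cosine-distance cost, the uniform marginals, the `eps` value, the random stand-in embeddings, and all function and variable names are assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, eps=0.5, n_iters=100):
    """Entropic-regularized OT with uniform marginals (generic Sinkhorn).

    cost: (num_prompts, num_pixels) array of pairwise costs.
    Returns a transport plan of the same shape whose entries sum to 1.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # assumed: equal mass on every prompt
    b = np.full(m, 1.0 / m)      # assumed: equal mass on every pixel
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows toward marginal a
        v = b / (K.T @ u)        # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Random stand-ins for frozen text and image features (hypothetical shapes).
rng = np.random.default_rng(0)
num_prompts, num_pixels, dim = 4, 64, 32
text = l2_normalize(rng.normal(size=(num_prompts, dim)))
pixels = l2_normalize(rng.normal(size=(num_pixels, dim)))

similarity = text @ pixels.T     # cosine similarity, shape (4, 64)
cost = 1.0 - similarity          # assumed cost: cosine distance
plan = sinkhorn(cost)

# Transport-weighted similarity: one scalar summarizing the prompt-pixel match.
score = float(np.sum(plan * similarity))
```

Each row of `plan` shows how one prompt's mass is spread over pixels. Because every prompt must ship the same total mass, the prompts are pushed toward covering different regions instead of all collapsing onto the same pixels, which is the intuition behind using several prompts at once.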
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Contrastive Language-Image Pre-training
