Multi-Modal Prototypes for Open-World Semantic Segmentation
Yuhuan Yang, Chaofan Ma, Chen Ju, Fei Zhang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

TL;DR
This paper introduces a multi-modal prototype framework for open-world semantic segmentation, integrating visual and textual clues to improve generalization to unseen categories in a unified architecture.
Contribution
It proposes a novel multi-modal prototype-based segmentation method that decomposes language information into multiple aspects and fuses it with visual data for enhanced open-world segmentation.
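As a rough illustration of what prototype-based segmentation with fused visual and textual cues can look like, here is a minimal sketch. It is not the authors' method: the abstract above is truncated before the details, so the fusion rule (a simple weighted average), the `alpha` weight, and every function name below are assumptions made for illustration only. Features are toy 2-D vectors over a flattened list of pixels.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def masked_average(features, mask):
    """Visual prototype: mean feature over support pixels where mask == 1."""
    selected = [f for f, m in zip(features, mask) if m == 1]
    dim = len(features[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def fuse(visual_proto, text_proto, alpha=0.5):
    """Hypothetical fusion: weighted average of the normalized prototypes.
    The paper's actual fusion is not described in the summary above."""
    v = l2_normalize(visual_proto)
    t = l2_normalize(text_proto)
    return l2_normalize([alpha * a + (1 - alpha) * b for a, b in zip(v, t)])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def segment(query_features, prototypes):
    """Label each query pixel with the class of its most similar prototype."""
    names = list(prototypes)
    return [max(names, key=lambda n: cosine(f, prototypes[n]))
            for f in query_features]

# Toy support image: three pixels, the first two belong to the class "cat".
support_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_mask = [1, 1, 0]
cat_text_embedding = [1.0, 0.2]  # stand-in for a text-encoder output

prototypes = {
    "cat": fuse(masked_average(support_features, support_mask),
                cat_text_embedding),
    "background": [0.0, 1.0],
}

query_features = [[0.8, 0.1], [0.1, 0.9]]
labels = segment(query_features, prototypes)
print(labels)
```

In this toy setup the first query pixel lies close to the fused "cat" prototype and the second close to the background prototype. A real system would take the features from a vision backbone and the text embedding from a language encoder.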
Findings
Outperforms previous state-of-the-art on PASCAL-5^i and COCO-20^i datasets.
Effective in zero-shot, few-shot, and generalized segmentation tasks.
Ablation studies confirm the importance of each component.
Abstract
In semantic segmentation, generalizing a visual system to both seen and novel categories at inference time has always been practically valuable yet challenging. To enable such functionality, existing methods mainly rely either on providing several support demonstrations from the visual aspect or on characterizing informative clues from the textual aspect (e.g., the class names). Nevertheless, both lines of work neglect the complementary nature of low-level visual and high-level language information, and explorations that treat the visual and textual modalities as a whole to improve predictions remain limited. To close this gap, we propose to combine textual and visual clues into multi-modal prototypes that give more comprehensive support for open-world semantic segmentation, and build a novel prototype-based segmentation framework to realize this promise. To be…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
