Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation
Jialei Chen, Daisuke Deguchi, Chenkai Zhang, Hiroshi Murase

TL;DR
This paper introduces CONCAT, a novel method for zero-shot panoptic segmentation that improves generalization to unseen categories by aligning visual and semantic features and synthesizing detailed pseudo-queries, achieving state-of-the-art results.
Contribution
The paper proposes CONCAT, a new approach combining feature alignment and semantic-vision training to enhance zero-shot segmentation performance and speed.
Findings
Achieves 5.2% higher hPQ than previous SOTA.
Effective in inductive ZPS and open-vocabulary segmentation.
Runs twice as fast during testing.
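For readers unfamiliar with the hPQ metric cited above: in zero-shot segmentation work it is commonly the harmonic mean of panoptic quality (PQ) on seen and unseen classes, so a model cannot score well by excelling on seen classes alone. The sketch below assumes the paper follows that convention. The function name `harmonic_pq` and the input numbers are illustrative and are not taken from the paper.

```python
def harmonic_pq(pq_seen: float, pq_unseen: float) -> float:
    """Harmonic mean of seen-class PQ and unseen-class PQ.

    Assumption: hPQ follows the usual zero-shot convention
    hPQ = 2 * PQ_seen * PQ_unseen / (PQ_seen + PQ_unseen).
    """
    if pq_seen < 0 or pq_unseen < 0:
        raise ValueError("PQ values must be non-negative")
    total = pq_seen + pq_unseen
    if total == 0:
        # Both scores are zero; define the harmonic mean as zero.
        return 0.0
    return 2.0 * pq_seen * pq_unseen / total


# Illustrative numbers only, not results from the paper.
print(harmonic_pq(40.0, 10.0))
```

Because the harmonic mean is dominated by the smaller term, a gain in hPQ mostly reflects better performance on the weaker of the two groups, which is typically the unseen classes.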
Abstract
Zero-shot Panoptic Segmentation (ZPS) aims to recognize foreground instances and background stuff when no training image contains the unseen categories. The task remains challenging because visual data are sparse and generalizing from seen to unseen categories is difficult. To better generalize to unseen classes, we propose Conditional tOken aligNment and Cycle trAnsiTion (CONCAT) to produce generalizable semantic vision queries. First, a feature extractor is trained with CON to link vision and semantics and provide target queries. Specifically, CON aligns the semantic queries with the CLIP visual CLS token extracted from complete and masked images. To address the lack of unseen categories, a generator is required. However, one of the gaps in synthesizing pseudo vision queries, i.e., vision queries for unseen categories, is describing fine-grained visual details…
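The abstract says CON aligns semantic queries with the CLIP visual CLS token but, as excerpted here, does not give the loss. One common way to express such an alignment is a temperature-scaled cosine-similarity cross-entropy, sketched below in plain Python. Everything in it is an assumption made for illustration: the function names, the temperature value, the toy vectors, and the choice of loss itself. It is not the authors' implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(cls_token, semantic_queries, target_index, temperature=0.07):
    """Cross-entropy over cosine similarities (hypothetical loss form).

    cls_token:        visual CLS embedding of one image
    semantic_queries: one embedding per category
    target_index:     index of the category the image shows
    """
    logits = [cosine(cls_token, q) / temperature for q in semantic_queries]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target_index]


# Toy 2-D embeddings: two categories, one complete-image token and
# one masked-image token, both belonging to category 0.
queries = [[1.0, 0.0], [0.0, 1.0]]
cls_complete = [0.9, 0.1]
cls_masked = [0.8, 0.3]

# Averaging the two terms is also an assumption of this sketch.
loss = 0.5 * (alignment_loss(cls_complete, queries, 0)
              + alignment_loss(cls_masked, queries, 0))
print(loss)
```

The loss shrinks as each CLS token moves toward the semantic query of its own category and away from the others, which is the sense of "alignment" the sketch illustrates.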
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Robotics and Sensor-Based Localization
Methods: Contrastive Language-Image Pre-training · ALIGN
