Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation

Jialei Chen; Daisuke Deguchi; Chenkai Zhang; Hiroshi Murase

arXiv:2402.13697 · cs.CV · February 22, 2024 · 1 citation

Open Access

TL;DR

This paper introduces CONCAT, a novel method for zero-shot panoptic segmentation that improves generalization to unseen categories by aligning visual and semantic features and synthesizing detailed pseudo-queries, achieving state-of-the-art results.

Contribution

The paper proposes CONCAT, a new approach combining feature alignment and semantic-vision training to enhance zero-shot segmentation performance and speed.

Findings

01. Achieves 5.2% higher hPQ than previous SOTA.
02. Effective in inductive ZPS and open-vocabulary segmentation.
03. Runs twice as fast during testing.
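
The hPQ metric named in the findings is not defined on this page. In the zero-shot segmentation literature it is commonly the harmonic mean of panoptic quality (PQ) on seen and unseen categories, and the sketch below assumes that definition. The function name `harmonic_pq` and the input numbers are hypothetical and are not results from the paper.

```python
def harmonic_pq(pq_seen: float, pq_unseen: float) -> float:
    """Harmonic mean of seen and unseen PQ (assumed definition of hPQ)."""
    if pq_seen < 0 or pq_unseen < 0:
        raise ValueError("PQ values must be non-negative")
    total = pq_seen + pq_unseen
    if total == 0:
        # Both scores are zero, so the harmonic mean is defined as zero here.
        return 0.0
    return 2.0 * pq_seen * pq_unseen / total


# Hypothetical inputs for illustration only.
example = harmonic_pq(40.0, 10.0)
print(example)
```

Because the harmonic mean is dominated by the smaller of the two values, a model that scores well on seen categories but poorly on unseen ones still gets a low hPQ, which is why gains on unseen categories matter most for this metric.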

Abstract

Zero-shot Panoptic Segmentation (ZPS) aims to recognize foreground instances and background stuff without images containing unseen categories in training. Due to visual data sparsity and the difficulty of generalizing from seen to unseen categories, this task remains challenging. To better generalize to unseen classes, we propose Conditional tOken aligNment and Cycle trAnsiTion (CONCAT) to produce generalizable semantic vision queries. First, a feature extractor is trained by CON to link vision and semantics and provide target queries. Formally, CON is proposed to align the semantic queries with the CLIP visual CLS token extracted from complete and masked images. To address the lack of unseen categories, a generator is required. However, one of the gaps in synthesizing pseudo vision queries, i.e., vision queries for unseen categories, is describing fine-grained visual details…
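
The abstract says CON aligns semantic queries with the CLIP visual CLS token from complete and masked images, but the excerpt does not state the objective. As a rough illustration only, the sketch below swaps in a generic CLIP-style objective: cosine similarity between a CLS vector and each class query, followed by softmax cross-entropy, averaged over the two views. The function names, the temperature value, and the equal weighting of the two views are all assumptions and should not be read as the paper's method.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_u * norm_v)


def alignment_loss(cls_token, semantic_queries, target, temperature=0.07):
    """Cross-entropy over cosine similarities (assumed, generic objective).

    cls_token: visual CLS vector for one image view.
    semantic_queries: one vector per class.
    target: index of the class the image view belongs to.
    """
    logits = [cosine(cls_token, q) / temperature for q in semantic_queries]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def two_view_alignment_loss(cls_full, cls_masked, semantic_queries, target):
    """Average the loss over the complete and masked views (assumed weighting)."""
    full = alignment_loss(cls_full, semantic_queries, target)
    masked = alignment_loss(cls_masked, semantic_queries, target)
    return 0.5 * (full + masked)


# Hypothetical 2-D vectors: class 0 points along x, class 1 along y.
queries = [[1.0, 0.0], [0.0, 1.0]]
print(two_view_alignment_loss([0.9, 0.1], [0.8, 0.3], queries, target=0))
```

Under these assumptions the loss is small when both views of an image lie close to the query of their own class and grows as they drift toward another class's query.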

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Robotics and Sensor-Based Localization

Methods: Contrastive Language-Image Pre-training · ALIGN