What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
Jianghang Lin, Yue Hu, Jiangtao Shen, Yunhang Shen, Liujuan Cao, Shengchuan Zhang, Rongrong Ji

TL;DR
This paper introduces a cognition-inspired framework for open vocabulary image segmentation that emulates human visual recognition by generating object concepts, enhancing semantic understanding, and improving segmentation accuracy across diverse datasets.
Contribution
It proposes a novel framework built from a generative vision-language model, a concept-aware visual enhancer, and a cognition-inspired decoder, bridging the gap between region segmentation and semantic recognition.
Findings
Achieves 27.2 PQ, 17.0 mAP, and 35.3 mIoU on the A-150 dataset.
Attains high performance on Cityscapes, Mapillary Vistas, and other benchmarks.
Supports vocabulary-free segmentation for recognizing unseen categories.
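For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how mean intersection-over-union is computed from predicted and ground-truth labels. It is a simplified illustration, not the paper's evaluation code: the function name `mean_iou` and the toy labels are assumptions, and standard benchmarks accumulate intersections and unions over the whole dataset rather than over a single flat list.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in pred or target (flat label lists)."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example (hypothetical labels, six "pixels", three classes).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)  # per-class IoUs are 1/3, 2/3 and 1/2
```

Reported mIoU values such as 35.3 are this quantity expressed as a percentage.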
Abstract
Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic…
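The "category matching" step of the existing paradigm that the abstract contrasts against can be sketched as follows. This is an illustration under stated assumptions, not the proposed framework or any specific prior method: it assumes each class-agnostic region already has an embedding and each vocabulary entry has a text embedding in the same space, and it assigns labels by cosine similarity. The names `match_regions`, `vocab`, and `regions`, and all embedding values, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_regions(region_embs, text_embs):
    """Assign each region the vocabulary entry with the highest similarity."""
    names = list(text_embs)
    return [max(names, key=lambda n: cosine(r, text_embs[n])) for r in region_embs]


# Toy 2-D embeddings (made-up values for illustration only).
vocab = {"cat": [1.0, 0.0], "road": [0.0, 1.0]}
regions = [[0.9, 0.1], [0.2, 0.8]]
labels = match_regions(regions, vocab)
print(labels)
```

In this sketch the regions are fixed before any category is considered, which is the ordering the abstract argues deviates from concept-first human recognition.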
