Open Vocabulary Panoptic Segmentation With Retrieval Augmentation
Nafis Sadeq, Qingfeng Liu, Mostafa El-Khamy

TL;DR
This paper introduces RetCLIP, a retrieval-augmented method for open vocabulary panoptic segmentation that enhances the ability to segment unseen classes by combining feature retrieval with CLIP-based classification, significantly improving performance.
Contribution
The paper proposes RetCLIP, a novel retrieval-augmented approach that improves open vocabulary panoptic segmentation, especially for unseen classes, by leveraging a masked segment feature database and combining retrieval with CLIP scores.
Findings
Achieves 30.9 PQ on ADE20k, outperforming the baseline by 4.5 points.
Demonstrates improved mAP and mIoU metrics, indicating better segmentation quality.
Effectively generalizes to unseen classes beyond training data.
Abstract
Given an input image and a set of class names, panoptic segmentation aims to label each pixel in the image with a class label and an instance label. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose RetCLIP, a retrieval-augmented panoptic segmentation method that improves performance on unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the…
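As a rough illustration of the retrieval step the abstract describes, the sketch below looks up the nearest entries in a small feature database and blends the resulting class scores with CLIP-style scores. The abstract is cut off before it states how scores are assigned, so the similarity-weighted top-k vote, the convex combination with weight `alpha`, and every function name here are assumptions made for illustration, not the authors' implementation. The toy 2-D vectors stand in for masked segment features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieval_scores(query, database, class_names, k=3):
    """Assumed scoring rule: similarity-weighted vote over the k nearest entries.

    `database` is a list of (feature_vector, class_label) pairs.
    """
    ranked = sorted(
        ((cosine(query, feat), label) for feat, label in database),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = {name: 0.0 for name in class_names}
    for sim, label in ranked:
        if label in votes:
            votes[label] += max(sim, 0.0)  # ignore negative similarities
    total = sum(votes.values())
    if total == 0.0:
        # No usable neighbours: fall back to a uniform distribution.
        return {name: 1.0 / len(class_names) for name in class_names}
    return {name: v / total for name, v in votes.items()}


def combine(retrieval, clip, alpha=0.5):
    """Assumed fusion rule: convex combination of the two score sets."""
    return {
        name: alpha * retrieval[name] + (1.0 - alpha) * clip[name]
        for name in retrieval
    }


# Toy database of (masked segment feature, class label) pairs.
database = [
    ([1.0, 0.0], "cat"),
    ([0.9, 0.1], "cat"),
    ([0.0, 1.0], "dog"),
    ([0.1, 0.9], "dog"),
]
class_names = ["cat", "dog"]
query = [0.8, 0.2]                       # feature of the segment to classify
clip_scores = {"cat": 0.4, "dog": 0.6}   # hypothetical CLIP-based scores

r_scores = retrieval_scores(query, database, class_names)
fused = combine(r_scores, clip_scores)
predicted = max(fused, key=fused.get)
print(predicted, fused)
```

In this toy run the hypothetical CLIP scores alone favor one class while the retrieved neighbors favor the other, so the assumed `alpha` decides how much the database can override CLIP.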
Peer Reviews
Decision · Submitted to ICLR 2025
- The reviewer likes the integration of retrieval-based classification with CLIP-based scores to address the domain shift issues between masked images and natural images. It clearly improves the model's ability to recognize unseen classes without additional training.
- The paper's approach to construct a feature database from widely available paired image-text data is interesting. This setup enables adaptability without requiring pixel-level annotations.
- The paper is well-organized and well-

- The reviewer feels that the retrieval-based classification relies heavily on the quality and diversity of the feature database constructed from paired image-text data. If the database lacks sufficient variety or coverage, the method may struggle to classify certain unseen classes accurately, particularly in real-world scenarios with a wide range of objects.
- Further, the reviewer observed that the method uses Grounding DINO and SAM for generating masks in the training-free setup. However, SA
1. The paper introduces a creative solution to the open vocabulary panoptic segmentation problem by combining retrieval-based classification with CLIP, which is an original approach not commonly seen in the literature.
2. The paper is well-structured, with clear explanations of the methodology.

1. While the paper demonstrates improvements over the baseline, it does not provide a direct comparison with other state-of-the-art methods in the field, which could provide additional context for the significance of the results.
2. The discussion on how the proposed method generalizes to unseen classes could be expanded, as this is a critical aspect of open vocabulary segmentation.
3. The paper could further discuss the limitations of the retrieval-augmented approach, especially regarding the r
Applying Retrieval Augmentation to vision tasks is a promising direction. The proposed way of constructing a database is interesting.
1. While the method builds on FC-CLIP, the authors do not provide an introduction to FC-CLIP, which makes the paper hard to follow during reading.
2. The feature database should be introduced prior to discussing the retrieval method to improve the flow and clarity of the paper.
3. Since retrieval augmentation is intended to be a more general approach, the paper would benefit from presenting a more generalized framework to reflect its broader applicability.
4. The method of constructing the featu
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications
