A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation
Prashant Pandey, Mustafa Chasmai, Monish Natarajan, Brejesh Lall

TL;DR
This paper introduces WLSegNet, a language-guided weakly supervised segmentation model that effectively performs open vocabulary segmentation tasks without pixel labels, outperforming existing methods on standard benchmarks.
Contribution
The paper presents a novel weakly supervised pipeline that leverages frozen CLIP features and context vectors to enable zero-shot and few-shot segmentation without pixel labels, avoiding fine-tuning.
Findings
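WLSegNet outperforms existing weakly supervised methods by 39 mIOU points in weak generalized zero-shot segmentation on PASCAL VOC.
Achieves a 3 mIOU point improvement in weak FSS on PASCAL VOC and a 5 mIOU point improvement on MS COCO.
Beats baselines by 13 and 22 mIOU points in 2-way 1-shot weak FSS on PASCAL VOC and MS COCO, respectively.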

| Supervision | Method | Venue | Seen mIOU | Unseen mIOU | Harmonic mIOU |
|---|---|---|---|---|---|
| Pixel Labels | ZS3Net [3] | NeurIPS’19 | 78.0 | 21.2 | 33.3 |
| Pixel Labels | SPNet [23] | CVPR’19 | 77.8 | 25.8 | 38.8 |
| Pixel Labels | CaGNet [22] | ACM MM’20 | 78.6 | 30.3 | 43.7 |
| Pixel Labels | SIGN [61] | CVPR’21 | 83.5 | 41.3 | 55.3 |
| Pixel Labels | STRICT [26] | CVPRW’21 | 82.7 | 35.6 | 49.8 |
| Pixel Labels | Joint [6] | ICCV’21 | 77.7 | 32.5 | 45.9 |
| Pixel Labels | ZegFormer [4] | CVPR’22 | 86.4 | 63.6 | 73.3 |
| Pixel Labels | SimSeg [1] | ECCV’22 | 79.2 | 78.1 | 79.3 |
| Pixel Labels | ZegCLIP [28] | - | 91.9 | 77.8 | 84.3 |
| Image Labels | ViL-Seg [7] | ECCV’22 | - | 37.3 | - |
| Image Labels | DSG [57] | Multimedia’22 | 57.7 | 22.0 | 31.8 |
| Image Labels | Ours (WLSegNet) | - | 86.5 | 59.9 | 70.8 |

| Supervision | Method | Venue | Fold 0 (unseen mIOU) | Fold 1 (unseen mIOU) | Fold 2 (unseen mIOU) | Fold 3 (unseen mIOU) | Mean |
|---|---|---|---|---|---|---|---|
| Pixel Labels | SPNet [23] | CVPR’19 | 23.8 | 17.0 | 14.1 | 18.3 | 18.3 |
| Pixel Labels | ZS3Net [3] | NeurIPS’19 | 40.8 | 39.4 | 39.3 | 33.6 | 38.3 |
| Pixel Labels | LSeg [5] (ResNet101) | ICLR’22 | 52.8 | 53.8 | 44.4 | 38.5 | 47.4 |
| Pixel Labels | LSeg [5] (ViT-L/16) | ICLR’22 | 61.3 | 63.6 | 43.1 | 41.0 | 52.3 |
| Pixel Labels | Fusioner [2] (ResNet101) | BMVC’22 | 46.8 | 56.0 | 42.2 | 40.7 | 46.4 |
| Image Labels | Ours (WLSegNet) | - | 47.5 | 47.3 | 39.7 | 58.5 | 48.2 |

| Supervision | Method | Venue | Fold 0 (unseen mIOU) | Fold 1 (unseen mIOU) | Fold 2 (unseen mIOU) | Fold 3 (unseen mIOU) | Mean |
|---|---|---|---|---|---|---|---|
| Pixel Labels | ZS3Net [3] | NeurIPS’19 | 18.8 | 20.1 | 24.8 | 20.5 | 21.1 |
| Pixel Labels | LSeg [5] (ResNet101) | ICLR’22 | 22.1 | 25.1 | 24.9 | 21.5 | 23.4 |
| Pixel Labels | LSeg [5] (ViT-L/16) | ICLR’22 | 28.1 | 27.5 | 30.0 | 23.2 | 27.2 |
| Pixel Labels | Fusioner [2] (ResNet101) | BMVC’22 | 26.7 | 34.1 | 26.3 | 23.4 | 27.6 |
| Image Labels | Ours (WLSegNet) | - | 8.33 | 15.3 | 26.8 | 16.0 | 16.6 |

| Supervision | Method | Venue | Fold 0 (mIOU) | Fold 1 (mIOU) | Fold 2 (mIOU) | Fold 3 (mIOU) | Mean |
|---|---|---|---|---|---|---|---|
| Pixel Labels | PANet [8] | ICCV’19 | 42.3 | 58.0 | 51.1 | 41.2 | 48.1 |
| Pixel Labels | CyCTR [62] | NeurIPS’21 | 67.2 | 71.1 | 57.6 | 59 | 63.7 |
| Pixel Labels | DPNet [9] | AAAI’22 | 60.7 | 69.5 | 62.8 | 58.0 | 62.7 |
| Pixel Labels | ASNet [10] | CVPR’22 | 68.9 | 71.7 | 61.1 | 62.7 | 66.1 |
| Pixel Labels | BAM [11] | CVPR’22 | 68.97 | 73.59 | 67.55 | 61.13 | 67.81 |
| Pixel Labels | MSANet [12] | - | 70.8 | 75.2 | 67.25 | 64.28 | 69.13 |
| B-Boxes | PANet [8] | ICCV’19 | - | - | - | - | 45.1 |
| Scribbles | PANet [8] | ICCV’19 | - | - | - | - | 44.8 |
| Image Labels | PANet [8] | ICCV’19 | 25.7 | 33.4 | 28.8 | 20.7 | 27.1 |
| Image Labels | AMP [63] | ICCV’19 | 10.6 | 14.1 | 7.6 | 10.9 | 10.8 |
| Image Labels | PFENet [64] | TPAMI’20 | 33.4 | 42.5 | 43.6 | 39.9 | 39.9 |
| Image Labels | Pix-MetaNet [56] | WACV’22 | 36.5 | 51.7 | 45.9 | 35.6 | 42.4 |
| Image Labels | Ours (WLSegNet) ResNet50 | - | 41.7 | 51.3 | 42.2 | 41.8 | 44.2 |
| Image Labels | Ours (WLSegNet) ResNet101 | - | 45.9 | 46.9 | 47.2 | 41.5 | 45.4 |

| Supervision | Method | Venue | Fold 0 (mIOU) | Fold 1 (mIOU) | Fold 2 (mIOU) | Fold 3 (mIOU) | Mean |
|---|---|---|---|---|---|---|---|
| Pixel Labels | PANet [8] | ICCV’19 | - | - | - | - | 20.9 |
| Pixel Labels | CyCTR [62] | NeurIPS’21 | 38.9 | 43.0 | 39.6 | 39.8 | 40.3 |
| Pixel Labels | DPNet [9] | AAAI’22 | - | - | - | - | 37.2 |
| Pixel Labels | ASNet [10] | CVPR’22 | - | - | - | - | 42.2 |
| Pixel Labels | BAM [11] | CVPR’22 | 43.4 | 50.6 | 47.5 | 43.4 | 46.2 |
| Pixel Labels | MSANet [12] | - | 47.8 | 57.4 | 48.6 | 50.4 | 51.1 |
| Image Labels | PANet [8] | ICCV’19 | 12.7 | 8.7 | 5.9 | 4.8 | 8.0 |
| Image Labels | Pix-MetaNet [56] | WACV’22 | 24.2 | 12.9 | 17.0 | 14.0 | 17.0 |
| Image Labels | Ours (WLSegNet) | - | 34.9 | 23.4 | 12.4 | 18.3 | 22.2 |

| Supervision | Method | Fold 0 (mIOU) | Fold 1 (mIOU) | Fold 2 (mIOU) | Fold 3 (mIOU) | Mean |
|---|---|---|---|---|---|---|
| Pixel Labels | Pix-MetaNet | 36.5 | 51.8 | 48.5 | 38.9 | 43.9 |
| Pixel Labels | PANet | - | - | - | - | 45.1 |
| Image Labels | PANet | 24.5 | 33.6 | 26.3 | 20.3 | 26.2 |
| Image Labels | Pix-MetaNet | 31.5 | 46.7 | 41.4 | 31.2 | 37.7 |
| Image Labels | Ours (WLSegNet) | 50.9 | 52.9 | 45.5 | 53.4 | 50.7 |

| Fold 0 | Fold 1 | Fold 2 | Fold 3 |
|---|---|---|---|
| aeroplane, boat, chair, diningtable, dog, person | bicycle, bus, horse, sofa | bird, car, pottedplant, sheep, train, tvmonitor | bottle, cat, cow, motorbike |

| Supervision | Method | Setting | Fold $20^0$ (mIOU) | Fold $20^1$ (mIOU) | Fold $20^2$ (mIOU) | Fold $20^3$ (mIOU) | Mean |
|---|---|---|---|---|---|---|---|
| Pixel Labels | LSeg [5] | zero-shot | 24.6 | - | 34.7 | 35.9 | 31.7 |
| Pixel Labels | Fusioner [2] | zero-shot | 39.9 | 70.7 | 47.8 | 67.6 | 56.5 |
| Image Labels | Ours (WLSegNet) | weak zero-shot | 17.6 | 50.3 | 19.5 | 52.4 | 34.9 |
| Pixel Labels | RPMM [66] | 1-way 1-shot | 36.3 | 55.0 | 52.5 | 54.6 | 49.6 |
| Pixel Labels | PFENet [67] | 1-way 1-shot | 43.2 | 65.1 | 66.5 | 69.7 | 61.1 |
| Pixel Labels | CWT [68] | 1-way 1-shot | 53.5 | 59.2 | 60.2 | 64.9 | 59.5 |
| Image Labels | Ours (WLSegNet) | weak 1-way 1-shot | 44.1 | 44.2 | 37.1 | 60.3 | 46.4 |
Abstract
Increasing attention is being diverted to data-efficient problem settings like Open Vocabulary Semantic Segmentation (OVSS) which deals with segmenting an arbitrary object that may or may not be seen during training. The closest standard problems related to OVSS are Zero-Shot and Few-Shot Segmentation (ZSS, FSS) and their Cross-dataset variants where zero to few annotations are needed to segment novel classes. The existing FSS and ZSS methods utilize fully supervised pixel-labelled seen classes to segment unseen classes. Pixel-level labels are hard to obtain, and using weak supervision in the form of inexpensive image-level labels is often more practical. To this end, we propose a novel unified weakly supervised OVSS pipeline that can perform ZSS, FSS and Cross-dataset segmentation on novel classes without using pixel-level labels for either the base (seen) or the novel (unseen) classes in an inductive setting. We propose Weakly-Supervised Language-Guided Segmentation Network (WLSegNet), a novel language-guided segmentation pipeline that i) learns generalizable context vectors with batch aggregates (mean) to map class prompts to image features using frozen CLIP (a vision-language model) and ii) decouples weak ZSS/FSS into weak semantic segmentation and Zero-Shot segmentation. The learned context vectors avoid overfitting on seen classes during training and transfer better to novel classes during testing. WLSegNet avoids fine-tuning and the use of external datasets during training. The proposed pipeline beats existing methods for weak generalized Zero-Shot and weak Few-Shot semantic segmentation by 39 and 3 mIOU points respectively on PASCAL VOC and weak Few-Shot semantic segmentation by 5 mIOU points on MS COCO. On a harder setting of 2-way 1-shot weak FSS, WLSegNet beats the baselines by 13 and 22 mIOU points on PASCAL VOC and MS COCO, respectively. 
Without using dense pixel-level annotations, our results for MS COCO ZSS are comparable to fully supervised ZSS methods. We also benchmark weakly supervised Cross-dataset Segmentation.
Keywords: Zero-Shot Segmentation, Few-Shot Segmentation, Cross-dataset Segmentation, Vision-Language Models, Generalizable Prompt Learning, Weakly Supervised Segmentation.
Journal: Pattern Recognition
Affiliations:
- Department of Electrical Engineering, Indian Institute of Technology, New Delhi, India
- Department of Computer Science, Indian Institute of Technology, New Delhi, India
- Department of Computer Science, Indian Institute of Technology, Kharagpur, India
Highlights
- First method to explore multiple related Open Vocabulary Semantic Segmentation inductive tasks in a weakly supervised setting without using external datasets or fine-tuning
- First method to handle weakly supervised generalized zero-shot segmentation, zero-shot segmentation and few-shot segmentation with a single training procedure using a frozen vision-language model
- Proposes a novel and scalable mean instance aware prompt learning that generates highly generalizable prompts, handles domain shift across datasets and generalizes efficiently to unseen classes
- The flexible design allows easy modification and optimization of different components as and when required
- The proposed method beats existing weakly supervised baselines by large margins while being competitive with pixel-based methods
1 Introduction
Since its inception, semantic segmentation has attracted considerable attention, and remarkable progress has been made in the field. The advent of deep learning led to a new era with methods aiming to surpass human performance. However, most semantic segmentation methods are trained with dense pixel-level annotations and require large numbers of examples for every category or task. Humans, in contrast, are able to recognize new objects, or at least have some context about them, without ever having seen them before. Although natural to humans, this task requires a complex understanding of the semantic meaning of a class/category never seen by a learner and a capacity to generalize knowledge gained from seen classes. Most fully supervised methods, although performing comparably with humans on seen objects, struggle to generalize to unseen objects. Obtaining large amounts of fully annotated data for every target class can be extremely expensive and often impractical, so models need to be able to generalize to unseen classes.
To bridge the gap between human and artificial learning, increasing attention is being diverted to more challenging and data-efficient settings like Open Vocabulary Semantic Segmentation (OVSS) [1, 2]. In this paper, we focus on closely related standard OVSS settings like Generalized Zero-Shot Segmentation (GZSS), Zero-Shot Segmentation (ZSS), Few-Shot Segmentation (FSS) and Cross-dataset Segmentation in an inductive setting (unlabeled pixels and novel class names are not observed during training) as opposed to the transductive setting (unlabeled pixels and novel class names may be observed during training). In ZSS [3, 4, 5, 6, 7], a model is provided with a set of base (seen) classes to learn from and is then expected to perform well on novel (unseen) classes it does not have access to. A commonly used setting, Generalized ZSS (GZSS), further requires that the model retain its performance on base classes in addition to novel classes. Similar to how humans can relate visual understanding of classes with similar-in-meaning names or categories, ZSS methods generalize semantic visual information using the semantic textual information provided by language models. Another slightly relaxed data-efficient setting is FSS [8, 9, 10, 11, 12, 13, 14], where the model is expected to generalize to unseen classes but is additionally given a few support images with annotated unseen target classes. Typical FSS methods demonstrate admirable performance using one to five support examples for every unseen category.
Besides ZSS and FSS, many other problem settings aim to reduce the burden of large-scale annotations. A particularly interesting approach is Weakly Supervised Segmentation (WSS) [15, 16, 17, 18], where costly pixel-level annotations for training classes are replaced with relatively inexpensive weak labels like scribbles and bounding boxes. A particularly challenging setting here is that of image tags, where every image is accompanied by only the information of classes present in it. Without any information allowing the model to localize objects, this setting is perhaps the hardest for WSS.
In this paper, we explore the challenging and practical problems of weakly supervised ZSS (WZSS) and weakly supervised FSS (WFSS). With an expectation of generalization to unseen classes and reliance on only weak image-level labels, these settings greatly reduce the annotation cost and assess a method’s performance in challenging scenarios commonly faced by humans. A clear rift is evident between existing ZSS and FSS methods: ZSS methods leverage language model-based learning and try to learn mappings between visual and textual features, while FSS methods tend to employ matching-based approaches that search for semantically similar features between support and query images. We argue that when using weak labels, the Few-Shot tasks can also be decoupled into WSS for learning to segment seen categories and ZSS for generalizing this learning to unseen categories. With this, we propose Weakly-Supervised Language-Guided Segmentation Network (WLSegNet), a unified method that can perform both WZSS and WFSS with a single training procedure. We also benchmark the weakly supervised Cross-dataset segmentation setting, where we train with weak image-level labels on one dataset (like MS COCO) and test on novel classes of a completely different dataset (like PASCAL VOC).
We further address limitations like overfitting on seen classes and large computational requirements reported by existing prompt-based learning methods [19, 20]. We employ batch aggregate (mean) image features to make learnable prompts image-aware, while maintaining low computational requirements. The learned prompts avoid overfitting on seen classes and generalise well to novel classes without aid from external datasets or fine-tuning. An overview of our proposed prompt learning approach is shown in Figure 2. In summary, our contributions are fourfold:
1. We propose a model to perform WZSS, WFSS and Cross-dataset segmentation in a unified manner in an inductive setting. To the best of our knowledge, we are the first to tackle these challenging yet impactful problems together while avoiding fine-tuning and the use of external datasets, relying on a frozen vision-language model (CLIP [21]).
2. We propose an optimal pipeline that decouples the WZSS and WFSS problems into WSS and ZSS. This facilitates the optimization of WSS, mask proposal generation, and vision-language models, separately.
3. We propose a novel mean instance aware prompt learning method that makes prompts more generalizable and less prone to overfitting while scaling the prompt learning stage to larger batch sizes and improving its computation speed.
4. We perform extensive experiments on the widely used PASCAL VOC and COCO datasets for WZSS and WFSS, beating previous weakly supervised baselines by large margins and obtaining results competitive with methods using strong supervision. We benchmark Cross-dataset segmentation by training with image-level labels of the COCO dataset and testing on novel classes of PASCAL VOC.
2 Related Work
2.1 Zero-Shot Segmentation
Existing zero-shot methods are broadly generative or discriminative, and some further incorporate self-training to capture latent features of novel classes. Generative methods include ZS3Net [3], which trains a generator to produce synthetic features for unseen classes that are used along with real features of seen classes to train the classifier head, and CaGNet [22], wherein feature generation is guided by contextual information present in the image. Discriminative methods like SPNet [23] and LSeg [5] map the pixel-level features of an image to the word embeddings of its class obtained from pre-trained word encoders such as word2vec [24] or fastText [25]. STRICT [26] employs SPNet as a pseudo-label generator coupled with consistency regularization to improve Zero-Shot performance. Recent methods such as ZegFormer [4] and SimSeg [1] first learn to generate class-agnostic mask proposals using MaskFormer [27], and then classify proposal regions using knowledge of pre-trained vision-language models such as CLIP [21]. ZegCLIP [28] proposes a one-stage approach that directly extends CLIP’s zero-shot prediction capability from image to pixel level.
2.2 Weakly Supervised Segmentation
Weakly Supervised Segmentation methods deal with the practical setting of generating segmentation masks with models trained with weak forms of supervision such as bounding-box [29, 30, 31], scribbles [32, 33, 34], points [35] and image-level labels. Here we focus on WSS methods using image-level labels only. A commonly used strategy is to train a classifier and then use Class Activation Maps (CAMs) to obtain pixel-level pseudo labels. Some recent methods try to expand the initial seed regions highlighted by CAMs via adversarial erasing [36, 37, 38], region growing [39, 40, 41], random-walk [33, 42, 43] and stochastic inference [44] to name a few. Some methods refine initially coarse attention maps by trying to maximize object region and minimise background coverage. These include EPS [18] which uses supervision of saliency-maps from off-the-shelf saliency detectors to guide learning and RSCM [45] which incorporates high-order feature-correlation and improves masks through seed recursion. CIAN [46] propagates pixel-wise affinity to pixels in the neighbourhood. RCA [17] maintains a memory bank for storing object features across the images in the dataset which serves as a support during pseudo-label generation. L2G [15] transfers features learned by local region-wise classifiers to a global classification network, thus capturing greater detail and obtaining higher-quality attention maps.
2.3 Semantic Embeddings and Language Models
Transfer of knowledge from seen to unseen classes requires auxiliary information. Such information can be provided by the semantic embeddings of class names obtained from word encoders like word2vec [24] and fastText [25], which are trained on large-scale word datasets without human annotation and rely on simple principles, such as words that often occur in similar contexts having closer feature representations. More recently, transformer-based vision-language models such as CLIP [21] and ALIGN [47] have been pre-trained on large-scale image-text pairs from the web in a contrastive manner for zero-shot classification. The key idea in retrieving features in CLIP is to pass a sentence containing a class name along with context information, which may be a predefined prompt template [4] or learned [19, 20]. [19] shows that dataset-specific context information in prompt templates improves Zero-Shot classification accuracy using CLIP. In our work, we adopt CLIP and propose a novel prompt-learning technique that incorporates instance-specific context using a batch mean of input features in addition to dataset-specific context learning. A parallel line of work [48, 49, 50, 51, 52, 53] uses image-level labels or captions to pre-train or fine-tune vision-language or language models (like ALIGN, CLIP, BERT [54]), which requires large-scale external datasets, or performs transductive segmentation [55]. Fusioner [2], a cross-modality fusion module, explicitly bridges a variety of self-supervised pre-trained visual/language models for open-vocabulary semantic segmentation.
2.4 Zero and Few-shot Segmentation with Weak Supervision
Very few works have explored the practical setting of Zero and Few-Shot segmentation with weak annotations. [56] follows a meta-learning approach for few-shot segmentation. For a support image and a given weak label, it generates CAMs for a set of seen classes using a pre-trained network, then performs a weighted summation of these with the weights proportional to similarities of the textual features obtained by word2vec. Similarly, [57] first proposed the setting of WZSS using only image labels for seen classes as supervision. Another line of work is open-world segmentation [7] where models are trained using large-scale image captioning datasets without a need for dense pixel annotations. We take inspiration from previous works and propose a novel pipeline that unifies both Zero-Shot and Few-Shot segmentation using only weak labels as supervision.
3 Methodology
3.1 Problem Setting
The task of WFSS involves a training dataset $D_{train}$ and a test dataset $D_{test}$, both weakly labelled and having non-overlapping class sets. The test dataset $D_{test}$ consists of a set of episodes, with each episode containing an $N$-way $K$-shot task with a support set $S$ and a query set $Q$. The support set has $K$ image and image-level label pairs with a total of $N$ semantic classes, i.e. $S = \{(I_s^k, L_s^k)\}_{k=1}^{K}$, where $L_s^k$ is the ground-truth image tag for the $k$-th shot, and $k = 1, \dots, K$. The query set $Q$ contains the query images. The objective in each test episode is to obtain high-quality segmentation predictions for the query set $Q$, relying on the weakly labelled support set $S$, both of whose classes are never seen during training. The training dataset $D_{train}$ consists of a set of images and their corresponding image tags (weak labels). A common approach in FSS is to break $D_{train}$ down into episodes having support and query sets and then train episodically using metric learning. However, there is no restriction on the use of $D_{train}$, i.e. a method may instead decide to use the images of support and query sets in a non-episodic way.

WZSS is logically an extension of WFSS in that it is simply an $N$-way 0-shot WFSS task. However, it differs in some critical aspects in its formulation. WZSS consists of $D_{train}$ and $D_{test}$, where the training dataset is identical to that in WFSS but the testing dataset simply consists of a set of images and the set of distinct classes $C_{test}$. Let the set of all distinct classes present in the training dataset be $C_{train}$. Depending on the nature of $C_{train}$ and $C_{test}$, two different settings are possible. The first, default WZSS setting is when $C_{train} \cap C_{test} = \emptyset$, i.e. the classes during testing are disjoint from the classes seen during training. The second setting of generalised Zero-Shot (WGZSS) holds when $C_{train} \subset C_{test}$, i.e. the test dataset contains both seen and unseen classes.
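
The distinction between the two zero-shot settings comes down to a set relation between training and test classes. The following Python sketch, offered only as an illustration, encodes the verbal definitions given above (test classes disjoint from training classes versus test classes containing both seen and unseen ones); the function name and the returned labels are hypothetical.

```python
def zero_shot_setting(train_classes, test_classes):
    """Classify a train/test class configuration by the definitions in the text."""
    train, test = set(train_classes), set(test_classes)
    has_seen = bool(test & train)      # some test classes were seen in training
    has_unseen = bool(test - train)    # some test classes are novel
    if has_unseen and not has_seen:
        return "WZSS"                  # test classes disjoint from training classes
    if has_unseen and has_seen:
        return "WGZSS"                 # test set mixes seen and unseen classes
    return "no novel classes"


seen = ["cat", "dog", "car"]
print(zero_shot_setting(seen, ["sofa", "train"]))         # WZSS
print(zero_shot_setting(seen, seen + ["sofa", "train"]))  # WGZSS
```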
3.2 Pseudo-label Generation Module
We adopt L2G [15], a weakly supervised semantic segmentation method, for this task. Traditional CAMs obtained from classification networks tend to highlight only the discriminative regions of an object, making them unsuitable for semantic segmentation. L2G transfers the knowledge learnt by multiple local regional classification networks to a single global classification network, thus expanding and improving the region of focus in attention maps. In the method, a multi-target classification network is trained on the set of seen classes $C_{train}$. For a given input image $I$, the corresponding pseudo-segmentation mask $\hat{Y}$ is obtained from CAMs produced by the network. We denote the fully trained model as the Pseudo-label Generation (PLG) Module. The masks generated by the PLG are then used to train the Class-Agnostic Mask Generation (CAMG) module. A few samples of the attention maps learnt can be seen in Figure 3.
While we use L2G [15] as our pseudo-label generator, we would like to point out that it can easily be replaced by another WSS method without changing the overall architecture much due to the decoupled design. Thus, rather than being constrained by L2G [15], we can leverage advances in the field of WSS to further improve the performance of WLSegNet. We do such an experiment and observe that with RCA [17] as the pseudo-label generator, the performance does not change drastically.
3.3 Class-Agnostic Mask Generation (CAMG)
The task of Zero-Shot segmentation is broken down into class-agnostic aggregation of pixels through the CAMG module followed by CLIP classification of aggregated regions (segments). Similar to past works, we adopt MaskFormer [27], a segmentation model that generates mask proposals for different objects present in the image irrespective of the class of any object. Specifically, for a given image $I$ fed to the CAMG module, a set of $n$ class-agnostic binary mask proposals $\{m_1, \dots, m_n\}$ is generated. During training, only the pseudo-labels obtained from PLG are used as supervision in MaskFormer’s Mask Loss. For each mask or segment proposal $m_i$, we create a corresponding input proposal $I_i$ by multiplying the input image $I$ with $m_i$ to zero out the background in the corresponding segment. The input proposals $\{I_1, \dots, I_n\}$ thus obtained are then passed to CLIP for segment classification.
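
The construction of input proposals described above is an element-wise product between the image and each binary mask. As a minimal NumPy sketch (array layouts and the helper name `make_input_proposals` are assumptions for illustration, and any cropping or resizing before CLIP is left out):

```python
import numpy as np


def make_input_proposals(image, masks):
    """Zero out everything outside each mask.

    image: (H, W, 3) array; masks: (n, H, W) binary array.
    Returns an (n, H, W, 3) array with one masked copy of the image per proposal.
    """
    image = np.asarray(image)
    masks = np.asarray(masks).astype(image.dtype)
    # Broadcast: (1, H, W, 3) * (n, H, W, 1) -> (n, H, W, 3)
    return image[None, ...] * masks[..., None]


image = np.full((4, 4, 3), 5.0)
masks = np.zeros((2, 4, 4))
masks[0, :2, :] = 1   # first proposal covers the top half
masks[1, 2:, :] = 1   # second proposal covers the bottom half
proposals = make_input_proposals(image, masks)
print(proposals.shape)  # (2, 4, 4, 3)
```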
Instead of MaskFormer [27], the CAMG module can use other methods like GPB-UCM [58] and Selective Search [59] as well. Since [1] showed superior performance of MaskFormer, we perform all our experiments with it.
3.4 CLIP language model
For Zero-Shot classification of an image using CLIP, we first feed the CLIP text encoder with a corresponding text label in the form of a natural sentence for each class present. This can be a simple prompt such as “photo of a {class}” where {class} is replaced by the appropriate class name. However, such simple fixed prompts fail to capture context information present in the image and this affects the performance of the downstream task of segment classification.
3.4.1 Prompt Learning
To overcome the limitations of fixed prompts, many recent learned prompt works [19, 60] propose to incorporate dataset-specific context information by using learnable prompt context vectors that can be catered to work best for each particular class. These learned prompts are biased towards seen classes. [20] further improved upon this by using an additional input-conditional token, making the prompts less sensitive to class shift and thus, more generalizable to unseen classes. However, they report increased computation and resulting restrictions on the batch size. We overcome these limitations and propose a mean instance aware prompt learning strategy to learn better and more generalizable prompts. A brief overview of the different prompt strategies can be seen in Figure 4 for a better comparison.
Specifically, our prompt learning approach learns a context vector $V$ to capture context information of the dataset. $V$ is constructed such that it contains $k$ prompt tokens, each of dimension $d$, expressed by $V = [v_1, v_2, \dots, v_k]$ with $v_j \in \mathbb{R}^d$. The class prompt proposal $P_c$ is then obtained by concatenating the context vector with the class embedding of class $c$, such that $P_c = [V, e_c]$, where $e_c$ represents the class embedding. For a given image $x_i$ in an input batch of size $B$, let $f_i$ be the image embedding obtained from the pre-trained CLIP Image Encoder. The image embedding is passed through a shallow neural network represented by $h_\theta$ to obtain the instance-wise features $h_\theta(f_i)$. We then obtain the mean batch feature prototype $\mu$ as shown in Eq (1).

$$\mu = \frac{1}{B} \sum_{i=1}^{B} h_\theta(f_i) \qquad (1)$$

Finally, for a given batch the mean instance aware class prompt $\hat{P}_c$ is obtained as shown in Eq (2), which is fed to the pre-trained CLIP text encoder for Zero-Shot classification.

$$\hat{P}_c = [V + \lambda M_\mu, \; e_c] \qquad (2)$$

$M_\mu$ is a matrix with each column as $\mu$, repeated $k$ times. The extent of $M_\mu$ added to $V$ is controlled by a hyperparameter $\lambda$. The class prediction probability for the given image is computed as shown in Eq (3), where $t_c$ represents the text embedding of $\hat{P}_c$ obtained from the CLIP text encoder, $C$ is the number of classes, $\cos(\cdot, \cdot)$ is cosine similarity and $\tau$ is the temperature coefficient.

$$p(y = c \mid x_i) = \frac{\exp\left(\cos(f_i, t_c)/\tau\right)}{\sum_{j=1}^{C} \exp\left(\cos(f_i, t_j)/\tau\right)} \qquad (3)$$

The class predictions, combined with mask proposals, are then aggregated to obtain the semantic segmentation output. This provides more generalizable prompts compared to existing methods shown in Figure 4. WLSegNet facilitates scaling of prompt learning to larger batch sizes and is thereby computationally faster than [20], while also being less prone to overfitting on seen classes.
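
To make the flow of this subsection concrete, the NumPy sketch below walks through the three steps on random data: averaging instance features over the batch, shifting every context token by the scaled mean, and scoring classes with a temperature-scaled softmax over cosine similarities. It is a toy under stated assumptions rather than the authors' implementation: `toy_text_encoder` is a mean-pooling stand-in for the frozen CLIP text encoder, the hidden width of 32 and the number of context tokens are arbitrary, and all array names are hypothetical. The values 0.01 for the trade-off and temperature, the 512-dimensional embeddings and the two-layer ReLU network follow the Implementation Details section.

```python
import numpy as np


def h_theta(f, W1, b1, W2, b2):
    """Shallow network: two fully connected layers separated by a ReLU."""
    return np.maximum(f @ W1 + b1, 0.0) @ W2 + b2


def mean_instance_aware_prompts(V, class_emb, inst_feats, lam):
    """V: (k, d) context tokens; class_emb: (C, d); inst_feats: (B, d)."""
    mu = inst_feats.mean(axis=0)                  # batch mean prototype, shape (d,)
    V_hat = V + lam * mu[None, :]                 # add the same mean to all k tokens
    n_cls = class_emb.shape[0]
    ctx = np.broadcast_to(V_hat, (n_cls,) + V_hat.shape)          # (C, k, d)
    return np.concatenate([ctx, class_emb[:, None, :]], axis=1)   # (C, k + 1, d)


def toy_text_encoder(prompts):
    """Placeholder for the frozen CLIP text encoder (assumption: mean pooling)."""
    return prompts.mean(axis=1)


def class_probabilities(image_emb, text_emb, tau):
    """Softmax over cosine similarities divided by the temperature."""
    a = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    b = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = (a @ b.T) / tau
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
B, d, k, C, hidden = 4, 512, 8, 3, 32
lam, tau = 0.01, 0.01
f = rng.normal(size=(B, d))                       # stand-in for CLIP image embeddings
W1, b1 = 0.05 * rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = 0.05 * rng.normal(size=(hidden, d)), np.zeros(d)
V = rng.normal(size=(k, d))
e = rng.normal(size=(C, d))

prompts = mean_instance_aware_prompts(V, e, h_theta(f, W1, b1, W2, b2), lam)
probs = class_probabilities(f, toy_text_encoder(prompts), tau)
print(prompts.shape, probs.shape)  # (3, 9, 512) (4, 3)
```

Note that only one set of $C$ prompts is built per batch, so the text encoder runs once per batch rather than once per image, which is consistent with the scaling argument made above.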
3.5 Mask Aggregation
For every class $c$, the mean instance aware class prompt $\hat{P}_c$ is passed to the CLIP text encoder to obtain the semantic embedding $t_c$, which is used as the weights for classifying the segment embeddings of $\{I_1, \dots, I_n\}$ obtained from the CLIP Image Encoder. Note that some regions in the different proposals may overlap. Thus, the segmentation map $Z$ is obtained by aggregating the different classified proposals as shown in Eq (4).

$$Z_c(q) = \frac{\sum_{i} m_i(q) \, p_i(c)}{\sum_{c'} \sum_{i} m_i(q) \, p_i(c')} \qquad (4)$$

Here, $m_i(q)$ represents the predicted probability of pixel $q$ belonging to the $i$-th mask proposal $m_i$, and $p_i(c)$ is the predicted probability of mask proposal $m_i$ belonging to the $c$-th category. This pixel-wise class probability is the final semantic segmentation output.
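
The aggregation step can be sketched in a few lines of NumPy. The version below weights each proposal's class probabilities by the pixel's membership in that proposal and then normalizes over classes; the normalization and the small `eps` guard are assumptions made for this illustration, and the function name is hypothetical.

```python
import numpy as np


def aggregate_proposals(mask_probs, class_probs, eps=1e-8):
    """Combine per-proposal class scores into a per-pixel class map.

    mask_probs: (n, H, W), probability that each pixel belongs to proposal i.
    class_probs: (n, C), probability that proposal i belongs to class c.
    Returns (C, H, W) per-pixel class probabilities.
    """
    scores = np.einsum("nhw,nc->chw", mask_probs, class_probs)
    return scores / (scores.sum(axis=0, keepdims=True) + eps)


# Two proposals splitting a 2x4 image into left and right halves.
mask_probs = np.zeros((2, 2, 4))
mask_probs[0, :, :2] = 1.0
mask_probs[1, :, 2:] = 1.0
class_probs = np.array([[0.9, 0.1],    # left proposal is mostly class 0
                        [0.2, 0.8]])   # right proposal is mostly class 1
seg = aggregate_proposals(mask_probs, class_probs)
labels = seg.argmax(axis=0)            # class 0 on the left half, class 1 on the right
print(labels.shape)  # (2, 4)
```

Where proposals overlap, each pixel receives a weighted vote from every proposal that covers it, so no hard assignment of pixels to a single proposal is needed.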
3.6 Weak Zero and Few-Shot Inference
Our method unifies both Zero-Shot and Few-Shot segmentation objectives with common training but different strategies for inference. During (generalized) Zero-Shot testing, for each input image the model segments pixels into seen and unseen classes. The prompts used by CLIP are kept the same for all images, containing one prompt for each class. On the other hand, in the Few-Shot evaluation, the only classes predicted for a particular query are those present in the weak label of the support of a particular task. Thus, the set of prompts used by CLIP varies across different tasks. This subtle difference can also be seen in Figure 1. Additionally for Few-Shot inference, we utilize saliency maps generated from off-the-shelf saliency detectors, as done in prior WSS works like EPS [18]. The saliency maps help refine the prediction of the difficult-to-describe background class while maintaining predictions of the foreground.
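
The difference between the two inference modes lies only in which class names are turned into prompts. A small Python sketch of that selection logic follows; the helper names are hypothetical, and the fixed template from Section 3.4 is used purely for readability in place of the learned prompts.

```python
def zero_shot_classes(seen, unseen, generalized):
    """Zero-shot testing: the same class list is used for every test image."""
    if generalized:
        return sorted(set(seen) | set(unseen))
    return sorted(set(unseen))


def few_shot_classes(support_tags):
    """Few-shot testing: classes come from the weak labels of one task's support images."""
    classes = set()
    for tags in support_tags:
        classes |= set(tags)
    return sorted(classes)


def to_prompts(classes, template="photo of a {}"):
    """Fill a fixed template; the learned prompts would replace this step."""
    return [template.format(name) for name in classes]


print(to_prompts(zero_shot_classes(["cat", "dog"], ["sofa"], generalized=True)))
# ['photo of a cat', 'photo of a dog', 'photo of a sofa']
print(to_prompts(few_shot_classes([{"sheep"}, {"train"}])))
# ['photo of a sheep', 'photo of a train']
```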
An overview of the complete procedure can be seen in Figure 5. Note how PLG training is completely decoupled from the rest of the method. This structure has certain desirable qualities. Since the tasks of generating good pseudo labels for seen classes and generalization to unseen classes are completely decoupled, they can be developed or optimized independently. Besides, the unified approach greatly reduces the training cost since a single training is sufficient for evaluation on WGZSS, WZSS, WFSS and Cross-dataset settings.
4 Implementation Details
We use one Nvidia A100 GPU to conduct our experiments on PASCAL VOC and 6 1080 GPUs for our experiments on MS COCO. For pixel pseudo labelling, we work with the ResNet38 backbone commonly used in WSS literature. For mask proposals, we use a ResNet50 backbone for COCO and a ResNet101 backbone for PASCAL VOC, while for the CLIP language model, we use a ViT-B/16 backbone. CLIP remains frozen during our training, and we initialise it with pre-trained weights trained on publicly available image-caption data. All COCO experiments were trained on 6 GPUs with a batch size of 32, while PASCAL VOC experiments had a batch size of 16. We choose the value 0.01 for the tradeoff hyperparameter $\lambda$ and 0.01 for the temperature $\tau$. For the shallow network $h_\theta$, we use 2 fully connected layers separated by a ReLU. Embedding dimensions of both image and text are 512, and the size of dense layers is chosen to ensure that dimensions match. Other relevant hyperparameters are kept the same as in previous works [15, 1]. All our implementations can be found here: https://github.com/mustafa1728/WLSegNet.
5 Experiments and Results
5.1 Datasets
We perform experiments with PASCAL VOC and MS COCO datasets, keeping settings similar to previous works. Our evaluation metrics closely follow the conventions used in [56].
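For readers unfamiliar with the metric, the following is a minimal sketch of mean intersection-over-union (mIOU) on flattened label maps. It assumes IoU is computed per class and averaged over the classes that occur in either the prediction or the ground truth; benchmark protocols differ in how they accumulate over images and folds, so this is not a reproduction of the evaluation code of [56]. The function name `mean_iou` is hypothetical.

```python
from typing import Iterable, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], classes: Iterable[int]) -> float:
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in classes:
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes: class 0 has IoU 1/2, class 1 has IoU 2/3.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1]))
```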
5.1.1 PASCAL VOC 2012
This dataset consists of 11185 training images and 1449 validation images, with a total of 20 semantic classes. To compare with WFSS and WZSS methods, we use the Pascal-5i splits commonly used in FSS literature. From the dataset of 20 classes, 4 folds are created by splitting the classes such that in each fold, 15 classes are seen during training and 5 are reserved as novel classes for testing. Most previous works employing generalised ZSS use a fixed set of seen and unseen classes (classes 1-15 seen, 16-20 unseen). We use the same split when comparing with these WGZSS and GZSS methods. While a model predicts all classes (seen and unseen) in these generalised settings, ZSS and WZSS on a particular fold of Pascal-5i involve the prediction of only unseen classes. This differs from the unseen-mIOU in GZSS primarily in the number of classes being predicted at a time, and one can expect similar performance in both. The model is trained on the training set images with seen classes retained and unseen classes ignored. Evaluation is on the validation images with the novel (and, for WGZSS, also the seen) classes retained.
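The fold construction described above can be sketched as follows. The snippet assumes the usual convention in FSS literature in which fold i holds out a contiguous block of five class indices (5i+1 to 5i+5, with classes numbered 1 to 20); the paper does not spell out the index assignment in this section, so treat that ordering, and the name `pascal_fold`, as assumptions.

```python
from typing import List, Tuple

NUM_CLASSES = 20  # PASCAL VOC semantic classes
NUM_FOLDS = 4     # each fold reserves 5 classes as novel


def pascal_fold(fold: int) -> Tuple[List[int], List[int]]:
    """Return (seen, novel) class indices for one fold, indices in 1..20."""
    if not 0 <= fold < NUM_FOLDS:
        raise ValueError(f"fold must be in [0, {NUM_FOLDS - 1}], got {fold}")
    per_fold = NUM_CLASSES // NUM_FOLDS
    # Assumed convention: contiguous block of 5 class indices per fold.
    novel = list(range(fold * per_fold + 1, (fold + 1) * per_fold + 1))
    seen = [c for c in range(1, NUM_CLASSES + 1) if c not in novel]
    return seen, novel


for i in range(NUM_FOLDS):
    seen, novel = pascal_fold(i)
    print(f"fold {i}: {len(seen)} seen, novel = {novel}")
```

Under this assumed ordering, the last fold holds out classes 16-20, which coincides with the fixed seen/unseen split used for the generalised setting.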
5.1.2 MS COCO 2014
This dataset consists of a total of 82081 training images and 40137 validation images, with a total of 80 semantic classes. Similar to PASCAL VOC, we employ the COCO-20i splits used in the literature, with 4 folds created by splitting the classes into 60 seen and 20 unseen classes.
5.2 Results and Discussion
We have selected pixel-level and weakly supervised Zero and Few-shot segmentation methods as our baselines. Also, we compare with Open Vocabulary Segmentation methods like SimSeg [1] and Fusioner [2] that perform segmentation in an inductive setting without pre-training/fine-tuning the vision-language/language models with large-scale external datasets.
5.2.1 Weakly Supervised Zero-Shot Segmentation
The performance of our approach in the WGZSS and WZSS settings can be seen in Table 1, Table 2 and Table 3. This domain is highly under-explored and we do not have baselines that strictly follow the same setting. Nonetheless, we compare WLSegNet with strongly supervised methods. Table 1 shows that WLSegNet, using only image labels, beats 6 of the 9 baselines that use dense pixel labels. DSG [57] works on the same WZSS setting we explore, and our method outperforms it by 28.8, 37 and 39 mIOU points for the seen, unseen and harmonic IOU measures, respectively. DSG [57] does not report results on COCO or COCO-stuff, so we do not have a comparable baseline for this dataset. Nevertheless, our results on PASCAL VOC and COCO are comparable with strongly supervised baselines, as can be seen in Table 2 and Table 3, respectively.
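The harmonic measure reported next to the seen and unseen mIOU is, as is standard in generalized zero-shot evaluation, the harmonic mean of the two. A short sketch follows, with a hypothetical function name, checked against the ZS3Net row of the pixel-supervised comparison (78.0 seen, 21.2 unseen, 33.3 harmonic).

```python
def harmonic_miou(seen: float, unseen: float) -> float:
    """Harmonic mean of seen and unseen mIOU; 0.0 if both are zero."""
    if seen + unseen == 0:
        return 0.0
    return 2 * seen * unseen / (seen + unseen)


# ZS3Net [3]: seen 78.0, unseen 21.2 -> reported harmonic 33.3
print(round(harmonic_miou(78.0, 21.2), 1))
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot score well on it by excelling on seen classes alone.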
5.2.2 Weakly Supervised Few-Shot Segmentation
The performance of our approach in WFSS can be seen in Table 4 and Table 5. As evident from the results, we beat all methods using weak supervision by at least 7% mIOU on PASCAL VOC and at least 30% mIOU on COCO. Besides the commonly used 1-way 1-shot setting, we also experiment with 2-way 1-shot FSS in Table 6 and Table 7. Again, we beat weakly supervised baselines by large margins: our performance exceeds all baselines by at least 13 and 22 mIOU points for PASCAL VOC and COCO, respectively. In the {1,2}-way 5-shot setting, WLSegNet clearly outperforms image-level baselines while remaining competitive with methods that use pixel-level supervision, as observed in Figure 6 to Figure 8.
5.2.3 Cross-dataset Segmentation
Following the setting of [65], we evaluate the performance of WLSegNet on the novel classes of PASCAL VOC with the COCO-trained model without fine-tuning. These experiments test the ability of WLSegNet to handle the domain shift between the classes of the two datasets in the WZSS and WFSS settings. The novel PASCAL VOC classes are shown in Table 8. The categories in fold i are the novel classes in PASCAL VOC after removing the seen classes in the corresponding training split on fold i of COCO-20i. We benchmark the performance of WLSegNet on the Cross-dataset setting in Table 9. The results show that, even with domain shift, the generalizable prompts learned with WLSegNet deliver performance competitive with pixel-based methods.
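The way the cross-dataset novel classes are obtained is a set difference: PASCAL VOC classes minus the classes seen while training on the corresponding COCO fold. The sketch below illustrates only that operation. The class names are toy placeholders, not the contents of Table 8, and the helper name is hypothetical; a real implementation would also need to reconcile naming differences between the two datasets.

```python
from typing import Iterable, List


def cross_dataset_novel(voc_classes: Iterable[str], coco_seen: Iterable[str]) -> List[str]:
    """VOC classes that were not seen during training on a COCO fold."""
    seen = set(coco_seen)
    # Keep the original VOC ordering for reproducible reporting.
    return [c for c in voc_classes if c not in seen]


# Toy placeholder names, for illustration only.
voc_toy = ["bird", "boat", "bus", "cat"]
coco_seen_toy = ["boat", "cat", "kite"]
print(cross_dataset_novel(voc_toy, coco_seen_toy))
```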
5.2.4 Qualitative Analysis
We visualize the masks predicted by WLSegNet in different settings; the results are compelling in both weak Few-Shot and Zero-Shot Segmentation. As observed from Figure 9, the proposed prompt learning strategy is able to capture complex objects, while other strategies fail to segment the desired seen/unseen classes in the weak Generalized Zero-Shot Segmentation (WGZSS) setting. Similarly, in the comparatively harder 2-way setting (Figure 12) with a large-scale dataset like COCO, WLSegNet is able to segment target classes of different sizes, closely matching the Ground Truth (GT).
5.3 Ablation Studies
We perform further experiments to understand the relative contributions of the various components in our approach. First, we experiment with different strategies to get text prompts. We try fixed prompts, where a single prompt template is used for all classes; ImageNet prompts, where a prompt template is chosen randomly from 80 prompts designed for ImageNet; Learned prompts, similar to the ones used in [1] and finally ours. Figure 13 (left) shows that our prompt learning method performs better for the unseen classes resulting in the highest harmonic mIOU. In Figure 13 (right), we analyse the performance for different values of the hyperparameter as used in Eq (2).
While experimenting with different batch sizes, we observe that the method is not sensitive to changes here. We evaluate the performance of WLSegNet (Table 10) by varying the mask proposal generation methods in the CAMG module, CLIP backbones and pseudo-label generation methods in the PLG module. These experiments helped to design and optimize the CAMG and PLG modules and in the selection of the backbone architecture for CLIP. Finally, we visualize the features of images obtained from the image encoder and the text features of prompts of different classes in Figure 14. The image features do roughly form clusters in the feature space and red text features corresponding to the classes are roughly aligned with the centres of these clusters. The clusters are better formed and text features are better aligned in the plot on the right, further demonstrating the generalizability of the learned prompts. All these ablation studies are performed on the PASCAL VOC dataset in the WGZSS setting.
6 Conclusion
Data-efficient problem settings like Open Vocabulary Semantic Segmentation (OVSS) are of utmost importance for an intelligent model because similar difficulties exist in many real-world scenarios. Extensive research is being done to develop novel methods that require significantly lower annotation costs while maintaining expected standards of performance. We explore one such challenging domain (OVSS), where a model is expected to generalize to a wide range of classes it never sees during training while relying only on relatively inexpensive weak annotations and vision-language models like CLIP. In a unified approach to weakly supervised Zero and Few-Shot segmentation, we overcome certain limitations reported by existing works and learn a label-efficient model and prompts that are highly generalizable to unseen classes. The superior performance of our method is corroborated by extensive experimentation on two large-scale datasets. We hope this work will promote further research in this relatively under-explored domain and provide a strong baseline to benchmark new methods.
References
- [1] M. Xu, Z. Zhang, F. Wei, Y. Lin, Y. Cao, H. Hu, X. Bai, A simple baseline for open vocabulary semantic segmentation with pre-trained vision-language model, in: Proceedings of the European Conference on Computer Vision (ECCV), 2022.
- [2] C. Ma, Y. Yang, Y. Wang, Y. Zhang, W. Xie, Open-vocabulary semantic segmentation with frozen vision-language models, arXiv preprint arXiv:2210.15138 (2022).
- [3] M. Bucher, T.-H. Vu, M. Cord, P. Pérez, Zero-shot semantic segmentation, Advances in Neural Information Processing Systems 32 (2019).
- [4] J. Ding, N. Xue, G.-S. Xia, D. Dai, Decoupling zero-shot semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11583–11592.
- [5] B. Li, K. Q. Weinberger, S. Belongie, V. Koltun, R. Ranftl, Language-driven semantic segmentation, in: International Conference on Learning Representations, 2022.
- [6] D. Baek, Y. Oh, B. Ham, Exploiting a joint embedding space for generalized zero-shot semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9536–9545.
- [7] Q. Liu, Y. Wen, J. Han, C. Xu, H. Xu, X. Liang, Open-world semantic segmentation via contrasting and clustering vision-language embedding, arXiv preprint arXiv:2207.08455 (2022).
- [8] K. Wang, J. H. Liew, Y. Zou, D. Zhou, J. Feng, PANet: Few-shot image semantic segmentation with prototype alignment, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9197–9206.
