CLAREL: Classification via retrieval loss for zero-shot learning
Boris N. Oreshkin, Negar Rostamzadeh, Pedro O. Pinheiro, and Christopher Pal

TL;DR
CLAREL introduces a novel instance-based deep metric learning method with semantic supervision for zero-shot learning, significantly improving fine-grained cross-modal classification performance, especially in generalized zero-shot settings.
Contribution
It demonstrates that per-image semantic supervision enhances zero-shot performance and provides a probabilistic basis for metric rescaling in generalized zero-shot learning.
Findings
Outperforms existing methods on CUB and FLOWERS datasets
Improves zero-shot classification accuracy
Addresses classifying unseen classes effectively
Abstract
We address the problem of learning fine-grained cross-modal representations. We propose an instance-based deep metric learning approach in joint visual and textual space. The key novelty of this paper is that it shows that using per-image semantic supervision leads to substantial improvement in zero-shot performance over using class-only supervision. On top of that, we provide a probabilistic justification for a metric rescaling approach that solves a very common problem in the generalized zero-shot learning setting, i.e., classifying test images from unseen classes as one of the classes seen during training. We evaluate our approach on two fine-grained zero-shot learning datasets: CUB and FLOWERS. We find that on the generalized zero-shot classification task CLAREL consistently outperforms the existing approaches on both datasets.
| Metric rescaling | Retrieval loss mixing | Classifier loss mixing | cub u | cub s | cub H | flowers u | flowers s | flowers H |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.5 | 0.5 | 38.3 | 65.3 | 48.3 | 55.1 | 84.6 | 66.7 |
| 0.0 | 0.5 | 0.0 | 39.3 | 57.5 | 46.7 | 54.0 | 78.1 | 63.8 |
| ✓ | 0.5 | 0.0 | 53.8 | 49.6 | 51.6 | 71.7 | 67.2 | 69.4 |
| ✓ | 0.0 | 0.5 | 47.4 | 36.6 | 41.3 | 51.5 | 60.5 | 55.6 |
| ✓ | 1.0 | 0.5 | 53.9 | 53.8 | 53.8 | 69.5 | 73.9 | 71.6 |
| ✓ | 0.5 | 0.5 | 59.3 | 52.6 | 55.8 | 73.0 | 73.6 | 73.3 |
CLAREL: Classification via retrieval loss for zero-shot learning
Boris N. Oreshkin
Element AI
Negar Rostamzadeh
Element AI
Pedro O. Pinheiro
Element AI
Christopher Pal
Element AI
Abstract
We address the problem of learning cross-modal representations. We propose an instance-based deep metric learning approach in joint visual and textual space. The key novelty of this paper is that it shows that using per-image semantic supervision leads to substantial improvement in zero-shot performance over using class-only supervision. We also provide a probabilistic justification and empirical validation for a metric rescaling approach to balance the seen/unseen accuracy in the GZSL task. We evaluate our approach on two fine-grained zero-shot datasets: cub and flowers.
1 Introduction
Deep learning-based approaches have demonstrated superior flexibility and generalization capabilities in information processing on a wide variety of tasks, such as vision, speech and language [15]. However, it has been widely recognized that transferring deep representations to real-world applications is challenging because of the typical reliance on massive hand-labeled datasets. Learning in the low-labeled data regime, especially in the zero-shot [24] and the few-shot [25] setups, has recently received significant attention in the literature. In the problem of zero-shot learning (ZSL), the objective is to recognize categories that have not been seen during training [14] via modality alignment. This is an especially relevant problem as machine learning is challenged with the long tail of classes, and the idea of learning from pairs of images and sentences, abundant on the web, looks like a natural solution. Therefore, in this paper we specifically target the fine-grained scenario of paired images and their respective text descriptions. What makes this scenario unique is that the co-occurrence of image and text provides a rich source of information. The ways of leveraging this source have not been sufficiently explored in the context of ZSL.
In this paper, we specifically target the fine-grained visual description scenario, as defined by Reed et al. [20]. Concretely, given a training set of image, text and label tuples, we are interested in finding a representation of the image and a representation of the text, each with its own parameters, in a common embedding space. Furthermore, the generalized ZSL (GZSL) problem is defined using disjoint sets of seen and unseen classes. The training set contains only the seen classes, and the task is to build a classifier function whose decision space covers both seen and unseen classes. This differs from the ZSL scenario, in which the decision space contains only the unseen classes. The most acute problem in the GZSL setup is the accuracy imbalance between seen and unseen classes. To measure and control the imbalance, three metrics are commonly used to assess the classification performance in the GZSL scenario: the Top-1 accuracy on the seen categories (s), the Top-1 accuracy on the unseen categories (u) and their harmonic mean, $H = 2us/(u+s)$. The contributions of this work can be characterized under the following two themes.
Instance-based training loss. Zero-shot learning approaches rely heavily on class-level modality alignment [30]. We propose a new composite loss function that balances instance-based pairwise image/text retrieval loss and the usual classifier loss. The retrieval loss term does not use class labels. We show that most of the GZSL accuracy can be extracted from the instance-based retrieval loss.
Metric rescaling. GZSL approaches suffer from imbalanced performance on seen and unseen classes [16]. Previous work proposed to use a heuristic trick, calibrated stacking [4] or calibration [5], to solve the problem. We provide a sound probabilistic justification for it.
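To make the role of the harmonic mean H concrete, the short sketch below computes it from the seen and unseen Top-1 accuracies. The function name `harmonic_mean` is hypothetical and used only for illustration; the formula is the standard harmonic mean of two numbers.

```python
def harmonic_mean(u: float, s: float) -> float:
    """Harmonic mean of the unseen (u) and seen (s) Top-1 accuracies."""
    if u + s == 0:
        return 0.0  # both accuracies are zero; avoid division by zero
    return 2.0 * u * s / (u + s)


# Two classifiers with the same average accuracy (50) but different balance.
balanced = harmonic_mean(50.0, 50.0)    # 50.0
imbalanced = harmonic_mean(10.0, 90.0)  # 18.0
```

The harmonic mean penalizes imbalance: a classifier that is strong on seen classes but weak on unseen ones scores far below its arithmetic average, which is why H is the headline GZSL metric.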
2 Proposed Method
Most approaches to joint representation learning rely on class labels to train a representation. For example, all the methods reviewed by Xian et al. [30] require access to class labels at train time. We hypothesise that in the fine-grained learning scenario, such as the one described by Reed et al. [20], a lot of information can be extracted simply from pairwise image/text co-occurrences. The class labels only become necessary when we define class prototypes, i.e. at zero-shot test time. Following this intuition, we define a framework based on projecting texts and images into a common space and then learning a representation based on a mixture of four loss functions: a pairwise text retrieval loss, a pairwise image retrieval loss, a text classifier loss and an image classifier loss (see Fig. 1 and Algorithm 1 in Appendix A). The framework enables us, among other things, to experiment with the effects of train-time availability of class labels on the quality of zero-shot representations.
The pairwise cross-modal loss function is based solely on the pairwise relationships between texts and images. Suppose we are given a metric on the embedding space, an image, and a collection of arbitrary texts sampled uniformly at random, exactly one of which belongs to the image. We propose the following model for the probability that an image and a text belong to the same object instance:
[TABLE]
The learning is then based on the cross-entropy log-loss defined on a batch:
[TABLE]
where the label is a binary indicator of the true match between an image and a text. Note that the expression above can be interpreted as a text retrieval loss. Exchanging the order of image and text in the probability model leads to the image retrieval loss. The two retrieval losses are mixed using a mixing parameter, as shown in Algorithm 1 in Appendix A. The pairwise retrieval loss functions are responsible for the modality alignment. In addition to those, we propose to include the usual image and text classifier losses, which are responsible for reducing the intraclass variability of representations. The classifier losses are added to the retrieval losses using a second mixing parameter, as shown in Algorithm 1 in Appendix A.
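The retrieval losses described above can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not confirmed by the text: the match probability is taken to be a softmax over negative Euclidean distances within the batch, and the two retrieval losses are combined as a convex combination. The names `retrieval_losses`, `mix` and `weight` are hypothetical.

```python
import numpy as np


def pairwise_distance(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))


def retrieval_losses(img_emb, txt_emb):
    """Return (text_retrieval_loss, image_retrieval_loss) for a batch.

    Row i of img_emb and row i of txt_emb are assumed to be a true pair;
    every other text (image) in the batch acts as a negative. No class
    labels are used.
    """
    logits = -pairwise_distance(img_emb, txt_emb)  # [i, j] = -d(image_i, text_j)
    idx = np.arange(logits.shape[0])
    # Text retrieval: for each image, softmax over the texts in the batch.
    text_loss = -np.mean(log_softmax(logits, axis=1)[idx, idx])
    # Image retrieval: for each text, softmax over the images in the batch.
    image_loss = -np.mean(log_softmax(logits, axis=0)[idx, idx])
    return text_loss, image_loss


def mix(first, second, weight):
    """Convex combination of two losses (the mixing form is an assumption)."""
    return weight * first + (1.0 - weight) * second


rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))  # texts embedded close to their images
text_loss, image_loss = retrieval_losses(img, txt)
retrieval_loss = mix(text_loss, image_loss, weight=0.5)
```

In a real training loop the embeddings would come from the trainable image and text branches, and the classifier losses would be mixed in the same way with a second weight.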
2.1 Balancing Accuracy for Seen and Unseen
Let us define the prototype of each class based on the set of texts belonging to that class. In GZSL, the nearest neighbor decision rule for a given image and its features has the following form:
[TABLE]
To formalize the problem, we first introduce the true class label of an image. Mathematically, the main GZSL pain point is that the probability of classifying an image from an unseen class as one of the seen classes is significantly greater than the probability of classifying an image from a seen class as one of the unseen classes. In other words, the problem is that a given image is more likely to be confused with one of the seen classes if it belongs to an unseen class than vice versa. We propose the following probabilistic representation of the event space for the decision rule in Equation (1):
[TABLE]
To balance these two error probabilities, we introduce a positive scalar and scale all the distances corresponding to the seen prototypes by a factor controlled by this scalar, giving rise to the scaled distance:
[TABLE]
The probability of misclassifying an image from an unseen class as a seen one, for the classifier based on the scaled distance, is then a monotone non-increasing function of the scaling parameter, and we can reduce it by increasing this parameter (please refer to Appendix B for a proof). Consider now the probability that an image from one of the seen classes is still classified as one of the seen classes. Using exactly the same chain of arguments as in Appendix B, we can show that this probability is also a non-increasing function of the scaling parameter. Hence the probability of misclassifying a seen-class image as unseen is a non-decreasing function of the scaling parameter. Therefore, we expect that by varying the scaling parameter we can balance the two error rates.
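The rescaled nearest-prototype rule can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: the scaling factor is assumed to be `1 + gamma` (so that `gamma = 0` recovers the original rule), distances are Euclidean, and the names `predict_rescaled`, `seen_mask` and `gamma` are hypothetical.

```python
import numpy as np


def predict_rescaled(img_emb, prototypes, seen_mask, gamma=0.0):
    """Nearest-prototype labels with seen-class distances scaled by (1 + gamma).

    img_emb:    (N, D) image embeddings
    prototypes: (C, D) class prototypes in the same space
    seen_mask:  (C,) boolean, True for classes seen during training
    """
    diff = img_emb[:, None, :] - prototypes[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)            # (N, C)
    scale = np.where(seen_mask, 1.0 + gamma, 1.0)   # (C,) assumed scaling form
    return np.argmin(dist * scale[None, :], axis=1)


# Toy example: prototype 0 is seen, prototype 1 is unseen.
prototypes = np.array([[0.0, 0.0], [1.0, 0.0]])
seen_mask = np.array([True, False])
query = np.array([[0.45, 0.0]])  # slightly closer to the seen prototype

pred_plain = predict_rescaled(query, prototypes, seen_mask, gamma=0.0)   # class 0
pred_scaled = predict_rescaled(query, prototypes, seen_mask, gamma=0.5)  # class 1
```

With `gamma = 0.5` the distance to the seen prototype grows from 0.45 to 0.675, which exceeds the 0.55 distance to the unseen prototype, so the borderline query moves to the unseen class.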
3 Related Work
ZSL approaches aim at recognizing objects belonging to classes unseen during training [14, 18]. This has been extended to the GZSL framework in which the decision space consists of both seen and unseen classes [22, 30]. The classical zero-shot approaches build a joint visual-semantic space, relying on a linear cross-modal compatibility function (e.g. dot-product between query embedding and semantic prototypes or a variation of a hinge loss) [7, 2, 1, 20]. Non-linear variants of the compatibility have also been explored [27, 22]. Extending previously proposed cross-modal transfer approaches based on auto-encoders [11] and cross-domain learning [9], a more recent line of work [21, 31, 32, 6, 23] relies on combining these approaches and their variations with dataset augmentation tools such as GAN [8] and VAE [12]. It is argued that the use of those tools helps to resolve one of the prominent problems in the GZSL scenario: classifying images from unseen classes as one of the seen classes. There exist approaches that try to tackle this same problem via temperature calibration [16], originally proposed by Hinton et al. [10]. Chao et al. [4] and Das et al. [5] proposed approaches to seen/unseen accuracy balancing that are very similar to ours, based on heuristic arguments. We extend this line of work here by providing a probabilistic justification for the balancing effect observed when applying metric rescaling. Atzmon et al. [3] propose a more sophisticated way to deal with seen/unseen imbalance via adaptive confidence smoothing and gating. In this work, we show that the simpler metric rescaling approach can still be used to achieve impressive results on the GZSL task.
4 Experimental Results
Datasets. We focus on learning embeddings for fine-grained visual descriptions and test them in the ZSL/GZSL scenarios. To test the quality of the trained embeddings, we use datasets that provide paired images and text descriptions: Caltech-UCSD-Birds (cub) [26] and Oxford Flowers (flowers) [17], which were augmented with textual descriptions by Reed et al. [20]. We use the GZSL splits proposed by Xian et al. [30]. Attribute-based datasets, such as SUN [19] and AWA [13], do not contain this information and are out of the scope of the current paper.
Architecture and training details. See Appendix D.
Our key empirical results are shown in Tables 1 and 2. Our results are based on the settings of the two loss mixing parameters and the metric rescaling parameter selected on the validation sets of the cub and flowers datasets. The combination of the proposed training method and the rebalancing of the metric space results in very impressive performance, especially taking into account the simplicity of our method. In the rest of the section we further analyze the stability with respect to the choices of the two mixing parameters and provide more details on the selection of the rescaling parameter.
The seen/unseen accuracy balancing. Fig. 2 confirms that H exhibits inverted U-shaped behavior as a function of the rescaling parameter on the validation sets of the cub and flowers datasets, as expected based on the results of Section 2.1. Once the value of the rescaling parameter is determined by maximizing H on the validation set, we train the representation on the full train+val subset and report results on the test split (the usual practice in GZSL). Validation set construction is detailed in Appendix C.
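The selection procedure just described, sweeping the rescaling parameter and keeping the value that maximizes H on validation data, can be sketched as follows. The sketch makes several assumptions: the scaling factor is `1 + gamma`, accuracy is computed per sample rather than averaged per class, and the toy distance matrix and the names `gzsl_metrics` and `select_gamma` are hypothetical.

```python
import numpy as np


def gzsl_metrics(dist, labels, seen_mask, gamma):
    """Return (u, s, H) for one value of gamma.

    dist:      (N, C) distances from each validation image to each prototype
    labels:    (N,) true class indices
    seen_mask: (C,) boolean, True for seen classes
    """
    scale = np.where(seen_mask, 1.0 + gamma, 1.0)  # assumed scaling form
    pred = np.argmin(dist * scale[None, :], axis=1)
    correct = pred == labels
    from_seen = seen_mask[labels]
    s = correct[from_seen].mean()    # accuracy on images of seen classes
    u = correct[~from_seen].mean()   # accuracy on images of unseen classes
    h = 0.0 if u + s == 0 else 2.0 * u * s / (u + s)
    return u, s, h


def select_gamma(dist, labels, seen_mask, grid):
    """Pick the grid value with the highest validation H."""
    scores = [gzsl_metrics(dist, labels, seen_mask, g)[2] for g in grid]
    return grid[int(np.argmax(scores))]


# Toy validation set: class 0 is seen, class 1 is unseen.
seen_mask = np.array([True, False])
labels = np.array([0, 0, 1, 1])
dist = np.array([[0.2, 0.8],
                 [0.3, 0.5],
                 [0.4, 0.5],   # unseen image that sits closer to the seen prototype
                 [0.5, 0.3]])

best_gamma = select_gamma(dist, labels, seen_mask, grid=[0.0, 0.5, 5.0])
```

On this toy data H is about 0.67 without rescaling, 1.0 at `gamma = 0.5`, and 0.0 at `gamma = 5.0`, a small-scale version of the inverted U shape: too little rescaling favors seen classes, too much pushes everything to unseen ones.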
Ablation studies. Fig. 3 studies the importance of the image and text retrieval losses. We see that all Top-1 accuracies (H, s, u) are stable over a range of the retrieval mixing parameter. Removing the text retrieval loss results in the most significant drop. Indeed, at the batch level, retrieving the correct text given an image is related to identifying the correct class encoded by a text prototype during the GZSL inference step. Fig. 4 studies the interplay between the retrieval and the classification losses. We again observe that there exists a reasonably stable range of the classifier mixing parameter. Using the classification losses alone results in a catastrophic performance drop: they do not enforce the necessary modality alignment.
Table 3 further studies the effects of different loss terms. The best result is achieved when all loss terms are active and when the metric rescaling is on (the last line in the table: both mixing parameters set to 0.5 and metric rescaling checked). Comparing this to the case with no metric rescaling (first line, rescaling set to 0.0), we see that the rescaling helps to greatly decrease the gap between seen and unseen classification accuracy, both on cub and flowers. Interestingly, we only use images and texts from the training set to achieve it. Going to the second line in the table (the image/text classification loss is inactive, its mixing parameter set to 0.0) and comparing it to the first one, we assess the effect of the image/text classification loss. It barely affects the performance on the unseen set, but it significantly boosts the classification accuracy on the seen set (around 8% on both datasets). However, it improves GZSL accuracy only when applied together with metric rescaling (please refer to lines 1 and 6 in Table 3). Our interpretation is that the image/text classifier loss reduces the intraclass variability and enforces tighter embedding clustering. Yet, this also leads to overfitting on the classification task. This is accounted for by metric rescaling, which enables the learnings from the image/text classification task to be transferred effectively into the GZSL task. Finally, an interesting observation can be made by comparing line 3 of Table 3 with the performance of the algorithms in Table 1. In this case our training relies only on retrieval losses computed without class labels, based solely on the pairwise relationships between texts and images. The learned representation is competitive against the latest GAN/VAE based approaches on cub and is state-of-the-art on flowers. We conclude that when very fine-grained modality outputs are available (image and text pairs being a very prominent example), high-quality representations may be learned without relying on manually supplied class labels.
5 Conclusions
We propose and empirically validate two contributions for learning fine-grained cross-modal representations. First, we confirm the hypothesis that in the context of paired images and texts, a deep metric learning approach can be driven by an instance-based retrieval loss resulting in impressive GZSL classification results. This demonstrates that high-quality deep representations can be trained relying largely on pairwise modality relationships. Second, we mathematically analyze and empirically validate a simple method of balancing seen/unseen accuracy in the GZSL task.
Appendix A Loss calculation algorithm
Appendix B The Analysis of Error Rates
First, define the probability of misclassifying unseen classes as seen classes for the classifier based on the scaled distance:
[TABLE]
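We show that this probability is no larger than that of the original decision rule. Let us define the smallest distance from the image to a seen prototype and the smallest distance from the image to an unseen prototype; then the equation above can be rewritten as: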
[TABLE]
Let us consider the probability of event and decompose it as follows:
[TABLE]
The transitions are based on the relationship between the probabilities of arbitrary events $A$ and $B$, $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, and in our case $A \cap B = \emptyset$. This implies that:
[TABLE]
We have just shown that for a non-negative scaling parameter the probability of misclassifying an image from an unseen class as one of the seen classes is smaller for the rescaled decision rule than for the original decision rule. In fact, we can make a stronger claim. Since the quantities involved are non-negative, the length of the interval increases as the scaling parameter increases, and hence the probability of falling in this interval is non-decreasing in the scaling parameter. Thus the misclassification probability is a monotone non-increasing function of the scaling parameter, and we can reduce it by increasing that parameter.
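The chain of arguments in this appendix can be written out compactly. The following is a sketch under assumptions, since the original equations are not reproduced here: $\gamma \ge 0$ denotes the scaling parameter, seen-prototype distances are assumed to be multiplied by $1 + \gamma$, $d_s$ and $d_u$ are hypothetical names for the smallest distance from the image to a seen and to an unseen prototype, and $P_\gamma$ is the probability of classifying an unseen-class image as seen.

```latex
\begin{align*}
P_\gamma
  &= \Pr\big[(1+\gamma)\, d_s < d_u \,\big|\, \text{true class is unseen}\big], \\
\{d_s < d_u\}
  &= \{(1+\gamma)\, d_s < d_u\} \,\cup\, \{d_s < d_u \le (1+\gamma)\, d_s\}
  \quad \text{(disjoint events)}, \\
P_0
  &= P_\gamma + \Pr\big[d_s < d_u \le (1+\gamma)\, d_s \,\big|\, \text{true class is unseen}\big]
  \;\ge\; P_\gamma, \\
\gamma_1 \le \gamma_2
  &\;\Longrightarrow\;
  \{(1+\gamma_2)\, d_s < d_u\} \subseteq \{(1+\gamma_1)\, d_s < d_u\}
  \;\Longrightarrow\; P_{\gamma_2} \le P_{\gamma_1}.
\end{align*}
```

The last line uses only $d_s \ge 0$: enlarging $\gamma$ can only shrink the event that the scaled seen distance is the smaller one, which is the monotonicity claimed above.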
Appendix C Constructing validation sets
The validation set is constructed by further splitting the train set on cub and flowers. For example, cub has a train set of 5875 images from 100 seen classes and a validation set of 2946 images from 50 unseen classes. We further divide the train set into 4700 train images from 100 seen classes, 1175 seen validation images (4700 + 1175 = 5875) and we use all the 2946 images from 50 classes as the unseen validation set.
Appendix D Architecture and Training Details
The text feature extractor is built by cascading two residual CNN blocks, followed by a BiLSTM. Each block has 3 convolutional/batch norm layers. The number of filters in the blocks is 128 and 256; the BiLSTM has 512 units in each of the forward and backward branches (1024 total). All variables in the convolutional stack (including the batch normalization scale and offset parameters) are L2-penalized. The image feature extractor is a ResNet-101 with fixed weights pretrained on the split of ImageNet proposed by Xian et al. [30]. In this work we use precomputed image features, available in [28] for cub and in [29] for flowers. Image and text features are projected into the common embedding space of size 1024 with FC layers and no non-linearity. The FC layers are preceded by a dropout of 0.25. The trainable components of the model are trained for 150k batches of size 32 using SGD with an initial learning rate that is annealed by a factor of 10 every 50k batches. For each batch, we sample 32 instances; each instance includes a vector of precomputed ResNet-101 features and the 10 text descriptions corresponding to it, according to the original dataset definition [20]. All 10 text descriptions are processed via the CNN/LSTM stack and the resulting embeddings are average pooled to create a vector representation of length 1024.
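The step-wise learning rate annealing described above can be sketched as a small helper. The initial learning rate is left as a required argument because its value is not given here; the function name `step_lr` and its parameter names are hypothetical.

```python
def step_lr(batch_idx: int, base_lr: float,
            decay_every: int = 50_000, factor: float = 10.0) -> float:
    """Learning rate after annealing by `factor` every `decay_every` batches."""
    return base_lr / (factor ** (batch_idx // decay_every))


# Over 150k batches the rate takes three values: base, base/10, base/100.
schedule = [step_lr(b, base_lr=1.0) for b in (0, 50_000, 100_000)]
```

The example uses `base_lr=1.0` purely as a placeholder so that the three stages are easy to read off.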
References
- [1] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. TPAMI, 2016.
- [2] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
- [3] Yuval Atzmon and Gal Chechik. Adaptive confidence smoothing for generalized zero-shot learning. In CVPR, 2019.
- [4] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV (2), pages 52–68, 2016.
- [5] Debasmit Das and C Lee. Zero-shot image recognition using relational matching, adaptation and calibration. In International Joint Conference on Neural Networks, 2019.
- [6] Rafael Felix, Vijay Kumar B G, Ian Reid, and Gustavo Carneiro. Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, 2018.
- [7] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
- [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
