Discriminative Embedding Autoencoder with a Regressor Feedback for Zero-Shot Learning

Ying Shi, Wei Wei, and Zhiming Zheng

arXiv:1907.08070·cs.CV·July 19, 2019

TL;DR

This paper introduces a discriminative autoencoder with regressor feedback for zero-shot learning, enhancing the discriminative features and semantic generalization to recognize unseen classes more effectively.

Contribution

It proposes a novel autoencoder model with regressor feedback that improves discriminative feature learning and generalization in zero-shot learning tasks.

Findings

01

Outperforms state-of-the-art models on four benchmark datasets.

02

Achieves significant improvements in generalized zero-shot learning.

03

Effectively learns discriminative features for object recognition.

Tables

Table 1: Dataset statistics for SUN, CUB, AWA1 and AWA2: granularity, number of attributes, number of classes in ${\cal Y}_{\cal S}$ and ${\cal Y}_{\cal U}$ , and number of images under the standard split (SS) and the proposed split (PS) at training and testing time.

Datasets	Granularity	Att	${\cal Y}_{\cal S}$ / ${\cal Y}_{\cal U}$	Total	Train SS (${\cal Y}_{\cal S}$)	Train PS (${\cal Y}_{\cal S}$)	Test SS (${\cal Y}_{\cal U}$)	Test PS (${\cal Y}_{\cal S}$ / ${\cal Y}_{\cal U}$)
SUN	fine	102	645/72	14340	12900	10320	1440	2580/1440
CUB	fine	312	150/50	11788	8855	7057	2933	1764/2967
AWA1	coarse	85	40/10	30475	24295	19832	6180	4958/5685
AWA2	coarse	85	40/10	37322	30337	23527	5985	5882/7913

Table 2: Zero-shot learning results on SUN, CUB, AWA1 and AWA2 using SS and PS with ResNet features. The results report top-1 accuracy in %.

Method	SUN SS	SUN PS	CUB SS	CUB PS	AWA1 SS	AWA1 PS	AWA2 SS	AWA2 PS
DAP[12]	38.9	39.9	37.5	40.0	57.1	44.1	58.7	46.1
IAP[12]	17.4	19.4	27.1	24.0	48.1	35.9	46.9	35.9
CONSE[8]	44.2	38.8	36.7	34.3	63.6	45.6	67.9	44.5
CMT[15]	41.9	39.9	37.3	34.6	58.9	39.5	66.3	37.9
SSE[16]	54.5	51.5	43.7	43.9	68.8	60.1	67.5	61.0
LATEM[6]	56.9	55.3	49.4	49.3	74.8	55.1	68.7	55.8
ALE[13]	59.1	58.1	53.2	54.9	78.6	59.9	80.3	62.5
DEVISE[35]	57.5	56.5	53.2	52.0	72.9	54.2	68.6	59.7
SJE[36]	57.1	53.7	55.3	53.9	76.7	65.6	69.5	61.9
ESZSL[14]	57.3	54.5	55.1	53.9	74.7	58.2	75.6	58.6
SYNC[18]	59.1	56.3	54.1	55.6	72.2	54.0	71.2	46.6
SAE[10]	42.4	40.3	55.8	33.3	80.7	53.0	80.8	54.1
Ours	64.3	62.4	56.1	53.9	81.0	79.8	81.2	78.5

Table 3: Generalized zero-shot learning results on SUN, CUB, AWA1 and AWA2 using the PS split. The reported value is the harmonic mean H (in %) of the top-1 accuracies on ${\cal Y}_{\cal S}$ and ${\cal Y}_{\cal U}$ .

Method	SUN	CUB	AWA1	AWA2
DAP[12]	7.2	3.3	0.0	0.0
IAP[12]	1.8	0.4	4.1	1.8
CONSE[8]	11.6	3.1	0.8	1.0
CMT[15]	11.8	12.6	1.8	1.0
SSE[16]	4.0	14.4	12.9	14.8
LATEM[6]	19.5	24.0	13.3	20.0
ALE[13]	26.3	34.4	27.5	23.9
DEVISE[35]	20.9	32.8	22.4	27.8
SJE[36]	19.8	33.6	19.6	14.4
ESZSL[14]	15.8	21.0	12.1	11.0
SYNC[18]	13.4	19.8	16.2	18.0
SAE[10]	11.8	13.6	3.5	2.2
Ours	48.3	50.7	76.1	75.3
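The harmonic mean H reported in Table 3 combines the seen-class and unseen-class top-1 accuracies. The sketch below shows the arithmetic only; the function name and the two accuracy values are hypothetical and are not taken from the paper, which reports H but not the underlying pair of accuracies.

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """H = 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen).

    Returns 0.0 when both accuracies are zero to avoid dividing by zero
    (a convention assumed here, not stated in the source).
    """
    denominator = acc_seen + acc_unseen
    if denominator == 0:
        return 0.0
    return 2.0 * acc_seen * acc_unseen / denominator


# Hypothetical accuracies in percent, chosen only to illustrate the formula.
h_balanced = harmonic_mean(50.0, 50.0)
h_skewed = harmonic_mean(80.0, 20.0)
print(h_balanced, h_skewed)
```

Because H is dominated by the smaller of the two accuracies, a model that does well on seen classes but poorly on unseen classes still receives a low score.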

Equations

$$\phi_{e}\left(x_{i}^{s}\right)=W_{e}^{T}*x_{i}^{s},\quad x_{i}^{s}\in{\cal X}_{\cal S},$$

$$L_{encoder}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\max\left(0,\,m+d_{1}-d_{2}\right),$$

$$\widehat{x}_{i}^{s}=\phi_{d}\left(\left[\phi_{e}\left(x_{i}^{s}\right),a_{i}^{s}\right];W_{d}\right),\quad\widehat{x}_{i}^{s}\in\widehat{\cal X}_{\cal S},$$

$$L_{reconstr}\left(x_{i}^{s},a_{i}^{s};\phi,W\right)=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|x_{i}^{s}-\widehat{x}_{i}^{s}\right\|^{2},$$

$$L_{reg\_sem}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|a_{i}^{s}-\phi_{reg}\left(\widehat{x}_{i}^{s}\right)\right\|^{2},$$

$$L_{reg\_dis}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|\phi_{e}\left(x_{i}^{s}\right)-\phi_{reg}\left(\widehat{x}_{i}^{s}\right)\right\|^{2}.$$

$$L_{reg}=L_{reg\_sem}+\lambda L_{reg\_dis},$$

$$L=L_{encoder}+\alpha L_{reconstr}+\beta L_{reg},$$

$$\widehat{x}_{j}^{u}=\phi_{d}\left(a_{j}^{u};W_{d}\right),\quad\widehat{x}_{j}^{u}\in\widehat{\cal X}_{\cal U},$$

$$y=\phi_{cls}\left(x_{j}^{u},\widehat{x}_{j}^{u},y_{j}^{u}\right),$$

$$acc_{avg}^{per-class}=\frac{1}{\left\|{\cal Y}\right\|}\sum_{i=1}^{\left\|{\cal Y}\right\|}\frac{N_{correct}^{\left(class-i\right)}}{N_{total}^{\left(class-i\right)}},$$

$$H=\frac{2\times acc_{{\cal Y}_{\cal S}}\times acc_{{\cal Y}_{\cal U}}}{acc_{{\cal Y}_{\cal S}}+acc_{{\cal Y}_{\cal U}}},$$

Full text

Abstract

Zero-shot learning (ZSL) aims to recognize novel object categories using the semantic representation of categories, and the key idea is to explore how a novel class is semantically related to the familiar classes. Typical models learn an embedding between the image feature space and the semantic space, but it is also important to learn discriminative features and to incorporate coarse-to-fine image feature and semantic information. In this paper, we propose a discriminative embedding autoencoder with a regressor feedback model for ZSL. The encoder learns a mapping from the image feature space to the discriminative embedding space, which regulates both inter-class and intra-class distances between the learned features by a margin, making the learned features discriminative for object recognition. The regressor feedback learns to map the reconstructed samples back to the discriminative embedding and the semantic embedding, assisting the decoder in improving the quality of the samples and in generalizing to the unseen classes. The proposed model is validated extensively on four benchmark datasets: SUN, CUB, AWA1 and AWA2. The experimental results show that the proposed model outperforms the state-of-the-art models, with especially significant improvements in generalized zero-shot learning (GZSL).

1 Introduction

Humans can distinguish approximately 30,000 basic object categories[1] and many more subordinate ones, e.g., breeds of dogs and many combinations of attributes and objects. Importantly, humans are very good at recognizing objects without having seen any visual samples of them. In machine learning, this is framed as the problem of zero-shot learning (ZSL). ZSL has gained popularity in object recognition and can be used in a variety of research areas, such as neural decoding from fMRI images, face verification, object recognition, video understanding and natural language processing[2]. Traditional object recognition models predict labels of object classes that already exist in the training set, whereas zero-shot learning aims to build a model that recognizes object classes from categories never seen before. Therefore, in the ZSL task, the seen classes in the training set and the unseen classes in the test set are disjoint. The main challenge of zero-shot learning is how to generalize the model to identify novel object classes without any labelled samples of these categories. Ideally, it would replicate the human ability to recognize objects from a few images or even from a semantic description[3].

The key idea of zero-shot learning is to explore the knowledge of how an unseen class is semantically related to the seen classes[4]. An example of ZSL is illustrated in Figure 1. Seen and unseen classes are usually related in a high-dimensional vector space, called the semantic space[5], where the knowledge from seen classes can be transferred to unseen classes. Semantic representations of the categories (e.g., semantic attribute annotations[6], text descriptions of the categories[7], or semantic word vectors of the class names[8]) are required to share information between classes so that the knowledge learned from seen classes is transferred to unseen classes. Given a description of the categories, each class name can be represented by an attribute vector or a semantic word vector. The semantic relationship between classes can be measured by a distance, e.g., the semantic representations of zebra and horse should be close to each other. One popular semantic representation is attributes, i.e., shared and nameable image properties of objects[4], which are encoded in a high-dimensional vector space. In this work, we focus on ZSL with attributes.

Typically, ZSL models learn a mapping function from the image feature space to a semantic embedding space using labelled training data consisting of seen classes only; nearest neighbour (NN) search is then performed in the projected semantic space, where the test image feature is assigned the label of the nearest unseen class[5]. Existing ZSL models focus on introducing linear or nonlinear mechanisms and on utilizing various optimization objectives to learn the mapping between image features and semantics.
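The nearest-neighbour step described above can be sketched as follows. This is a minimal illustration rather than any particular model's implementation: it assumes the test feature has already been projected into the semantic space by some learned mapping, and the function names, class names and three-dimensional attribute vectors are made up for the example.

```python
from typing import Dict, Sequence


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_unseen_class(projected: Sequence[float],
                         unseen_attributes: Dict[str, Sequence[float]]) -> str:
    """Return the unseen class whose attribute vector lies closest to the
    projected test feature (hypothetical helper; squared L2 is an assumed
    choice of distance)."""
    return min(
        unseen_attributes,
        key=lambda name: squared_euclidean(projected, unseen_attributes[name]),
    )


# Toy attribute vectors for two unseen classes (illustrative values only).
unseen_attributes = {
    "zebra": [1.0, 0.0, 1.0],
    "whale": [0.0, 1.0, 0.0],
}
projected_feature = [0.9, 0.1, 0.8]  # stands in for the output of a learned mapping
predicted = nearest_unseen_class(projected_feature, unseen_attributes)
print(predicted)
```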

However, the final goal of ZSL is to classify the unseen classes. Therefore, the image features and the semantic embeddings should be discriminative enough to distinguish different objects. Moreover, existing models mostly suffer from the projection domain shift problem[9]: if the projection of the image features is learned only from the seen classes, the projections of unseen-class image features are likely to be shifted due to the bias towards the seen classes. The shifted projections can lie far from the correct unseen-class embeddings[10].

To address these issues, we propose a discriminative embedding autoencoder with a regressor feedback model for ZSL in both the image feature and semantic embedding space. Our contributions are three-fold:

•

The discriminative embeddings cluster samples of the same class and separate samples of different classes by a margin, which preserves the discriminative information of the image features. The encoder acts as the discriminator.

•

The regressor feedback acts as the generator’s regularizer to ensure that the generated samples are representative and accurate. It assists the decoder in recovering sufficient information contained in the image features and semantic embeddings to reconstruct the best image features.

•

Experimental results on public benchmark datasets validate the effectiveness of the proposed model; in particular, the accuracy is significantly improved in generalized zero-shot learning (GZSL).

The remainder of this paper is organized as follows. Section 2 summarizes the related work on zero-shot models, the autoencoder structure, and generalized zero-shot learning. Section 3 introduces the architecture of our proposed model, our motivation, and each component of the model. Section 4 describes the experiments, compares our model with existing methods, and analyzes its performance in the ZSL and GZSL settings. Finally, Section 5 concludes the paper and discusses future work.

2 Related Work

Early work on zero-shot learning makes use of attributes in a two-stage approach that first trains separate attribute classifiers and then recognizes an image by comparing its predicted attributes with those of the unseen classes[11]. For instance, the DAP model[12] predicts the posterior of each attribute and then calculates the class posteriors with a maximum a posteriori estimate. The IAP model[12] first predicts the class posteriors of the seen classes, and then uses the probability of each class to calculate the attribute posteriors of an image. In these methods, each attribute classifier is trained individually and the relationships between the attributes of a class are not considered[11].

2.1 Linear and Nonlinear Embedding Models

Recent advances in zero-shot learning typically learn an embedding from the image feature space to the semantic space via a linear parameterized mapping. During testing, the semantic vector of an image from an unseen class is predicted and the nearest class is assigned. The ALE[13] learns a bilinear compatibility function between the image and the attribute space using a ranking loss. The ESZSL[14] uses the square loss to learn the embedding and explicitly regularizes the objective. The SCoRe[3] adds a semantically consistent regularization to make the learned mapping perform better on test images. The SAE[10] uses a linear semantic autoencoder whose decoder acts as an additional constraint on the mapping by reconstructing the original image features.

In addition, nonlinear compatibility mapping models have also been proposed. The LATEM[6] learns a piecewise compatibility model, which yields a nonlinear compatibility function, and the CMT[15] trains a neural network with two hidden layers to learn a nonlinear mapping from the image feature space to the word2vec space. The DEM[5] argues that the image feature space is more discriminative than the semantic space, and thus proposes an end-to-end deep embedding model that maps from the semantic space into the image feature space.

2.2 Embedding into a Common Intermediate Space

Another direction of zero-shot learning embeds the image features and the semantic representations into a common intermediate space. The JLSE[16] maps the image features and the semantic space into two separate latent spaces, and measures their similarity by learning another bilinear compatibility function. The LAD[17] proposes to learn a latent attribute space, which is not only discriminative but also semantic-preserving. The SYNC[18] constructs the classifiers of unseen classes by taking linear combinations of base classifiers, which are trained in a discriminative learning framework. Annadani et al.[19] capture semantic relations defined on the categories themselves to learn the intermediate embedding space.

Different from these methods, our proposed model directly regulates both inter-class and intra-class distances between the learned features to obtain the discriminative embedding. The discriminative feature space and the semantic space are jointly embedded into the common intermediate space.

2.3 Generative Models

A few generative models represent each class as a probability distribution. The GFZSL[20] treats each class-conditional distribution as a Gaussian and learns a regression function that maps a class embedding into the latent space. The GLAP[21] assumes that each class-conditional distribution follows a Gaussian and generates virtual samples of unseen classes from the learned distribution. Mukherjee et al.[22] learn a multimodal mapping where semantic and image embeddings of classes are both represented by Gaussian distributions. Bucher et al.[23] adopt a generative model for data augmentation of unseen classes and use these samples to train a classification model.

2.4 The Autoencoder Structure

Autoencoders are used for classification based on the assumption that higher-dimensional features are better for classification[24]. The SAE[10] model is a semantic autoencoder whose decoder imposes an additional constraint on learning the visual-to-semantic mapping. This is very effective in mitigating the domain shift problem: although the visual appearance of attributes may change from seen classes to unseen classes, the demand for a more truthful reconstruction of the visual features generalizes across the seen and unseen domains, which makes the learned projection function less susceptible to domain shift[10]. Similarly, in our model, the encoder maps the image features to the semantic embedding space and the decoder reconstructs the original image features so as to recover the image feature and semantic information. In contrast to SAE, our model learns discriminative features in the embedding space, and a regressor feedback on the decoder keeps the reconstructed image features truthful and representative. At test time, we use the decoder to generate the reconstructed unseen image features and then train an SVM classifier.

Zero-shot learning has been restricted by the strong assumption that the images to be classified can only come from unseen classes. Therefore, generalized zero-shot learning was proposed in [25] to extend zero-shot learning to the case where both seen and unseen classes appear during testing. Chao et al. [25] showed that it is nontrivial and ineffective to directly extend current zero-shot learning approaches to generalized zero-shot learning. This generalized setting, being more practical, is recommended as the evaluation setting for zero-shot learning [2]. We evaluate our model on the four benchmark datasets with SS and PS[4] for the two settings.

3 Proposed Approach

3.1 Problem Definition

In zero-shot learning (ZSL), the set of training classes (also called seen classes) is defined as ${\cal S}\equiv\left\{{\left({x_{i}^{s},y_{i}^{s}}\right)}\right\}_{i=1}^{{n_{s}}}$ , where $x_{i}^{s}\in{{\cal X}_{\cal S}}$ is the i-th image feature of the seen classes, $y_{i}^{s}\in{{\cal Y}_{\cal S}}$ is its corresponding class label, and ${{n_{s}}}$ is the number of image features of the seen classes. The set of test classes (also called unseen classes) is defined as ${\cal U}\equiv\left\{{\left({x_{j}^{u},y_{j}^{u}}\right)}\right\}_{j=1}^{{n_{u}}}$ , where $x_{j}^{u}\in{{\cal X}_{\cal U}}$ is the j-th image feature of the unseen classes, $y_{j}^{u}\in{{\cal Y}_{\cal U}}$ is its label, and ${{n_{u}}}$ is the number of image features of the unseen classes. The seen and unseen classes are disjoint, i.e., ${{\cal Y}_{\cal S}}\cap{{\cal Y}_{\cal U}}=\emptyset$ . We work in the image feature space instead of the image space. A key component of zero-shot learning is the semantic embedding of the class labels. In this work, the class semantic embeddings are represented by attribute vectors, although distributed word representations of the class names such as word2vec [26] have also been used as semantic embeddings. The attributes of the seen and unseen classes are denoted as ${{\cal A}_{\cal S}}\equiv\left\{{a_{i}^{s}}\right\}_{i=1}^{{c_{s}}}$ and ${{\cal A}_{\cal U}}\equiv\left\{{a_{j}^{u}}\right\}_{j=1}^{{c_{u}}}$ , where $a_{i}^{s}$ and $a_{j}^{u}$ are the attribute vectors of the i-th seen class and the j-th unseen class, and ${{c_{s}}}$ and ${{c_{u}}}$ are the numbers of attribute vectors for the seen and unseen classes, respectively. At test time, given a test image feature ${x^{u}}$ and the attributes $a^{u}$ of the test classes, the goal of ZSL is to predict the correct class of ${x^{u}}$ without a classifier trained on real samples of the unseen classes.

3.2 Model Architecture

The framework of our model is shown in Figure 2. Our model consists of four components: 1) The image features ${{\cal X}_{\cal S}}$ are encoded into the discriminative embeddings, which have the same dimension as the semantic embeddings ${{\cal A}_{\cal S}}$ . 2) The discriminative embeddings and the semantic embeddings ${{\cal A}_{\cal S}}$ are concatenated and decoded to reconstruct the original image features ${{\cal X}_{\cal S}}$ . 3) The reconstructed image features ${\widehat{\cal X}_{\cal S}}$ are mapped back to the corresponding semantic embeddings ${{\cal A}_{\cal S}}$ and the discriminative embeddings, providing feedback to the decoder. 4) Each unseen class vector $a_{j}^{u}$ is fed to the decoder to generate reconstructed unseen-class data, which are used to classify the unseen classes. An autoencoder is one realisation of the encoder-decoder paradigm. In our model, the encoder acts as a discriminator and the decoder acts as a generator. The autoencoder is responsible for generating the image features. A more truthful reconstructed image feature generalizes across the seen classes ${{\cal Y}_{\cal S}}$ and the unseen classes ${{\cal Y}_{\cal U}}$ , which builds a bridge between the classes.

3.3 Motivation

The autoencoder aims to produce reconstructed image features that recover sufficient semantic information and discriminative features, and thereby generalize well to the unseen classes. To this end, we propose the discriminative embedding and the regressor feedback, which are detailed in Sections 3.5 and 3.7. On the one hand, the discriminative embeddings are learned by a nonlinear dense network with the triplet loss[27], so the learned features preserve the discriminative information. On the other hand, the discriminative embeddings and semantic embeddings are concatenated to train the generator. The regressor feedback acts as the generator’s regularizer, mapping the output of the generator back to the semantic embeddings and the discriminative embeddings, so that the generated samples contain sufficient discriminative and semantic information. The recurrent structure realizes a coarse-to-fine process over the iterations. The generator is then used to generate unseen image features for training a classifier.

In other words, the discriminative embeddings and semantic embeddings may be correlated. Fusing them helps the generator reconstruct image features that are representative and accurate for the corresponding class, and thus transfers relationship knowledge from the seen classes to the unseen classes. This is also why two copies of the same unseen semantic embedding are concatenated during testing, which forms a valid input to the generator.

3.4 Encoder

The image features ${{\cal X}_{\cal S}}$ are passed through a nonlinear dense network to obtain the discriminative embeddings:

$$\phi_{e}\left(x_{i}^{s}\right)=W_{e}^{T}*x_{i}^{s},\quad x_{i}^{s}\in{\cal X}_{\cal S},$$

where ${\phi_{e}}\left({x_{i}^{s}}\right)$ is the discriminative embedding and the output of the last dense layers, $*$ denotes a set of operations of the encoder and ${{W_{e}}}$ indicates the overall parameters of the encoder. The image feature $x_{i}^{s}$ is projected into the discriminative embedding ${\phi_{e}}\left({x_{i}^{s}}\right)$ , which generates discriminative image features.

The image features ${{\cal X}_{\cal S}}$ as the input of the encoder pass through two hidden layers, followed by a dense layer with the linear activation. The discriminative embeddings ${\phi_{e}}\left({{{\cal X}_{\cal S}}}\right)$ as the output have the same dimension as the semantic embedding vector.

3.5 Discriminative Embedding

Most embedding models for the ZSL problem are based on the semantic embeddings. However, the semantic embedding is of limited size and not discriminative. To address this issue, we introduce the discriminative embeddings, which map the image features from a high-dimensional space to a low-dimensional one and make the learned features discriminative for object recognition.

We want to ensure that an image feature $x_{i}^{s}$ (anchor) of a specific class is closer to every other image feature $x_{k}^{s}$ (positive) of the same class than to any image feature $x_{j}^{s}$ (negative) of any other class. The triplet loss[27] minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity[28]. We use the triplet loss to learn the discriminative embeddings by regulating the inter-class and intra-class distances between the learned features:

$$L_{encoder}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\max\left(0,\,m+d_{1}-d_{2}\right),$$

where ${d_{1}}=d\left({{\phi_{e}}\left({x_{i}^{s}}\right),{\phi_{e}}\left({x_{k}^{s}}\right)}\right)$ , in which ${{\phi_{e}}\left({x_{i}^{s}}\right)}$ and ${{\phi_{e}}\left({x_{k}^{s}}\right)}$ are the learned image features from the same class, and ${d_{2}}=d\left({{\phi_{e}}\left({x_{i}^{s}}\right),{\phi_{e}}\left({x_{j}^{s}}\right)}\right)$ , in which ${{\phi_{e}}\left({x_{i}^{s}}\right)}$ and ${{\phi_{e}}\left({x_{j}^{s}}\right)}$ are from different classes. $d\left({x,y}\right)$ is the squared Euclidean distance between $x$ and $y$ . The inter-class distance ${d_{2}}$ should exceed the intra-class distance ${d_{1}}$ by at least a margin $m>0$ .
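As a rough illustration of the hinge term max(0, m + d1 − d2), the sketch below evaluates the triplet loss on two hand-made triplets of already-embedded features. How triplets are sampled is not specified in the text, so the code simply assumes they are given; the function names and the toy vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_euclidean(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance d(u, v), as used for d1 and d2."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors: Sequence[Vector],
                 positives: Sequence[Vector],
                 negatives: Sequence[Vector],
                 margin: float) -> float:
    """Mean of max(0, margin + d1 - d2) over pre-formed triplets.

    d1: anchor-to-positive distance (same class).
    d2: anchor-to-negative distance (different class).
    """
    if not (len(anchors) == len(positives) == len(negatives)) or not anchors:
        raise ValueError("need equally many anchors, positives and negatives")
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        d1 = squared_euclidean(a, p)
        d2 = squared_euclidean(a, n)
        total += max(0.0, margin + d1 - d2)
    return total / len(anchors)


# Toy embeddings (illustrative values only).
anchors = [[0.0, 0.0], [1.0, 1.0]]
positives = [[0.0, 1.0], [1.0, 1.0]]
negatives = [[3.0, 0.0], [1.0, 1.5]]
loss = triplet_loss(anchors, positives, negatives, margin=1.0)
print(loss)
```

In this toy case the first triplet already satisfies the margin and contributes nothing, while the second has its negative too close to the anchor and is penalized.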

In the embedding space, the semantic embeddings ${{\cal A}_{\cal S}}$ and the discriminative embeddings are concatenated, which allows the generator to learn not only from the semantic embeddings ${{\cal A}_{\cal S}}$ but also from the image features ${{\cal X}_{\cal S}}$ . The two embeddings contain meaningful and complementary information, and fusing them can potentially improve the quality of the reconstructed image features.

3.6 Decoder

The decoder acts as the generator, whose mapping must be able to reconstruct the original image features. It is expected to preserve sufficient semantic and discriminative information to reconstruct high-quality, class-specific image features. Specifically, the discriminative embeddings ${\phi_{e}}\left({{{\cal X}_{\cal S}}}\right)$ concatenated with the semantic embeddings ${{\cal A}_{\cal S}}$ are projected to the image features ${{\cal X}_{\cal S}}$ . The reconstructed image feature is given by:

$$\widehat{x}_{i}^{s}=\phi_{d}\left(\left[\phi_{e}\left(x_{i}^{s}\right),a_{i}^{s}\right];W_{d}\right),\quad\widehat{x}_{i}^{s}\in\widehat{\cal X}_{\cal S},$$

where $\widehat{x}_{i}^{s}$ denotes the i-th reconstructed image feature of the seen classes, ${\phi_{d}}$ denotes the output of the last dense layer and ${{W_{d}}}$ indicates the overall parameters of the decoder. ${\widehat{\cal X}_{\cal S}}$ denotes the reconstructed image feature space.

We observed that a decoder with two hidden layers quickly overfits to the seen classes, so we use one hidden layer with the LeakyReLU [30] activation, followed by a dense layer with a linear activation. The input is the concatenation of the semantic embeddings ${{\cal A}_{\cal S}}$ and the discriminative embeddings ${\phi_{e}}\left({{{\cal X}_{\cal S}}}\right)$ , and the output is the reconstructed image feature space ${\widehat{\cal X}_{\cal S}}$ .

The encoder and the decoder are linked together by the discriminative embeddings. During training, the image feature ${x_{i}^{s}}$ is the input of the encoder, the reconstructed image feature ${\widehat{x}_{i}^{s}}$ is the decoder’s output, and the semantic embedding $a_{i}^{s}$ is the intermediate condition. The reconstruction objective function becomes:

$$L_{reconstr}\left(x_{i}^{s},a_{i}^{s};\phi,W\right)=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|x_{i}^{s}-\widehat{x}_{i}^{s}\right\|^{2},$$

where $\phi$ denotes the mapping from ${x_{i}^{s}}$ to ${\widehat{x}_{i}^{s}}$ and $W$ is the overall parameters of the decoder. It is necessary that the output of the decoder can reconstruct the image feature.

3.7 Regressor Feedback

The feedback mechanism[29] allows the network to carry high-level information back to previous layers and refine low-level encoded information. It reroutes the output back into the model to produce the best output in each iteration. Motivated by this, we apply the feedback mechanism to our model architectures, as shown in Figure 3.

Our model includes a mapping from the decoder’s output ${\widehat{\cal X}_{\cal S}}$ to the semantic embeddings ${{\cal A}_{\cal S}}$ and the discriminative embeddings. The output of the decoder at each iteration flows into the next iteration to modulate the input. This mapping is a multivariate regression network that is learned jointly with the rest of the model rather than as an independent part. The recurrent structure is trained to produce better unseen-class samples at each iteration: coarse samples, which capture fewer features of the class, are generated in the first iterations, and finer ones are obtained as the iterations proceed. As a result, the generator with feedback produces samples that can be discriminated easily.

Regressor Feedback Network. The regressor maps the generator’s output back to the semantic embeddings and the discriminative embeddings, ensuring that the generator can recover sufficient discriminative and semantic information and generalize to the unseen classes. Following the output of the generator/decoder, the reconstructed image features ${\widehat{\cal X}_{\cal S}}$ enter one hidden layer through a feedback connection. We use the regressor to improve the generator, and it is learned using two sources of data. First, the reconstructed image feature ${\widehat{x}_{i}^{s}}$ is mapped to the semantic embedding, which defines a semantic loss:

$$L_{reg\_sem}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|a_{i}^{s}-\phi_{reg}\left(\widehat{x}_{i}^{s}\right)\right\|^{2},$$

where ${{\phi_{reg}}}$ represents the regressor mapping, i.e., the feedback connection. The reconstructed image feature ${\widehat{x}_{i}^{s}}$ maps to the discriminative embedding, and we can define a discriminative loss, given by:

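$$L_{reg\_dis}=\frac{1}{n_{s}}\sum_{i=1}^{n_{s}}\left\|\phi_{e}\left(x_{i}^{s}\right)-\phi_{reg}\left(\widehat{x}_{i}^{s}\right)\right\|^{2}.$$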

The overall training objective of the regressor is defined as the following weighted combination of the above two:

$$L_{reg}=L_{reg\_sem}+\lambda L_{reg\_dis},$$

where $\lambda$ is a weighting coefficient that controls the relative importance of the two terms. The feedback mechanism allows the network to carry ${\widehat{x}_{i}^{s}}$ back to previous layers and refine the information encoded in ${a_{i}^{s}}$ and ${{\phi_{e}}\left({x_{i}^{s}}\right)}$ . It reroutes the output back into the model to improve the quality of the reconstructed data that can be used to train the final classifier.
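The two regressor terms and their weighted sum can be sketched as below. The text uses the same symbol for the regressor output in both terms, so this sketch assumes that the regressor emits a vector of length D+D whose first half is compared with the attribute vector and whose second half is compared with the discriminative embedding. That split, the function names and the toy numbers are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_squared_norm(targets: Sequence[Vector],
                      predictions: Sequence[Vector]) -> float:
    """(1/n) * sum_i ||t_i - p_i||^2 over paired vectors."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("need equally many targets and predictions")
    total = 0.0
    for t, p in zip(targets, predictions):
        total += sum((a - b) ** 2 for a, b in zip(t, p))
    return total / len(targets)


def split_regressor_output(output: Vector, d: int) -> Tuple[List[float], List[float]]:
    """Assumed layout: first d values target the attributes, last d values
    target the discriminative embedding."""
    if len(output) != 2 * d:
        raise ValueError("regressor output must have length 2 * d")
    return list(output[:d]), list(output[d:])


def regressor_loss(attributes: Sequence[Vector],
                   embeddings: Sequence[Vector],
                   regressor_outputs: Sequence[Vector],
                   lam: float) -> float:
    """L_reg = L_reg_sem + lam * L_reg_dis."""
    d = len(attributes[0])
    halves = [split_regressor_output(out, d) for out in regressor_outputs]
    sem_predictions = [sem for sem, _ in halves]
    dis_predictions = [dis for _, dis in halves]
    l_sem = mean_squared_norm(attributes, sem_predictions)
    l_dis = mean_squared_norm(embeddings, dis_predictions)
    return l_sem + lam * l_dis


# One toy sample with D = 2 (illustrative values only).
loss = regressor_loss(
    attributes=[[1.0, 0.0]],
    embeddings=[[0.0, 2.0]],
    regressor_outputs=[[0.5, 0.0, 0.0, 1.0]],
    lam=0.5,
)
print(loss)
```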

3.8 Full Objective

Combining the objective functions introduced above, the full objective of our proposed model is:

$$L=L_{encoder}+\alpha L_{reconstr}+\beta L_{reg},$$

where $\alpha$ and $\beta$ are trade-off parameters for different objectives. We minimise the objective function to estimate the parameters of our model.

The encoder preserves the discriminative information of the image feature and the decoder with an additional regressor regularizer generates data that are highly discriminative in nature, as guided by the semantic embeddings and the discriminative embeddings. The autoencoder mechanism aims to generate the reconstructed data containing sufficient semantic and discriminative information that can provide a generalization to unseen classes.

3.9 ZSL Prediction

The semantic embedding vectors ${a_{j}^{u}}$ of the unseen classes are known. For classification, the generator of our trained model maps the semantic embedding of each unseen class to the corresponding reconstructed unseen image feature:

$$\widehat{x}_{j}^{u}=\phi_{d}\left(a_{j}^{u};W_{d}\right),\quad\widehat{x}_{j}^{u}\in\widehat{\cal X}_{\cal U},$$

where ${\widehat{\cal X}_{\cal U}}$ represents the reconstructed unseen image feature space. Once the data are generated, we use the reconstructed unseen image features $\widehat{x}_{j}^{u}$ and their labels $y_{j}^{u}$ to train a classifier for the unseen classes. In this work, an SVM classifier is used to predict the unseen class label, and the accuracy score is used to evaluate the prediction:

$$y=\phi_{cls}\left(x_{j}^{u},\widehat{x}_{j}^{u},y_{j}^{u}\right),$$

where ${{\phi_{cls}}}$ denotes the output of the SVM classifier. Finally, the most matched unseen class is selected.

4 Experiments

4.1 Datasets

Our proposed model is evaluated on ZSL benchmark datasets: SUN Attribute (SUN)[31], Caltech-UCSD Birds 200-2011 (CUB)[32], Animals with Attributes 1 (AWA1)[33], Animals with Attributes 2 (AWA2)[2]. Details of these datasets are listed in Table 1.

It is observed in [4] that some of the test classes in the standard splits (SS) of the datasets are among the ImageNet[34] classes. Hence, features extracted from an ImageNet-trained model will not reflect true zero-shot performance, although the SS has been widely used by recent zero-shot learning models. Xian et al. [4] propose a new dataset split, the proposed split (PS), which ensures that none of the test classes overlaps with the ImageNet classes. The differences between SS and PS are shown in Table 1.

The image features are the 2048-dim top-layer pooling units of the 101-layered ResNet pre-trained on ImageNet[4]. All methods are evaluated with the published image features. For the semantic embeddings, we use the class attributes provided with the datasets, taking continuous values between 0 and 1, which perform better than binary attributes[4]. The semantic embedding is a word2vec trained on Wikipedia provided by [18]. We use the average per-class top-1 accuracy[4] as the evaluation criterion. It is defined as follows:

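$$acc_{\cal Y}=\frac{1}{\left\|{\cal Y}\right\|}\sum\limits_{i=1}^{\left\|{\cal Y}\right\|}{\frac{N_{correct}^{\left({class-i}\right)}}{N_{total}^{\left({class-i}\right)}}}$$

where $\left\|{\cal Y}\right\|$ is the number of evaluated classes, and ${N_{correct}^{\left({class-i}\right)}}$ and ${N_{total}^{\left({class-i}\right)}}$ represent the number of correct predictions for the i-th class and the total number of samples of the i-th class, respectively.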
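
As a small illustration of this criterion, the following sketch computes the average per-class top-1 accuracy from label lists. The function name and the toy labels are ours; the example shows how the measure differs from plain accuracy when classes are imbalanced.

```python
from collections import defaultdict


def per_class_top1_accuracy(y_true, y_pred):
    """Average over classes of (correct predictions / total samples) per class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Class 0 has four samples (all correct); class 1 has one sample (wrong).
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]

per_class = per_class_top1_accuracy(y_true, y_pred)                  # 0.5
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.8
```

Each class contributes equally to the average, so a sparsely populated class that is misclassified lowers the score as much as a densely populated one.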

Implementation Details. Our proposed model is composed of 8 dense layers with output channel numbers $2048\to 1024\to 512\to D+D\to 1024\to 2048\to 1024\to D+D$ , where $D$ represents the dimension of the semantic embedding vector and the last two layers represent the feedback network. All activation functions are LeakyReLU [39] with a negative slope of 0.2, except the outputs of the encoder, the decoder and the regressor, which are linear. The mean square error loss is used to reduce the discrepancy between the vectors.
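
To make the width chain concrete, the sketch below instantiates it for a given $D$ and defines the LeakyReLU activation with slope 0.2. It only restates the numbers given in the text; the helper names are hypothetical, and $D=85$ (the number of AWA attributes) is used purely as an example value.

```python
def width_chain(sem_dim, feat_dim=2048):
    """Layer widths in the order given in the text, with D + D = 2 * sem_dim."""
    joint = sem_dim + sem_dim  # discriminative + semantic embedding
    return [feat_dim, 1024, 512, joint, 1024, feat_dim, 1024, joint]


def leaky_relu(x, negative_slope=0.2):
    """LeakyReLU for a scalar input."""
    return x if x >= 0 else negative_slope * x


chain = width_chain(85)  # example value of D
# Consecutive (input width, output width) pairs along the chain.
transitions = list(zip(chain, chain[1:]))
# The last two widths in the chain belong to the feedback network.
feedback_widths = chain[-2:]
```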

For the discriminative embeddings, it is crucial to select hard triplets that are active and can help improve the model. We adopt the following strategy to train with the triplet loss[28]: for each batch, we first identify the positive (same-class) and negative (different-class) samples; each image feature is then considered as the anchor, and its hardest positive image feature, which attains $\max{d_{1}}$, and its hardest negative image feature, which attains $\min{d_{2}}$, are selected to calculate the loss. The discriminative embeddings and the semantic embeddings have the same dimension, and they are jointly embedded into the intermediate embedding space; that is, the input dimension of the generator is twice the semantic embedding dimension.
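
The hard-triplet selection described above can be sketched as follows. This is a minimal NumPy illustration, assuming Euclidean distances and the usual hinge form $\max(0, d_{1}-d_{2}+m)$ with a margin $m$; the function name, the margin values and the toy batch are ours and are not taken from the paper.

```python
import numpy as np


def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean hinge loss over anchors using the hardest positive and negative."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same class, not self
    neg_mask = ~same                                    # different class

    # Hardest positive: largest d1; hardest negative: smallest d2.
    d1 = np.where(pos_mask, dist, -np.inf).max(axis=1)
    d2 = np.where(neg_mask, dist, np.inf).min(axis=1)

    # Skip anchors that have no positive or no negative in the batch.
    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0
    return float(np.maximum(0.0, d1[valid] - d2[valid] + margin).mean())


batch = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]])
batch_labels = np.array([0, 0, 1, 1])

loss_small_margin = batch_hard_triplet_loss(batch, batch_labels, margin=1.0)
loss_large_margin = batch_hard_triplet_loss(batch, batch_labels, margin=5.0)
```

With the small margin the two classes are already separated and no triplet is active; with the large margin every anchor contributes to the loss.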

4.2 Experimental Results

The results of the zero-shot learning experiment are given in Table 2. Both SS and PS are used on the SUN, CUB, AWA1 and AWA2 datasets, and the average per-class top-1 accuracy is reported. Compared with 12 other methods, our proposed model achieves an improvement over the state of the art. On AWA1, the accuracy increases by 0.3% (SS) and 14.2% (PS); on AWA2, by 0.4% (SS) and 16.0% (PS); on SUN, by 5.2% (SS) and 4.3% (PS); and on CUB, by 0.3% (SS). We may note that currently no other single method claims the best results on all the datasets simultaneously.

The experimental results show that our proposed model performs better on the coarse-grained datasets (AWA1, AWA2) than on the fine-grained datasets (SUN, CUB). The discriminative embedding has more effect on the coarse-grained datasets, where the inter-class semantics differ considerably, because the triplet loss minimizes the pairwise distances between all similarly labeled examples and separates examples from different classes by a large margin[27], making the learned features discriminative for object recognition. At the same time, the regressor feedback helps the generated samples to be representative and accurate for the corresponding class and achieves the coarse-to-fine process at each iteration; it contributes especially to the fine-grained datasets, where the classes have complex intra-class semantics. Besides, the large number of classes and the relatively few training samples in SUN and CUB limit the accuracy improvement to a slight one.

We perform t-SNE visualization[37] to compare the test image features predicted by our proposed model (left) and the original test image features for the AWA2 dataset (right) in Figure 4. Each color represents one class, and all image features are embedded into two dimensions using t-SNE. The predicted image features lie close to the original ones for most classes, which indicates that our proposed model is able to capture the underlying distribution of the data.

Taking the PS of AWA2 as an example, Figure 5 shows the classification results of the 10 unseen classes. The average per-class top-1 accuracy is 78.5%, as shown in Table 2. The diagonal of the confusion matrix gives the number of correct classifications for each class. For example, there are 535 correct classifications of sheep, and the false positives are 2 for dolphins, 95 for bats, 4 for seals, 1 for blue+whale, and 30 for walrus. Note that classes 43 and 46 have a relatively small number of correct classifications in the confusion matrix but still a high accuracy, because they have few samples in the dataset.

The different colors of the class prediction error chart, from bottom to top, represent class numbers 40-49 in order, which intuitively shows the proportion of correct predictions per class. Its vertical axis corresponds to the number of predicted classes in the left part, and it is observed that all 10 unseen classes have a high accuracy.

The ROC curve and the AUC value visualize the tradeoff between the false positive rate (1 - specificity) and the sensitivity (true positive rate) as a measure of the performance of the classifier[38]. Figure 6 shows the ROC curves and the AUC values for the KNN and the SVM classifiers on the PS of AWA2. Each chart has 10 curves, and each curve represents the classification result of one class. Comparing the two parts of Figure 6, we find that the ROC curves of the 10 unseen classes in the right part are close to the top-left corner of the plot, and the AUC value is still higher than 0.9 even for the lowest ROC curve, that of class 41. The ROC curves for the KNN classifier rise more gently, i.e., they achieve a lower true positive rate at a given false positive rate[38] than the ROC curves of the SVM, and their AUC values are smaller than those of the SVM. The SVM classifier performs significantly better with our proposed model.
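
For readers unfamiliar with these quantities, the following sketch computes the ROC points and the AUC value for a single class in a one-vs-rest fashion. It is a plain-Python illustration with made-up scores; the function names are ours, and it is not the evaluation code used for Figure 6.

```python
def roc_points(scores, is_positive):
    """(FPR, TPR) points obtained by lowering a threshold over the scores."""
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Classifier scores for one class; 1 marks samples that belong to that class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
is_positive = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, is_positive)
area = auc(curve)  # 8/9: eight of the nine positive/negative pairs are ranked correctly
```

A curve that hugs the top-left corner yields an area close to 1, which is the behaviour reported for the SVM classifier.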

4.3 The Generalized Zero-Shot Learning

In real-world applications, it is not known in advance whether a novel image belongs to a seen or an unseen class. We therefore consider the generalized setting, in which both seen and unseen classes are used during testing. Generalized zero-shot learning is thus more meaningful from a practical point of view[4].

Here, we use the same models trained on the PS of the datasets to evaluate the performance of generalized zero-shot learning; details of the PS are shown in Table 1. The SVM is evaluated separately on the seen classes and the unseen classes. We use the harmonic mean of the ${{\cal Y}_{\cal S}}$ and ${{\cal Y}_{\cal U}}$ accuracies[4] as the measure for evaluating generalized zero-shot learning:

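$$H=\frac{2\times ac{c_{{{\cal Y}_{\cal S}}}}\times ac{c_{{{\cal Y}_{\cal U}}}}}{ac{c_{{{\cal Y}_{\cal S}}}}+ac{c_{{{\cal Y}_{\cal U}}}}}$$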

where ${ac{c_{{{\cal Y}_{\cal S}}}}}$ and ${ac{c_{{{\cal Y}_{\cal U}}}}}$ represent the accuracy of the seen ( ${{\cal Y}_{\cal S}}$ ) and unseen ( ${{\cal Y}_{\cal U}}$ ) classes respectively.
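
A short sketch of this measure, with illustrative accuracy values that are not taken from Table 3, shows why the harmonic mean is preferred: it stays low unless both the seen and the unseen accuracies are high.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of the seen-class and unseen-class accuracies."""
    if acc_seen + acc_unseen == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


balanced = harmonic_mean(0.5, 0.5)    # 0.5
unbalanced = harmonic_mean(0.8, 0.2)  # about 0.32, although the arithmetic mean is also 0.5
```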

The results for generalized zero-shot learning are shown in Table 3. In this setting, the accuracy of our model improves significantly on SUN, CUB, AWA1 and AWA2, by 22%, 19.3%, 48.6% and 47.5%, respectively. On the coarse-grained datasets (AWA1, AWA2), the increase is larger than that on the fine-grained datasets (SUN, CUB).

As shown in Table 3, the generalized zero-shot learning results are lower than the zero-shot learning results. This is because the training classes act as distractors for the image features that come from the test classes[4]. Our proposed model can reconstruct the original image features and alleviate the problem of projection domain shift[9]. Generalized zero-shot learning requires more complicated techniques, and our model studies the problem from a new perspective.

On the basis of the above, the discriminative embedding regulates the inter-class and intra-class distances between the learned features and preserves the discriminative information. This clusters samples of the same class and separates samples of different classes, which makes the learned features discriminative. The semantic embedding is used for generalizing the semantic knowledge from the seen classes to an unseen class. It joins the intermediate embedding space, so that the generator input contains both the image feature and the semantic information. The generator combines the semantic embeddings with the discriminative embeddings and utilizes the correlation between them to generate samples. The regressor feedback provides a generalization to the semantic space and the discriminative embedding space. The recurrent structure is trained to produce better samples at each iteration, realizing the coarse-to-fine process. This weakens the interference between seen and unseen classes and alleviates the susceptibility to domain shift. The significant improvement in GZSL strongly suggests that our proposed model is robust and universal.

5 Conclusion and future work

We propose a discriminative embedding autoencoder with a regressor feedback model for ZSL. The autoencoder generates samples for the classification of the unseen classes. To obtain class-specific and high-quality samples of the unseen classes, our model makes two contributions. First, the discriminative embedding, which is learned from the image features, regulates the inter-class and intra-class distances between the learned features. The encoder acts as the discriminator and maps the image features from a high dimension to a low dimension. The intermediate embedding space is jointly composed of the discriminative and semantic embedding spaces. The decoder aims to reconstruct the original image feature and provide a generalization to the unseen classes. Second, the feedback mechanism allows the network to carry the reconstructed samples back to previous layers and refine the discriminative embedding and the semantic embedding. The recurrent structure is trained to produce better output at each iteration, realizing the coarse-to-fine process. The final goal of ZSL is to classify the unseen classes, and all the above operations serve to generate better samples of the unseen classes, making the classifier more accurate. The experimental results show that our proposed model compares favorably with the state-of-the-art models on four benchmark datasets, and the accuracy is improved especially in GZSL.

There are several possible improvements for future work. The autoencoder could be replaced with another generative model such as a GAN[39] or one of its variants. More intricate forms of attribute relations could be explored for the classification of the unseen classes. The intermediate embedding space could also fuse multiple semantic representations of classes.

Bibliography

[1] I. Biederman, "Recognition-by-components: a theory of human image understanding," Psychological Review, vol. 94, no. 2, p. 115, 1987.
[2] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, and S. Gong, "Recent advances in zero-shot recognition," arXiv preprint arXiv:1710.04837, 2017. [Online]. Available: http://arxiv.org/abs/1710.04837
[3] P. Morgado and N. Vasconcelos, "Semantically consistent regularization for zero-shot recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2017, pp. 6060–6069.
[4] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, "Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly," in IEEE Trans. Pattern Anal. Mach. Intell. (PAMI), 2018.
[5] L. Zhang, T. Xiang, and S. Gong, "Learning a deep embedding model for zero-shot learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2017, pp. 2021–2030.
[6] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele, "Latent embeddings for zero-shot classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 69–77.
[7] S. Reed, Z. Akata, H. Lee, and B. Schiele, "Learning deep representations of fine-grained visual descriptions," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 49–58.
[8] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean, "Zero-shot learning by convex combination of semantic embeddings," arXiv preprint arXiv:1312.5650, 2013. [Online]. Available: http://arxiv.org/abs/1312.5650