Adaptive Adjustment with Semantic Feature Space for Zero-Shot Recognition

Jingcai Guo; Song Guo

arXiv:1904.00170·cs.CV·April 2, 2019

TL;DR

This paper introduces a novel zero-shot recognition framework that adaptively adjusts the semantic feature space to address domain shift and hubness issues, leading to improved recognition of unseen classes.

Contribution

To the authors' knowledge, it is the first work to propose adaptive adjustment of the semantic feature space in zero-shot recognition, improving model robustness and training efficiency.

Findings

01

Higher accuracy than all compared methods: 88.8% on AWA, 64.7% on CUB and 56.2% on aPa&Y (Hit@1), and 27.1% on ImageNet (Hit@5).

02

Effective handling of domain shift and hubness problems.

03

Efficient training framework for zero-shot recognition, with faster training than ESZSL, SSE and AMP.

Tables

Table 2: Comparison with State-of-the-art Competitors. SS: Semantic Space (A: Attributes, W: Word Vectors); ACC: accuracy (%).

Method	AWA SS	AWA ACC	CUB SS	CUB ACC	aPa&Y SS	aPa&Y ACC	ImageNet SS	ImageNet ACC
DeViSE [5] ('13)	A/W	56.7/50.4	A/W	33.5	-	-	A/W	12.8
DAP [1] ('14)	A	60.1	A	-	A	38.2	-	-
MTMDL [14] ('14)	A/W	63.7/55.3	A/W	32.3	-	-	-	-
ESZSL [6] ('15)	A	75.3	A	48.7	A	24.3	-	-
SSE [7] ('15)	A	76.3	A	30.4	A	46.2	-	-
RRZSL [4] ('15)	A	80.4	A	52.4	A	48.8	W	-
Ba et al. [15] ('15)	A/W	69.3/58.7	A/W	34.0	-	-	-	-
AMP [8] ('16)	A+W	66.0	A+W	-	-	-	A+W	13.1
JLSE [16] ('16)	A	80.5	A	41.8	A	50.4	-	-
SynC^struct [9] ('16)	A	72.9	A	54.4	-	-	-	-
MLZSC [8] ('16)	A	77.3	A	43.3	-	53.2	-	-
SS-voc [17] ('16)	A/W	78.3/68.9	A/W	-	-	-	A/W	16.8
SAE [18] ('17)	A	84.7	A	61.2	A	55.1	W	26.3
CVAE-ZSL [19] ('17)	A	71.4	A	52.1	-	-	-	-
CLN+KRR [10] ('17)	A	81.0	A	58.6	-	-	-	-
MFMR [11] ('17)	A	76.6	A	46.2	A	46.4	-	-
RELATION NET [20] ('18)	A	84.5	A	62.0	-	-	-	-
CAPD-ZSL [2] ('18)	A	80.8	A	45.3	A	55.0	W	23.6
Ours	A	88.8	A	64.7	A	56.2	W	27.1

Table 1: Dataset Settings. SC/UC: Seen/Unseen Class; SS: Semantic Space, A: Attributes, W: Word Vectors.

Dataset	Instances	SC	UC	SS	Accuracy
AWA	30475	40	10	A	Hit@1
CUB	11788	150	50	A	Hit@1
aPa&Y	15339	20	12	A	Hit@1
ImageNet	$2.54 \times 10^{5}$	1000	360	W	Hit@5

Equations

$$c\left(x^{(u)}_{i}\right)=\underset{c\in C^{(u)}}{\arg\max}\;\Omega\left(f_{v\rightarrow s}\left(x^{(u)}_{i}\right),p^{(u)}\right), \tag{1}$$

$$\min\sum_{i=1}^{m}\left\|f_{v\rightarrow s}\left(x^{(s)}_{i}\right)-p^{(s)}_{i}\right\|^{2}, \tag{2}$$

$$\min\sum_{i=1}^{m}\left\|x^{(s)}_{i}-f_{v\rightleftharpoons s}\left(x^{(s)}_{i}\right)\right\|^{2},\quad\text{s.t.}\;f_{v\rightarrow s}\left(x^{(s)}_{i}\right)=p^{(s)}_{i}, \tag{3}$$

$${p^{(s_{i})}}^{\prime}=\lambda_{1}p^{(s_{i})}+\gamma_{1}\frac{1}{z}\sum_{j=1}^{z}f_{v\rightarrow s}\left(x^{(s_{i})}_{j}\right), \tag{4}$$

$${p^{(u_{i})}}^{\prime}=\lambda_{2}p^{(u_{i})}+\gamma_{2}\sum_{j=1}^{k}\frac{\Omega\left(p^{(u_{i})},p^{(s_{j})}\right)}{\sum\Omega}\cdot p^{(s_{j})}, \tag{5}$$

$$r=\sum_{i=1}^{n}\sum_{j=1}^{d}\left\|f_{v\rightarrow s}\left(x^{(s_{i})}_{j}\right)-O_{i}\right\|^{2}, \tag{6}$$

$$J=\sum_{i=1}^{m}\left\|x^{(s)}_{i}-f_{v\rightleftharpoons s}\left(x^{(s)}_{i}\right)\right\|^{2}+\alpha\sum_{i=1}^{n}\left\|f_{v\rightarrow s}\left(x^{(s)}_{i}\right)-O_{y^{(s)}_{i}}\right\|^{2},\quad\text{s.t.}\;f_{v\rightarrow s}\left(x^{(s)}_{i}\right)=p^{(s)}_{i}. \tag{7}$$

$$J=\left\|\mathbf{X}-\mathbf{W}^{\prime}\mathbf{W}\mathbf{X}\right\|^{2}+\alpha\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2},\quad\text{s.t.}\;\mathbf{W}\mathbf{X}=\mathbf{P}, \tag{8}$$

$$J=\frac{1}{2}\left\|\mathbf{X}-\mathbf{W}^{\top}\mathbf{P}\right\|^{2}+\frac{\alpha}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2},\quad\text{s.t.}\;\mathbf{W}\mathbf{X}=\mathbf{P}. \tag{9}$$

$$J=\frac{1}{2}\left\|\mathbf{X}^{\top}-\mathbf{P}^{\top}\mathbf{W}\right\|^{2}+\frac{\alpha}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2}+\frac{\beta}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{P}\right\|^{2}. \tag{10}$$

$$\mathbf{P}\mathbf{P}^{\top}\mathbf{W}+(\alpha+\beta)\mathbf{W}\mathbf{X}\mathbf{X}^{\top}-\left[(1+\beta)\mathbf{P}+\alpha\mathbf{O}\right]\mathbf{X}^{\top}=0. \tag{11}$$

$$\mathbf{L}\mathbf{W}+\mathbf{W}\mathbf{R}+\mathbf{M}=0, \tag{12}$$

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning

Full text

Abstract

In recent years, zero-shot recognition (ZSR) has gained increasing attention in the machine learning and image processing fields. It aims at recognizing unseen class instances with knowledge transferred from seen classes. This is typically achieved by exploiting a pre-defined semantic feature space (FS), i.e., semantic attributes or word vectors, as a bridge to transfer knowledge between seen and unseen classes. However, due to the absence of unseen classes during training, conventional ZSR easily suffers from the domain shift and hubness problems. In this paper, we propose a novel ZSR learning framework that handles these two issues by adaptively adjusting the semantic FS. To the best of our knowledge, our work is the first to consider the adaptive adjustment of the semantic FS in ZSR. Moreover, our solution can be formulated as a more efficient framework that significantly speeds up training. Extensive experiments show a remarkable performance improvement of our model over existing methods.

**Index Terms:** Adaptive Adjustment, Zero-Shot Recognition, Semantic Features, Domain Shift, Hubness.

1 Introduction and Related Work

Zero-shot recognition (ZSR) imitates the human ability to recognize new unseen classes. It is achieved by exploiting labeled seen class instances and certain knowledge that is shared between seen and unseen classes [1, 2]. This knowledge, i.e., attributes, exists in a high-dimensional vector space called the semantic feature space (FS). The attributes are meaningful high-level information about instances such as their shapes, colors, components, textures, etc. Semantic features describe a class or an instance, in contrast to typical classification, which names an instance. Intuitively, similar classes have similar patterns in the semantic FS. These particular patterns are called prototypes. In ZSR, the common practice is first to map an unseen class instance from its original FS, i.e., the visual FS, to the semantic FS by a mapping function trained on seen classes. Then, with these semantic features, we search for the most closely related prototype and assign its corresponding class to the instance.

However, the mapping function, one of the key building blocks in ZSR, is trained solely on seen classes. Although the knowledge is shared by both seen and unseen classes, the training and testing classes are different. Due to the absence of unseen classes during training, ZSR easily suffers from the domain shift problem [3]: when unseen class instances are mapped from the visual to the semantic FS, the obtained results may shift away from the real ones (prototypes). Moreover, during the search step, a small number of prototypes may easily become the most related prototypes of most testing unseen class instances. This challenge is the so-called hubness problem [4].

To deal with these issues, several transductive learning based methods [3] assume that the (unlabelled) unseen class instances are available all at once during training. DeViSE [5] trains a linear mapping between the visual and semantic FS with an effective ranking loss formulation. ESZSL [6] utilizes the square loss to learn the bilinear compatibility and adds a regularization term to the objective with respect to the Frobenius norm. SSE [7] uses the mixture of seen class parts as the intermediate FS. AMP [8] embeds the visual features into the attribute space. SynC^struct [9] and CLN+KRR [10] jointly embed several kinds of textual features and visual features to ground attributes. MFMR [11] leverages matrix tri-factorization with manifold regularizers to enhance the mapping between the visual and semantic FS. With the popularity of generative adversarial networks (GANs), GANZrl [12] applies GANs to synthesize instances with specified semantics to cover a higher diversity of seen classes. Instead, GAZSL [13] leverages GANs to imagine unseen classes from text descriptions. Despite these efforts, the domain shift and hubness problems are still open issues.

In this paper, we propose a novel model based on a specific learning framework to adaptively adjust the semantic FS by considering both the prototypes and the global distribution of data (Fig. 1, bottom). Specifically, some key building blocks of ZSR, i.e., the semantic FS and the distribution of data, have not received comparable attention. Conventional ZSR models normally regard the pre-defined semantic FS as unchangeable and keep each prototype fixed during training. However, we observe that when unseen class instances are mapped to the semantic FS, the obtained features are quite concentrated. Furthermore, some pre-defined prototypes are also too closely distributed. These deficiencies affect a model's ability to adapt and generalize to unseen classes. Just as human understanding of things improves constantly, we argue that the semantic FS also needs to be adjusted in order to mitigate the domain shift and hubness problems. Moreover, we propose to combine the adjustment with a cycle mapping, which first maps the instance from the visual to the semantic FS and then back again, making the mapping more robust and further alleviating the domain shift problem (Fig. 1, top). We formulate the above steps as a more efficient framework that significantly speeds up the training of ZSR.

2 Proposed Approach

2.1 Cycle Mapping

In zero-shot recognition, for the visual to semantic mapping, the recognition can be described as:

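$$c\left(x^{(u)}_{i}\right)=\underset{c\in C^{(u)}}{\arg\max}\;\Omega\left(f_{v\rightarrow s}\left(x^{(u)}_{i}\right),p^{(u)}\right), \tag{1}$$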

where $c\left(x^{(u)}_{i}\right)$ is the predicted class, chosen from the unseen class set $C^{(u)}$, of the unseen class instance $x^{(u)}_{i}$; $\Omega(\cdot,\cdot)$ is a similarity measurement; $p^{(u)}$ denotes the unseen class prototypes; and $f_{v\rightarrow s}(\cdot)$ is the mapping function, trained on labeled seen classes, that maps from the visual to the semantic feature space. The training can be described as:

$$\min\sum_{i=1}^{m}\left\|f_{v\rightarrow s}\left(x^{(s)}_{i}\right)-p^{(s)}_{i}\right\|^{2}, \tag{2}$$

where $m$ is the number of seen class instances, $x^{(s)}_{i}$ is the $i$-th seen class instance and $p_{i}^{(s)}$ is the prototype corresponding to $x^{(s)}_{i}$. Similarly, for the semantic to visual mapping, we aim to find a mapping that works in the reverse direction, from the semantic to the visual FS. In our model, we combine both mapping directions into a cycle mapping. It can be formulated as an encoder-decoder structure $f_{v\rightleftharpoons s}\left(x^{(s)}\right)=f_{s\rightarrow v}\left(f_{v\rightarrow s}\left(x^{(s)}\right)\right)$. The training can be described as:

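$$\min\sum_{i=1}^{m}\left\|x^{(s)}_{i}-f_{v\rightleftharpoons s}\left(x^{(s)}_{i}\right)\right\|^{2},\quad\text{s.t.}\;f_{v\rightarrow s}\left(x^{(s)}_{i}\right)=p^{(s)}_{i}, \tag{3}$$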

where $f_{v\rightleftharpoons s}\left(x^{(s)}_{i}\right)$ maps $x^{(s)}_{i}$ from the visual to the semantic FS, then reconstructs it by mapping it back from the semantic to the visual FS. The constraint $f_{v\rightarrow s}\left(x^{(s)}_{i}\right)=p_{i}^{(s)}$ is applied to learn the exact mapping between the visual and semantic FS.
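The recognition rule of Eq. (1) and the reconstruction objective of Eq. (3) can be illustrated with a minimal NumPy sketch. It assumes a linear mapping $f_{v\rightarrow s}(x)=\mathbf{W}x$ with the decoder tied to $\mathbf{W}^{\top}$ and cosine similarity for $\Omega$; the function names and the array layout (one column per instance in `X`, one row per class in `P_u`) are illustrative choices, not part of the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Pairwise cosine similarity between the rows of a (n, k) and b (c, k)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def predict_unseen(W, X_u, P_u):
    """Eq. (1): map to the semantic FS, then pick the most similar prototype.

    W   : (k, d) weights of an assumed linear visual-to-semantic mapping
    X_u : (d, n) unseen-class visual features, one column per instance
    P_u : (c, k) unseen-class prototypes, one row per class
    """
    S = (W @ X_u).T                      # (n, k) semantic features
    return np.argmax(cosine_similarity(S, P_u), axis=1)

def cycle_reconstruction_error(W, X):
    """Objective of Eq. (3) for a linear encoder W and tied decoder W^T."""
    return np.linalg.norm(X - W.T @ (W @ X)) ** 2

# Toy usage with hypothetical 2-D features and two unseen classes.
W = np.eye(2)
X_u = np.array([[1.0, 0.1],
                [0.1, 1.0]])
P_u = np.array([[1.0, 0.0],
                [0.0, 1.0]])
predictions = predict_unseen(W, X_u, P_u)
```

The sketch leaves out the constraint $f_{v\rightarrow s}(x^{(s)}_{i})=p^{(s)}_{i}$, which is handled during training rather than at prediction time.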

2.2 Adaptive Adjustment

Following the cycle mapping, we propose to adaptively adjust the semantic FS in the following steps.

1). For the adjustment of seen class prototypes, we focus on the centroid and current distribution of each instance within the semantic FS:

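$${p^{(s_{i})}}^{\prime}=\lambda_{1}p^{(s_{i})}+\gamma_{1}\frac{1}{z}\sum_{j=1}^{z}f_{v\rightarrow s}\left(x^{(s_{i})}_{j}\right), \tag{4}$$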

where ${p^{(s_{i})}}^{\prime}$ and $p^{(s_{i})}$ are the updated and original prototype of the $i$ -th seen class, respectively; $x^{(s_{i})}_{j}$ is the instance and $z$ is the number of instances belonging to this class; $f_{v\rightarrow s}\left(x^{(s_{i})}_{j}\right)$ calculates the semantic features of $x^{(s_{i})}_{j}$ ; $\lambda_{1}$ , $\gamma_{1}$ are two hyper-parameters that control the balance of these two terms.

2). For the adjustment of unseen class prototypes, because unseen class instances are not available during training in ZSR, we cannot adjust the prototypes directly. Instead, we propose to adjust the unseen class prototypes by associating them with seen class prototypes:

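$${p^{(u_{i})}}^{\prime}=\lambda_{2}p^{(u_{i})}+\gamma_{2}\sum_{j=1}^{k}\frac{\Omega\left(p^{(u_{i})},p^{(s_{j})}\right)}{\sum\Omega}\cdot p^{(s_{j})}, \tag{5}$$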

where ${p^{(u_{i})}}^{\prime}$ and $p^{(u_{i})}$ are the updated and original prototype of the $i$ -th unseen class, respectively; $p^{(s_{j})}$ ( $j\in\left[1,k\right]$ ) are the $k$ nearest seen class prototype neighbours of $p^{(u_{i})}$ ; $\Omega(\cdot,\cdot)$ is a similarity measurement; $\lambda_{2}$ , $\gamma_{2}$ are also two hyper-parameters.

3). For the adjustment of global data distribution, we propose a regularization term which considers both the diversity among different class instances and the identity within same class instances:

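$$r=\sum_{i=1}^{n}\sum_{j=1}^{d}\left\|f_{v\rightarrow s}\left(x^{(s_{i})}_{j}\right)-O_{i}\right\|^{2}, \tag{6}$$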

where $n$ is the number of seen classes, $d$ is the number of instances belonging to the $i$-th seen class $s_{i}$, and $O_{i}$ is the semantic centroid of the $i$-th seen class, which can be calculated by $\frac{1}{d}\underset{j}{\sum}f_{v\rightarrow s}(x^{(s_{i})}_{j})$.
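The three adjustment steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: $\Omega$ is taken to be cosine similarity, the denominator $\sum\Omega$ in Eq. (5) is read as the sum of similarities over the same $k$ neighbours (assumed positive), and all function names are hypothetical. The default hyper-parameter values are the ones reported in Section 3.1.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjust_seen_prototype(p, S_class, lam1=0.75, gam1=0.25):
    """Eq. (4): blend the prototype p (k,) with the centroid of the mapped
    semantic features S_class (z, k) of the instances of that class."""
    return lam1 * p + gam1 * S_class.mean(axis=0)

def adjust_unseen_prototype(p_u, P_seen, k, lam2=0.8, gam2=0.2):
    """Eq. (5): blend p_u with its k most similar seen prototypes,
    weighted by normalized similarity (assumed normalization)."""
    sims = np.array([cosine(p_u, p_s) for p_s in P_seen])
    nn = np.argsort(-sims)[:k]               # indices of the k nearest
    weights = sims[nn] / sims[nn].sum()      # assumes a positive sum
    return lam2 * p_u + gam2 * (weights @ P_seen[nn])

def distribution_regularizer(S, labels):
    """Eq. (6): squared distance of every mapped instance in S (N, k)
    to the semantic centroid of its own class."""
    r = 0.0
    for c in np.unique(labels):
        S_c = S[labels == c]
        r += np.sum((S_c - S_c.mean(axis=0)) ** 2)
    return r
```

For example, with `p = [1, 0]` and two mapped instances at `[0, 1]`, `adjust_seen_prototype` moves the prototype a quarter of the way toward the class centroid.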

2.3 Unified Framework

Our model is optimized by alternating over Eqs. (3)-(6). Specifically, we first optimize Eq. (3) to obtain initial weights for the mapping function. Then Eqs. (4) and (5) are applied to adjust the prototypes. Lastly, Eqs. (3) and (6) are jointly optimized to obtain the updated weights and adaptively adjust the global distribution at the same time. These steps are performed iteratively to reach an optimum (Fig. 1).

The adaptive adjustment can be combined with the cycle mapping, i.e., an encoder-decoder structural model, into a unified framework. Combining Eqs. (3) and (6), the overall objective can be described as:

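$$J=\sum_{i=1}^{m}\left\|x^{(s)}_{i}-f_{v\rightleftharpoons s}\left(x^{(s)}_{i}\right)\right\|^{2}+\alpha\sum_{i=1}^{n}\left\|f_{v\rightarrow s}\left(x^{(s)}_{i}\right)-O_{y^{(s)}_{i}}\right\|^{2},\quad\text{s.t.}\;f_{v\rightarrow s}\left(x^{(s)}_{i}\right)=p^{(s)}_{i}. \tag{7}$$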

We use a hyper-parameter $\alpha$ to balance the importance of these two terms. To simplify, we rewrite the objective function to matrix form:

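$$J=\left\|\mathbf{X}-\mathbf{W}^{\prime}\mathbf{W}\mathbf{X}\right\|^{2}+\alpha\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2},\quad\text{s.t.}\;\mathbf{W}\mathbf{X}=\mathbf{P}, \tag{8}$$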

where $\mathbf{W}$ and $\mathbf{W}^{\prime}$ are the mapping weights of $f_{v\rightarrow s}(\cdot)$ and $f_{s\rightarrow v}(\cdot)$, respectively. To halve the number of parameters, we use tied weights [21], i.e., $\mathbf{W}^{\prime}=\mathbf{W}^{\top}$. Then, by substituting $\mathbf{W}\mathbf{X}$ with $\mathbf{P}$ in the reconstruction term, our objective can be rewritten as:

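$$J=\frac{1}{2}\left\|\mathbf{X}-\mathbf{W}^{\top}\mathbf{P}\right\|^{2}+\frac{\alpha}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2},\quad\text{s.t.}\;\mathbf{W}\mathbf{X}=\mathbf{P}. \tag{9}$$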

Eq. (9) has a hard constraint $\mathbf{W}\mathbf{X}=\mathbf{P}$ that is not easy to solve efficiently, so we relax it to a penalty term $\frac{\beta}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{P}\right\|^{2}$, where $\beta$ is also a hyper-parameter. Using the trace properties $\mathrm{Tr}(\mathbf{X})=\mathrm{Tr}(\mathbf{X}^{\top})$ and $\mathrm{Tr}(\mathbf{W}^{\top}\mathbf{P})=\mathrm{Tr}(\mathbf{P}^{\top}\mathbf{W})$, the objective can be further rewritten as:

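$$J=\frac{1}{2}\left\|\mathbf{X}^{\top}-\mathbf{P}^{\top}\mathbf{W}\right\|^{2}+\frac{\alpha}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2}+\frac{\beta}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{P}\right\|^{2}. \tag{10}$$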

To solve it, we take a derivative of Eq. (10) with respect to $\mathbf{W}$ , and set it to zero, i.e.,

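$$\mathbf{P}\mathbf{P}^{\top}\mathbf{W}+(\alpha+\beta)\mathbf{W}\mathbf{X}\mathbf{X}^{\top}-\left[(1+\beta)\mathbf{P}+\alpha\mathbf{O}\right]\mathbf{X}^{\top}=0. \tag{11}$$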
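The stationarity condition in Eq. (11) can be checked term by term. The following worked step assumes that the norms in Eq. (10) are Frobenius norms and uses the standard matrix-calculus identities for quadratic forms; it is a reader's verification, not a derivation taken from the paper.

$$
\begin{aligned}
\frac{\partial}{\partial\mathbf{W}}\,\frac{1}{2}\left\|\mathbf{X}^{\top}-\mathbf{P}^{\top}\mathbf{W}\right\|^{2} &= \mathbf{P}\left(\mathbf{P}^{\top}\mathbf{W}-\mathbf{X}^{\top}\right)=\mathbf{P}\mathbf{P}^{\top}\mathbf{W}-\mathbf{P}\mathbf{X}^{\top},\\
\frac{\partial}{\partial\mathbf{W}}\,\frac{\alpha}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{O}\right\|^{2} &= \alpha\left(\mathbf{W}\mathbf{X}-\mathbf{O}\right)\mathbf{X}^{\top},\\
\frac{\partial}{\partial\mathbf{W}}\,\frac{\beta}{2}\left\|\mathbf{W}\mathbf{X}-\mathbf{P}\right\|^{2} &= \beta\left(\mathbf{W}\mathbf{X}-\mathbf{P}\right)\mathbf{X}^{\top}.
\end{aligned}
$$

Summing the three gradients and collecting the terms in $\mathbf{W}$ gives $\mathbf{P}\mathbf{P}^{\top}\mathbf{W}+(\alpha+\beta)\mathbf{W}\mathbf{X}\mathbf{X}^{\top}-\left[(1+\beta)\mathbf{P}+\alpha\mathbf{O}\right]\mathbf{X}^{\top}$, which is the left-hand side of Eq. (11).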

We denote $\mathbf{L}=\mathbf{P}\mathbf{P}^{\top}$ , $\mathbf{R}=(\alpha+\beta)\mathbf{X}\mathbf{X}^{\top}$ and $\mathbf{M}=-\left[(1+\beta)\mathbf{P}+\alpha\mathbf{O}\right]\mathbf{X}^{\top}$ . Therefore, Eq. (11) can be rewritten as:

$$\mathbf{L}\mathbf{W}+\mathbf{W}\mathbf{R}+\mathbf{M}=0, \tag{12}$$

which is exactly in the standard form of the Sylvester equation (a generalization of the Lyapunov equation) and can be solved efficiently by an existing solver [22].
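A closed-form solve of this kind might look as follows with SciPy. `scipy.linalg.solve_sylvester(a, b, q)` solves $AX+XB=Q$, so the sketch passes $\mathbf{L}$, $\mathbf{R}$ and $-\mathbf{M}$. SciPy is used here as a stand-in and is not necessarily the solver cited as [22]; the function name `solve_mapping` and the column-per-instance layout of `X`, `P` and `O` are assumptions.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def solve_mapping(X, P, O, alpha, beta):
    """Solve L W + W R + M = 0 for the mapping weights W (k, d).

    X : (d, N) visual features, one column per seen instance
    P : (k, N) prototype of each instance's class, one column per instance
    O : (k, N) semantic centroid of each instance's class (assumed layout)
    """
    L = P @ P.T                                   # (k, k)
    R = (alpha + beta) * (X @ X.T)                # (d, d)
    M = -((1.0 + beta) * P + alpha * O) @ X.T     # (k, d)
    # solve_sylvester(a, b, q) solves a @ W + W @ b = q, hence q = -M.
    return solve_sylvester(L, R, -M)

# Toy usage on random data (hypothetical sizes and hyper-parameters).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
P = rng.standard_normal((3, 20))
O = rng.standard_normal((3, 20))
W = solve_mapping(X, P, O, alpha=0.5, beta=1.0)
```

A unique solution exists when $\mathbf{L}$ and $-\mathbf{R}$ share no eigenvalues, which holds whenever $\mathbf{X}\mathbf{X}^{\top}$ is positive definite, since $\mathbf{L}$ is positive semi-definite.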

3 Experiment

3.1 Dataset and Setting

Our model is evaluated on four benchmark datasets in ZSR: Animals with Attributes (AWA) [1], CUB-200-2011 Birds (CUB) [23], aPascal&Yahoo (aPa&Y) [24] and ILSVRC2012/ILSVRC2010 (ImageNet) [25]. We adopt Hit@k accuracy [5] to evaluate the performance. It is a widely used evaluation criterion in ZSR: the model predicts the top-k most likely class labels for each testing unseen class instance, and the instance is counted as correctly classified if and only if the ground truth is among these top-k labels. Following most methods, the datasets and settings are shown in Table 1.
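The Hit@k criterion described above can be computed as in the following NumPy sketch. The function name and the input layout (a score matrix with one row per test instance and one column per candidate class) are assumptions made for illustration.

```python
import numpy as np

def hit_at_k(scores, labels, k):
    """Fraction of instances whose true class is among the k best-scoring classes.

    scores : (n, c) similarity of each test instance to each candidate class
    labels : (n,) index of the ground-truth class of each instance
    """
    topk = np.argsort(-scores, axis=1)[:, :k]        # k highest scores per row
    hits = (topk == labels[:, None]).any(axis=1)     # true class among them?
    return hits.mean()

# Toy usage with three instances and three candidate classes.
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
labels = np.array([0, 2, 0])
hit1 = hit_at_k(scores, labels, 1)
hit2 = hit_at_k(scores, labels, 2)
```

In this toy case only the first instance is correct at k = 1, and the second becomes correct at k = 2.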

The visual features are extracted with GoogleNet [26], so each image instance is represented by a 1024-dimensional vector. In our model, the cosine similarity is adopted for $\Omega(\cdot,\cdot)$. The hyper-parameters $\lambda_{1}$/$\gamma_{1}$ are set to 0.75/0.25 and $\lambda_{2}$/$\gamma_{2}$ are set to 0.8/0.2, selected by grid search [27]. Our model strictly complies with the zero-shot setting, in which the training of the mapping function relies only on seen classes. All selected comparison methods are under the same settings.

3.2 Results and Analysis

The comparison results are shown in Table 2. Our model outperforms all listed competitors on all four datasets. The accuracy reaches 88.8%, 64.7%, 56.2% and 27.1% for AWA, CUB, aPa&Y and ImageNet, respectively. We then carry out some further analysis. Firstly, we explore the influence of the parameter $k$, the number of nearest neighbors used in the adjustment of unseen class prototypes (Eq. (5)). We evaluate on AWA and CUB for a smaller and a larger k-search, respectively, and the results are shown in Fig. 2 and Fig. 3. For AWA, the preferred range is $k\in[7,17]$ and the most preferred $k$ is around 12. For CUB, the preferred range is $k\in[4,19]$ and the most preferred $k$ is around 16. Secondly, we evaluate the training time. We compare our model with ESZSL [6], SSE [7] and AMP [8]. The results are shown in Fig. 4. The training of our model is much faster than that of these competitors.

4 Conclusion

In this paper, we propose a novel model based on a unified learning framework for zero-shot recognition. Our model adaptively adjusts the semantic feature space by considering both the prototypes and the global distribution of data. Moreover, our model can be formulated as a more efficient framework that significantly speeds up training. Extensive experiments verify the effectiveness of our model, which achieves remarkable performance compared with other representative methods.

Bibliography

[1] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling, “Attribute-based classification for zero-shot visual object categorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 3, pp. 453–465, 2014.
[2] Shafin Rahman, Salman Khan, and Fatih Porikli, “A unified approach for conventional zero-shot, generalized zero-shot, and few-shot learning,” IEEE Transactions on Image Processing, vol. 27, no. 11, pp. 5652–5667, 2018.
[3] Yanwei Fu, Timothy M Hospedales, Tao Xiang, and Shaogang Gong, “Transductive multi-view zero-shot learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 11, pp. 2332–2345, 2015.
[4] Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto, “Ridge regression, hubness, and zero-shot learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2015, pp. 135–151.
[5] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al., “DeViSE: A deep visual-semantic embedding model,” in Advances in Neural Information Processing Systems, 2013, pp. 2121–2129.
[6] Bernardino Romera-Paredes and Philip Torr, “An embarrassingly simple approach to zero-shot learning,” in International Conference on Machine Learning, 2015, pp. 2152–2161.
[7] Ziming Zhang and Venkatesh Saligrama, “Zero-shot learning via semantic similarity embedding,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4166–4174.
[8] Maxime Bucher, Stéphane Herbin, and Frédéric Jurie, “Improving semantic embedding consistency by metric learning for zero-shot classification,” in European Conference on Computer Vision. Springer, 2016, pp. 730–746.