Improving Generalization via Attribute Selection on Out-of-the-box Data

Xiaofeng Xu; Ivor W. Tsang; and Chuancai Liu

arXiv:1907.11397·cs.CV·October 29, 2019

TL;DR

This paper proposes an attribute selection method using pseudo data generated by a generative model to improve zero-shot learning, achieving state-of-the-art results.

Contribution

It introduces an iterative attribute selection strategy leveraging out-of-the-box pseudo data to enhance generalization in ZSL.

Findings

01 IAS significantly improves ZSL performance

02 The method achieves state-of-the-art results

03 Selected attributes generalize well to unseen data

Tables

Table 1: Notations and Descriptions.

Notation	Description	Notation	Description
$D_{s}$	training data (seen)	$N_{s}$	#training samples
$D_{u}$	test data (unseen)	$N_{u}$	#test samples
$D_{g}$	out-of-the-box data	$N_{g}$	#generated samples
$𝒳$	image features	$d$	#dimension of features
$𝒴_{s}$	training classes (seen)	$K$	#training classes
$𝒴_{u}$	test classes (unseen)	$L$	#test classes
$𝐀$	attribute matrix	$𝐚_{y}$	attribute vector of label $y$
$N_{a}$	#all the attributes	$𝒜$	set of original attributes
$𝐬$	selection vector	$𝒮$	subset of selected attributes

Table 2: Statistics of Four Datasets (AwA, aPY, CUB and SUN) with Two Dataset Splits (SS and PS).

Dataset	#Attributes	#Classes (Total)	#Classes (Training)	#Classes (Test)	#Images SS (Training)	#Images SS (Test)	#Images PS (Training)	#Images PS (Test)
AwA	85	50	40	10	24295	6180	19832	5685
aPY	64	32	20	12	12695	2644	5932	7924
CUB	312	200	150	50	8855	2933	7057	2967
SUN	102	717	645	72	12900	1440	10320	1440

Table 3: Zero-Shot Classification Accuracy Comparison on Benchmarks. Numbers in Brackets are Relative Performance Gains. ‘-’ Indicates that no Reported Results are Available.

$†$: Results published in the paper. $‡$: Results reproduced.
Methods	AwA (SS)	AwA (PS)	aPY (SS)	aPY (PS)	CUB (SS)	CUB (PS)	SUN (SS)	SUN (PS)
DAP $‡$	64.44	46.22	35.73	39.67	43.47	40.23	41.25	45.83
DAP+AS $†$	-	48.29	-	34.87	-	41.55	-	42.27
DAP+IAS	86.65(+22.21)	71.88(+25.66)	57.12(+21.39)	43.06(+3.39)	55.35(+11.88)	54.22(+13.99)	47.85(+6.60)	50.56(+4.73)
LatEm $‡$	71.51	48.33	24.43	34.66	50.38	48.57	58.75	55.13
LatEm+AS $†$	-	59.07	-	38.82	-	52.82	-	58.09
LatEm+IAS	81.83(+10.32)	67.13(+18.80)	47.22(+22.79)	48.36(+13.70)	56.05(+5.67)	52.14(+3.57)	59.03(+0.28)	56.18(+1.05)
SAE $‡$	79.19	48.48	8.33	8.33	26.41	24.65	36.94	32.78
SAE+IAS	87.95(+8.76)	70.36(+21.88)	45.90(+37.57)	38.53(+30.20)	48.21(+21.80)	42.85(+18.20)	45.14(+8.20)	42.22(+9.44)
MFMR $‡$	86.06	68.04	52.16	34.09	43.09	39.55	50.49	53.33
MFMR+IAS	87.10(+1.04)	71.37(+3.33)	58.51(+6.35)	37.67(+3.58)	51.40(+8.31)	47.89(+8.34)	58.47(+7.98)	58.26(+4.93)
GANZrl $†$	86.23	-	-	-	62.56	-	-	-
GANZrl+IAS	88.51(+2.28)	-	-	-	66.76(+4.20)	-	-	-
fVG $†$	-	70.30	-	-	-	72.90	-	65.60
fVG+IAS	-	74.28(+3.98)	-	-	-	74.53(+1.63)	-	68.39(+2.79)
LLAE $†$	85.24	-	56.16	-	61.93	-	-	-
LLAE+IAS	88.95(+3.71)	-	60.88(+4.72)	-	64.42(+2.49)	-	-	-

Table 4: Distances Between Different Data Distributions. $\mathcal{X}_{g}$ Indicates the Generated Out-of-the-Box Data, $\mathcal{X}_{u}$ Indicates the Unseen Test Data and $\mathcal{X}_{s}$ Indicates the Seen Training Data.

Metrics	$𝒳_{g} \sim 𝒳_{u}$	$𝒳_{g} \sim 𝒳_{s}$	$𝒳_{s} \sim 𝒳_{u}$
Wasserstein Distance	5.99	19.09	18.97
KL Divergence	0.321	0.630	0.703
Hellinger Distance	7.78	16.87	17.15
Bhattacharyya Distance	0.0808	0.159	0.176

Table 5: Subsets of the Key Attributes Selected by DAP, LatEm and SAE on the AwA Dataset (20 Attributes Selected Out of 85 Attributes). Attributes that Appear in All Three Methods are in Boldface, and Those that Appear in Two Methods are in Italics.

DAP		LatEm		SAE
ground	fish	hands	pads	black	paws
hands	fields	ground	forest	ground	ocean
plains	smelly	bipedal	gray	pads	yellow
tunnels	pads	claws	coastal	gray	group
forest	yellow	black	yellow	hands	tunnels
tail	scavenger	fish	strainteeth	hooves	white
gray	swims	fields	horns	domestic	fish
hibernate	black	paws	scavenger	tail	fields
hooves	paws	blue	tail	skimmer	forest
jungle	weak	hooves	white	arctic	scavenger

Equations

$$L(y,f(x;\mathbf{W}))=\frac{1}{N_{s}}\sum_{n=1}^{N_{s}}l(y_{n},f(x_{n};\mathbf{W}))+\Omega(\mathbf{W}),$$

$$f(x;\mathbf{W})=\arg\max_{y\in\mathcal{Y}}F(x,y;\mathbf{W}),$$

$$F(x,y;\mathbf{W})=\theta(x)^{T}\mathbf{W}\varphi(y),$$

$$L(y,f(x;\mathbf{s},\mathbf{W}))=\frac{1}{N_{s}}\sum_{n=1}^{N_{s}}\Big\{l_{\mathrm{ZSL}}(y_{n},f(x_{n};\mathbf{s},\mathbf{W}))+\alpha\,l_{p}(\theta(x_{n}),\varphi(y_{n});\mathbf{s})-\beta\,l_{v}(\theta(x_{n}),\mu;\mathbf{s})\Big\},$$

$$d(\mathbf{a}_{i},\mathbf{a}_{j})=\sum_{m=1}^{N_{a}}\Delta(\mathbf{a}_{i}^{(m)},\mathbf{a}_{j}^{(m)}),$$

$$\tau=\min_{i\neq j}d(\mathbf{a}_{i},\mathbf{a}_{j}),\quad\forall\,1\leq i,j\leq N_{a}.$$

$$d(f(x),\mathbf{a}_{y})\geq\frac{\tau}{2}.$$

$$d(f(x),\mathbf{a}_{y})\geq d(f(x),\mathbf{a}_{r}).$$

$$\sum_{m=1}^{N_{a}}\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})\geq\sum_{m=1}^{N_{a}}\Delta(f^{(m)}(x),\mathbf{a}_{r}^{(m)}).$$

$$\begin{aligned}d(f(x),\mathbf{a}_{y})&=\sum_{m=1}^{N_{a}}\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})\\ &=\frac{1}{2}\sum_{m=1}^{N_{a}}\Big\{\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})+\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})\Big\}\\ &\overset{(i)}{\geq}\frac{1}{2}\sum_{m=1}^{N_{a}}\Big\{\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})+\Delta(f^{(m)}(x),\mathbf{a}_{r}^{(m)})\Big\}\\ &\overset{(ii)}{\geq}\frac{1}{2}\sum_{m=1}^{N_{a}}\Delta(\mathbf{a}_{y}^{(m)},\mathbf{a}_{r}^{(m)})\\ &=\frac{1}{2}d(\mathbf{a}_{y},\mathbf{a}_{r})\overset{(iii)}{\geq}\frac{\tau}{2},\end{aligned}$$

$$\frac{2N_{a}\bar{B}}{\tau},$$

$$d(f(x),\mathbf{a}_{y})=\sum_{m=1}^{N_{a}}\Delta(f^{(m)}(x),\mathbf{a}_{y}^{(m)})\geq\frac{\tau}{2}.$$

$$k\frac{\tau}{2}\leq\sum_{i=1}^{N_{u}}\sum_{m=1}^{N_{a}}\Delta(f^{(m)}(x_{i}),\mathbf{a}_{y_{i}}^{(m)})\leq\sum_{i=1}^{N_{u}}\sum_{m=1}^{N_{a}}B_{m}=N_{u}N_{a}\bar{B},$$

$$D\propto\frac{N_{a}}{k_{a}}\Big[4\log(2/\delta)+8(d+1)\log(13N_{a}/k_{a})\Big],$$

$$p\Bigg(e_{ts}\leq e_{tr}+\sqrt{\frac{1}{N_{s}}\bigg[D\Big(\log\Big(\frac{2N_{s}}{D}\Big)+1\Big)-\log\Big(\frac{\eta}{4}\Big)\bigg]}\Bigg)=1-\eta,$$

$$L(y,f(x;\mathbf{s},\mathbf{W}))=\frac{1}{N_{g}}\sum_{n=1}^{N_{g}}l(y_{n},f(x_{n};\mathbf{s},\mathbf{W}))+\Omega(\mathbf{W}),$$

$$f(x;\mathbf{s},\mathbf{W})=\arg\max_{y\in\mathcal{Y}}F(x,y;\mathbf{s},\mathbf{W}),$$

$$F(x,y;\mathbf{s},\mathbf{W})=\theta(x)^{T}\mathbf{W}(\mathbf{s}\circ\varphi(y)),$$

$$\begin{aligned}l(y_{n},f(x_{n};\mathbf{s},\mathbf{W}))&=\sum_{y\in\mathcal{Y}_{g}}r_{ny}\big[\triangle(y_{n},y)+F(x_{n},y;\mathbf{s},\mathbf{W})-F(x_{n},y_{n};\mathbf{s},\mathbf{W})\big]_{+}\\ &=\sum_{y\in\mathcal{Y}_{g}}r_{ny}\big[\triangle(y_{n},y)+\theta(x_{n})^{T}\mathbf{W}(\mathbf{s}\circ\varphi(y))-\theta(x_{n})^{T}\mathbf{W}(\mathbf{s}\circ\varphi(y_{n}))\big]_{+},\end{aligned}$$

$$L^{t+1}=\frac{1}{N_{g}}\sum_{n=1}^{N_{g}}l^{t+1}(y_{n},f(x_{n};\mathbf{s}^{t+1},\mathbf{W}^{t+1}))+\Omega(\mathbf{W}^{t+1}),\quad\mathrm{s.t.}\;\sum_{s_{i}\in\mathbf{s}^{t+1}}s_{i}=t+1,\;\sum_{s_{j}\in(\mathbf{s}^{t+1}-\mathbf{s}^{t})}s_{j}=1.$$

$$l^{t+1}=\sum_{y\in\mathcal{Y}_{g}}r_{ny}\big[\triangle(y_{n},y)+\theta(x_{n})^{T}\mathbf{W}^{t+1}(\mathbf{s}^{t+1}\circ\varphi(y))-\theta(x_{n})^{T}\mathbf{W}^{t+1}(\mathbf{s}^{t+1}\circ\varphi(y_{n}))\big]_{+}.$$

$$\frac{\partial L^{t+1}}{\partial\mathbf{W}^{t+1}}=\frac{1}{N_{g}}\sum_{n=1}^{N_{g}}\frac{\partial l^{t+1}}{\partial\mathbf{W}^{t+1}}+\frac{1}{2}\alpha\mathbf{W}^{t+1},$$

$$\frac{\partial l^{t+1}}{\partial\mathbf{W}^{t+1}}=\sum_{y\in\mathcal{Y}_{g}}r_{ny}\,\theta(x_{n})^{T}\big(\mathbf{s}^{t}\circ(\varphi(y)-\varphi(y_{n}))\big),$$

$$\mathbf{s}^{t+1}=\arg\min_{\mathbf{s}^{t+1}}\frac{1}{N_{g}}\sum_{n=1}^{N_{g}}l^{t+1}(y_{n},f(x_{n};\mathbf{s}^{t+1},\mathbf{W}^{t+1}))+\Omega(\mathbf{W}^{t+1}),$$

$$L_{\mathrm{VAE}}(x)=-\mathrm{KL}\big(q(z\mid x)\,\|\,p(z)\big)+\frac{1}{L}\sum_{l=1}^{L}\log p(x\mid z^{(l)}),$$

$$L_{\mathrm{AVAE}}(x,\varphi(y))=-\mathrm{KL}\big(q(z\mid x,\varphi(y))\,\|\,p(z\mid\varphi(y))\big)+\frac{1}{L}\sum_{l=1}^{L}\log p(x\mid\varphi(y),z^{(l)}),$$

$$acc=\frac{1}{L}\sum_{y\in\mathcal{Y}_{u}}\frac{\#\text{ correct predictions in }y}{\#\text{ samples in }y},$$

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Geophysical Methods and Applications · Multimodal Machine Learning Applications

Full text

Improving Generalization via Attribute Selection on Out-of-the-box Data

**Xiaofeng Xu (1,2), Ivor W. Tsang (2) and Chuancai Liu (1,3)**

(1) School of Computer Science and Engineering, Nanjing University of Science and Technology.

(2) Centre for Artificial Intelligence, University of Technology Sydney.

(3) Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University.

Keywords: Zero-shot learning, attribute selection, out-of-the-box data, generalization error bound

Abstract

Zero-shot learning (ZSL) aims to recognize unseen objects (test classes) given some other seen objects (training classes), by sharing information of attributes between different objects. Attributes are artificially annotated for objects and treated equally in recent ZSL tasks. However, some inferior attributes with poor predictability or poor discriminability may have negative impacts on the ZSL system performance. This paper first derives a generalization error bound for ZSL tasks. Our theoretical analysis verifies that selecting the subset of key attributes can improve the generalization performance of the original ZSL model, which utilizes all the attributes. Unfortunately, previous attribute selection methods are conducted based on the seen data, and their selected attributes have poor generalization capability to the unseen data, which is unavailable in the training stage of ZSL tasks. Inspired by learning from pseudo relevance feedback, this paper introduces the out-of-the-box data, which is pseudo data generated by an attribute-guided generative model, to mimic the unseen data. After that, we present an iterative attribute selection (IAS) strategy which iteratively selects key attributes based on the out-of-the-box data. Since the distribution of the generated out-of-the-box data is similar to the test data, the key attributes selected by IAS can be effectively generalized to test data. Extensive experiments demonstrate that IAS can significantly improve existing attribute-based ZSL methods and achieve state-of-the-art performance.

1 Introduction

With the rapid development of machine learning technologies, especially the rise of deep neural networks, visual object recognition has made tremendous progress in recent years (Zheng et al., 2018; Shen et al., 2018). These recognition systems even outperform humans when provided with a massive amount of labeled data. However, it is expensive to collect sufficient labeled samples for all natural objects, especially for new concepts and the many fine-grained subordinate categories (Zhou et al., 2019). Therefore, achieving acceptable recognition performance for objects with limited or even no training samples is a challenging but practical problem (Palatucci et al., 2009). Inspired by the human cognition system, which can identify new objects when provided with a description in advance (Murphy, 2004), zero-shot learning (ZSL) has been proposed to recognize unseen objects with no training samples (Cheng et al., 2017; Ji et al., 2019). Since no labeled samples are given for the target classes, we need to collect source classes with sufficient labeled samples and find the connection between the target classes and the source classes.

As a kind of semantic representation, attributes are widely used to transfer knowledge from the seen classes (source) to the unseen classes (target) (Ma et al., 2017). Attributes play a key role in sharing information between classes and govern the performance of zero-shot classification. In previous ZSL works, all the attributes are assumed to be effective and are treated equally. However, as pointed out in Guo et al. (2018), different attributes have different properties, such as distributive entropy and predictability. Attributes with poor predictability or poor discriminability may have a negative impact on ZSL system performance. Poor predictability means that an attribute is hard to recognize correctly from the feature space, and poor discriminability means that an attribute is weak at distinguishing different objects. Hence, not all the attributes are necessary and effective for zero-shot classification.

Based on these observations, selecting the key attributes, instead of using all the attributes, is significant and necessary for constructing ZSL models. Guo et al. (2018) proposed the zero-shot learning with attribute selection (ZSLAS) model, which selects attributes by measuring the distributive entropy and the predictability of attributes based on the training data. ZSLAS can improve the performance of attribute-based ZSL methods, but it generalizes poorly. Since the training classes and the test classes are disjoint in ZSL tasks, the training data is bounded by the box cut by attributes (illustrated in Figure 1). Therefore, the attributes selected based on the training data have poor generalization capability to the unseen test data.

To address this drawback, this paper derives a generalization error bound for the ZSL problem. Since attributes in a ZSL task play the same role as the codewords in the error correcting output code (ECOC) model (Dietterich et al., 1994), we analyze the bound from the perspective of ECOC. Our analyses reveal that the key attributes need to be selected based on data that lies outside the box (i.e. outside the distribution of the training classes). Considering that test data is unavailable during the training stage of ZSL tasks, and inspired by learning from pseudo relevance feedback (Miao et al., 2016), we introduce the out-of-the-box data to mimic the unseen test classes. (Footnote 1: The out-of-the-box data is generated based on the training data and the attribute representation without extra information, which follows the standard zero-shot learning setting.) The out-of-the-box data is generated by an attribute-guided generative model using the same attribute representation as the test classes. Therefore, the out-of-the-box data has a similar distribution to the test data.

Guided by the performance of the ZSL model on the out-of-the-box data, we propose a novel iterative attribute selection (IAS) model to select the key attributes in an iterative manner. Figure 2 illustrates the procedure of the proposed ZSL with iterative attribute selection (ZSLIAS). Unlike the previous ZSLAS, which uses training data to select attributes in a single pass, our IAS first generates out-of-the-box data to mimic the unseen classes, and then iteratively selects key attributes based on the generated out-of-the-box data. During the test stage, the selected attributes are employed as a more efficient semantic representation to improve the original ZSL model. By adopting the proposed IAS, the improved attribute embedding space is more discriminative for the test data, and hence improves the performance of the original ZSL model.

The main contributions of this paper are summarized as follows:

• We present a generalization error analysis for the ZSL problem. Our theoretical analyses prove that selecting the subset of key attributes can improve the generalization performance of the original ZSL model, which utilizes all the attributes.

• Based on our theoretical findings, we propose a novel iterative attribute selection strategy to select key attributes for ZSL tasks.

• Since test data is unseen during the training stage for ZSL tasks, we introduce the out-of-the-box data to mimic test data for attribute selection. Such data, generated by a designed generative model, has a similar distribution to the test data. Therefore, attributes selected based on the out-of-the-box data can be effectively generalized to the unseen test data.

• Extensive experiments demonstrate that IAS can effectively improve the attribute-based ZSL model and achieve state-of-the-art performance.

The rest of the paper is organized as follows. Section 2 reviews related works. Section 3 gives the preliminaries and motivation. Section 4 presents the theoretical analyses on the generalization bound for attribute selection. Section 5 proposes the iterative attribute selection model. Experimental results are reported in Section 6. Section 7 concludes the paper.

2 Related Works

In this section, we review some related works on zero-shot learning, attribute selection and deep generative models.

2.1 Zero-shot Learning

ZSL can recognize new objects using attributes as the intermediate semantic representation. Some researchers adopt the probability-prediction strategy to transfer information. Lampert et al. (2013) proposed a popular baseline, i.e. direct attribute prediction (DAP). DAP learns probabilistic attribute classifiers using the seen data and infers the label of the unseen data by combining the results of pre-trained classifiers.

Most recent works adopt the label-embedding strategy, which directly learns a mapping function from the input feature space to the semantic embedding space. One line of work learns linear compatibility functions. For example, Akata et al. (2015) presented an attribute label embedding (ALE) model which learns a compatibility function combined with a ranking loss. Romera-Paredes et al. (2015) proposed an approach that models the relationships among features, attributes and classes as a network with two linear layers. Another direction is to learn nonlinear compatibility functions. Xian et al. (2016) presented a nonlinear embedding model that augments the bilinear compatibility model by incorporating latent variables. Airola et al. (2017) proposed the first general Kronecker product kernel-based learning model for ZSL tasks. In addition to the classification task, Ji et al. (2019) proposed an attribute network for the zero-shot hashing retrieval task.

2.2 Attribute Selection

Attributes, as a popular kind of semantic representation of visual objects, can be the appearance, a part or a property of objects (Farhadi et al., 2009). For example, the object elephant has the attributes big and long nose, and the object zebra has the attribute striped. Attributes are widely used to transfer information to recognize new objects in ZSL tasks (Sun et al., 2017; Xu et al., 2019). As shown in Figure 1, using attributes as the semantic representation, data of different categories is located in different boxes bounded by the attributes. Since the attribute representations of the seen classes and the unseen classes are different, the boxes with respect to the seen data and the unseen data are disjoint.

In previous ZSL works, all the attributes are assumed to be effective and are treated equally. However, as pointed out in Guo et al. (2018), not all the attributes are effective for recognizing new objects. Therefore, we should select the key attributes to improve the semantic representation. Liu et al. (2014) proposed a novel greedy algorithm which selects attributes based on their discriminating power and reliability. Guo et al. (2018) proposed to select attributes by measuring the distributive entropy and the predictability of attributes based on the training data. In short, previous attribute selection models are conducted based on the training data, which makes the selected attributes generalize poorly to the unseen test data. In contrast, our IAS iteratively selects attributes based on the out-of-the-box data, which has a similar distribution to the test data, and thus the key attributes selected by our model can be more effectively generalized to the unseen test data.

2.3 Attribute-guided Generative Models

Deep generative models (Ma et al., 2017) aim to estimate the joint distribution $p(y;x)$ of samples and labels, by learning the class prior probability $p(y)$ and the class-conditional density $p(x|y)$ separately. The generative model can be extended to a conditional generative model if the generator is conditioned on some extra information, such as attributes in the proposed method. Odena et al. (2017) introduced a conditional version of generative adversarial nets, i.e. CGAN, which can be constructed by simply feeding in the data label. CGAN conditions both the generator and the discriminator, and can generate samples conditioned on class labels. The Conditional Variational Autoencoder (CVAE) (Sohn et al., 2015), an extension of the Variational Autoencoder, is a deep conditional generative model for structured output prediction using Gaussian latent variables. We modify CVAE with the attribute representation to generate out-of-the-box data for the attribute selection.

3 Preliminary and Motivation

3.1 ZSL Task Formulation

We consider zero-shot learning as a task that recognizes unseen classes which have no labeled samples available. Given a training set $D_{s}=\left\{\left(x_{n},y_{n}\right),n=1,\ldots,N_{s}\right\}$, the task of traditional ZSL is to learn a mapping $f:\mathcal{X}\rightarrow\mathcal{Y}$ from the image feature space to the label embedding space, by minimizing the following regularized empirical risk:

$$L(y,f(x;\mathbf{W}))=\frac{1}{N_{s}}\sum_{n=1}^{N_{s}}l(y_{n},f(x_{n};\mathbf{W}))+\Omega(\mathbf{W}),\tag{1}$$

where $l\left(\cdot\right)$ is the loss function, which can be the square loss $1/2(f(x)-y)^{2}$, the logistic loss $\mathrm{log}(1+\mathrm{exp}(-yf(x)))$ or the hinge loss $\mathrm{max}(0,1-yf(x))$. $\mathbf{W}$ is the parameter of mapping $f$, and $\Omega\left(\cdot\right)$ is the regularization term.

The mapping function $f$ is defined as follows:

$$f(x;\mathbf{W})=\arg\max_{y\in\mathcal{Y}}F(x,y;\mathbf{W}),\tag{2}$$

where the function $F:\mathcal{X}\times\mathcal{Y}\rightarrow\mathcal{R}$ is the bilinear compatibility function to associate image features and label embeddings defined as follows:

$$F(x,y;\mathbf{W})=\theta(x)^{T}\mathbf{W}\varphi(y),\tag{3}$$

where $\theta\left(x\right)$ is the image features, $\varphi\left(y\right)$ is the label embedding (i.e. attribute representation).
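As a minimal sketch of the bilinear compatibility function and the arg-max prediction rule described above, the following uses plain Python lists. The toy matrix `W`, the two class names and their two-dimensional embeddings are illustrative assumptions, not values from the paper.

```python
from typing import Dict, Sequence


def compatibility(theta_x: Sequence[float],
                  W: Sequence[Sequence[float]],
                  phi_y: Sequence[float]) -> float:
    """Bilinear score F(x, y; W) = theta(x)^T W phi(y).

    W is stored row-wise with shape (d, N_a): one row per feature dimension.
    """
    return sum(
        theta_x[i] * W[i][j] * phi_y[j]
        for i in range(len(theta_x))
        for j in range(len(phi_y))
    )


def predict(theta_x: Sequence[float],
            W: Sequence[Sequence[float]],
            class_embeddings: Dict[str, Sequence[float]]) -> str:
    """Return the label whose embedding maximizes the compatibility score."""
    return max(
        class_embeddings,
        key=lambda y: compatibility(theta_x, W, class_embeddings[y]),
    )


# Hypothetical toy setup: identity W, two classes with one-hot embeddings.
W = [[1.0, 0.0],
     [0.0, 1.0]]
class_embeddings = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}

prediction = predict([0.2, 0.9], W, class_embeddings)
print(prediction)
```

In a real ZSL model, `W` would be learned by minimizing the regularized empirical risk on the seen classes, and `class_embeddings` would hold the attribute vectors of the unseen classes at test time.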

We summarize some frequently used notations in Table 1.

3.2 Interpretation of ZSL Task

In traditional ZSL models, all the attributes are assumed to be effective and are treated equally. However, some previous works pointed out that not all the attributes are useful and significant for zero-shot classification (Jiang et al., 2017). To the best of our knowledge, there is no theoretical analysis of the generalization performance of ZSL tasks, let alone of selecting informative attributes for unseen classes. To fill this gap, we first derive the generalization error bound for ZSL models.

The intuition of our theoretical analysis is to treat the attributes as a kind of error correcting output code. The prediction of a ZSL task can then be viewed as assigning the class label whose pre-defined codeword is closest to the predicted codeword, as in the ECOC problem (Rocha et al., 2014). Based on this novel interpretation, we derive a theoretical generalization error bound of the ZSL model, as shown in Section 4. From the generalization bound analyses, we find that the discriminating power of attributes governs the performance of the ZSL model.

3.3 Deficiency of ZSLAS

Some attribute selection works have been proposed in recent years. Guo et al. (2018) proposed the ZSLAS model that selects attributes based on the distributive entropy and the predictability of attributes using training data. Simultaneously considering the ZSL model loss function and attribute properties in a joint optimization framework, they selected attributes by minimizing the following loss function:

$$L(y,f(x;\mathbf{s},\mathbf{W}))=\frac{1}{N_{s}}\sum_{n=1}^{N_{s}}\Big\{l_{\mathrm{ZSL}}(y_{n},f(x_{n};\mathbf{s},\mathbf{W}))+\alpha\,l_{p}(\theta(x_{n}),\varphi(y_{n});\mathbf{s})-\beta\,l_{v}(\theta(x_{n}),\mu;\mathbf{s})\Big\},\tag{4}$$

where $\mathbf{s}$ is the weight vector of the attributes, which will be further used for attribute selection. $\theta(\cdot)$ is the attribute classifier, $\varphi(y_{n})$ is the attribute representation, and $\mu$ is an auxiliary parameter. $l_{\mathrm{ZSL}}$ is the model-based loss function for ZSL, i.e. $l(\cdot)$ as defined in Eq. (1). $l_{p}$ is the attribute prediction loss, which can be defined based on specific ZSL models, and $l_{v}$ is the loss of variance, which measures the distributive entropy of attributes (Guo et al., 2018). After obtaining the weight vector $\mathbf{s}$ by optimizing Eq. (4), attributes can be selected according to $\mathbf{s}$ and then be used to construct the ZSL model.
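The text above does not spell out how attributes are chosen from the weight vector $\mathbf{s}$. The sketch below assumes one simple possibility, keeping the `k` attributes with the largest weights and restricting each class attribute vector to that subset. The helper names `select_top_k` and `restrict`, the top-`k` rule and the toy weights are all hypothetical and are not taken from ZSLAS.

```python
from typing import List, Sequence


def select_top_k(s: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest weights in s, returned in ascending index order.

    The top-k rule is an assumption made for illustration only.
    """
    ranked = sorted(range(len(s)), key=lambda m: s[m], reverse=True)
    return sorted(ranked[:k])


def restrict(phi_y: Sequence[float], selected: Sequence[int]) -> List[float]:
    """Keep only the selected attributes of one class attribute vector."""
    return [phi_y[m] for m in selected]


# Hypothetical learned weights for four attributes.
weights = [0.9, 0.1, 0.5, 0.7]
selected = select_top_k(weights, k=2)
reduced = restrict([1, 0, 1, 1], selected)
print(selected, reduced)
```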

According to our theoretical analyses in Section 4, ZSLAS can improve the original ZSL model to some extent (Guo et al., 2018). However, ZSLAS suffers from the drawback that the attributes are selected based on the training data. Since the training and test classes are disjoint in ZSL tasks, it is difficult to measure the quality and contribution of attributes with regard to discriminating the unseen test classes. Thus, the attributes selected by ZSLAS have poor generalization capability to the test data due to the domain shift problem.

3.4 Definition of Out-of-the-box

Since previous attribute selection models are conducted based on the bounded in-the-box data, the selected attributes have poor generalization capability to the test data. However, the test data is unavailable during the training stage. Inspired by learning from pseudo relevance feedback (Miao et al., 2016), we introduce pseudo data, which lies outside the box of the training data, to mimic the test classes and guide the attribute selection. Considering that the training data is bounded in the box by attributes, we generate the out-of-the-box data using an attribute-guided generative model. Since the out-of-the-box data is generated based on the same attribute representation as the test classes, the box of the generated data will overlap with the box of the test data. Consequently, the key attributes selected by the proposed IAS model based on the out-of-the-box data can be effectively generalized to the unseen test data.

4 Generalization Bound Analysis

In this section, we first derive the generalization error bound of the original ZSL model and then analyze how the bound changes after attribute selection. In previous works, some generalization error bounds have been presented for the ZSL task. Romera-Paredes et al. (2015) transformed the ZSL problem into a domain adaptation problem and then analyzed the risk bounds for domain adaptation. Stock et al. (2018) considered the ZSL problem as a specific setting of pairwise learning and analyzed the bound using the kernel ridge regression model. However, these bound analyses are not suitable for ZSL models due to their assumptions. In this work, we derive the generalization bound from the perspective of the ECOC model, which is more similar to the ZSL problem.

4.1 Generalization Error Bound of ZSL

Zero-shot classification is an effective way to recognize new objects which have no training samples available. The basic framework of a ZSL model uses the attribute representation as the bridge to transfer knowledge from seen objects to unseen objects. To simplify the analysis, we consider ZSL as a multi-class classification problem. Therefore, the ZSL task can be addressed via an ensemble method which combines many binary attribute classifiers. Specifically, we pre-train a binary classifier for each attribute separately in the training stage. To classify a new sample, all the attribute classifiers are evaluated to obtain an attribute codeword (a vector in which each element represents the output of an attribute classifier). Then we compare the predicted codeword to the attribute representations of all the test classes to retrieve the label of the test sample.
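The codeword pipeline described above can be sketched as follows, assuming each attribute classifier outputs a score in [0, 1] that is thresholded at 0.5 and that the codeword is matched to the nearest class by counting mismatched attributes. The threshold, the class names and the four-attribute vectors are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def predict_codeword(scores: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Binarize the outputs of the per-attribute classifiers into a codeword."""
    return [1 if s >= threshold else 0 for s in scores]


def mismatches(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions at which two binary attribute vectors differ."""
    return sum(1 for u, v in zip(a, b) if u != v)


def decode(codeword: Sequence[int],
           class_attributes: Dict[str, Sequence[int]]) -> str:
    """Return the class whose attribute vector is closest to the codeword."""
    return min(
        class_attributes,
        key=lambda y: mismatches(codeword, class_attributes[y]),
    )


# Hypothetical attribute vectors of two unseen classes.
class_attributes = {"zebra": [1, 0, 1, 0], "whale": [0, 1, 0, 1]}

# Hypothetical classifier scores; the third attribute is predicted wrongly for a zebra.
codeword = predict_codeword([0.9, 0.2, 0.4, 0.1])
label = decode(codeword, class_attributes)
print(codeword, label)
```

Here one attribute classifier makes a mistake, yet the nearest class is still the intended one, which is the error-correcting behavior that the analysis in this section formalizes.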

To analyze the generalization error bound of ZSL, we first define some distances in the attribute space, and then present a proposition of the error correcting ability of attributes.

Definition 1 (Generalized Attribute Distance).

Given the attribute matrix $\mathbf{A}$ for associating labels and attributes, let $\mathbf{a}_{i}$, $\mathbf{a}_{j}$ denote the attribute representations of labels $y_{i}$ and $y_{j}$ in matrix $\mathbf{A}$ with length $N_{a}$, respectively. Then the generalized attribute distance between $\mathbf{a}_{i}$ and $\mathbf{a}_{j}$ can be defined as

$$d(\mathbf{a}_{i},\mathbf{a}_{j})=\sum_{m=1}^{N_{a}}\Delta(\mathbf{a}_{i}^{(m)},\mathbf{a}_{j}^{(m)}),\tag{5}$$

where $N_{a}$ is the number of attributes, and $\mathbf{a}_{i}^{(m)}$ is the $m^{th}$ element in the attribute representation $\mathbf{a}_{i}$ of the label $y_{i}$. $\Delta(\mathbf{a}_{i}^{(m)},\mathbf{a}_{j}^{(m)})$ is equal to $1$ if $\mathbf{a}_{i}^{(m)}\neq\mathbf{a}_{j}^{(m)}$, and $0$ otherwise.

We further define the minimum distance between any two attribute representations in the attribute space.

Definition 2 (Minimum Attribute Distance).

The minimum attribute distance $\tau$ of matrix $\mathbf{A}$ is the minimum distance between any two attribute representations $\mathbf{a}_{i}$ and $\mathbf{a}_{j}$ as follows:

[TABLE]
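Definitions 1 and 2 can be illustrated with a short Python sketch. It assumes binary attributes; the toy matrix `A` and the function names are hypothetical and only serve the illustration.

```python
from itertools import combinations

import numpy as np


def attribute_distance(a_i, a_j):
    """Generalized attribute distance: number of positions where two codewords differ."""
    return int(np.sum(np.asarray(a_i) != np.asarray(a_j)))


def minimum_attribute_distance(A):
    """Minimum attribute distance tau: smallest distance over all pairs of rows of A."""
    return min(
        attribute_distance(A[i], A[j])
        for i, j in combinations(range(len(A)), 2)
    )


# Toy label-attribute matrix: 3 classes x 5 binary attributes (illustrative values).
A = np.array([
    [1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 1, 1],
])

tau = minimum_attribute_distance(A)
print(tau)  # 3
```

In this toy matrix the pairwise distances are 3, 3 and 4, so $\tau=3$.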

Given the definition of distance in the attribute space, we can prove the following proposition.

Proposition 1 (Error Correcting Ability (Zhou et al., 2019)).

Let $\mathbf{A}$ be the label-attribute correlation matrix and let $f(x)$ be the predicted attribute representation of an unseen test sample $x$ with known label $y$. If $x$ is incorrectly classified, then the distance between the predicted attribute representation $f(x)$ and the correct attribute representation $\mathbf{a}_{y}$ is greater than half of the minimum attribute distance $\tau$, i.e.

[TABLE]

Proof.

Suppose that the predicted attribute representation for test sample $x$ with correct attribute representation $\mathbf{a}_{y}$ is $f(x)$ , and the sample $x$ is incorrectly classified to the mismatched attribute representation $\mathbf{a}_{r}$ , where $r\in{\mathcal{Y}_{u}\setminus\{y\}}$ . Then the distance between $f(x)$ and $\mathbf{a}_{y}$ is greater than the distance between $f(x)$ and $\mathbf{a}_{r}$ , i.e.,

[TABLE]

Here, the distance between attribute representations can be expanded as an element-wise summation based on Eq. (5) as follows:

[TABLE]

Then, we have:

[TABLE]

where $(i)$ follows Eq. (9), $(ii)$ is based on the triangle inequality of distance metric (Zhou et al., 2019) and $(iii)$ follows Eq. (6). ∎

Proposition 1 shows that the predicted attribute representation need not exactly match the ground truth for each unseen test sample. As long as the distance between them is less than ${\tau}/{2}$, the ZSL model can correct the errors committed by some attribute classifiers and still make an accurate prediction.
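The error-correcting behaviour can be checked on a toy example. The sketch below assumes binary attributes and nearest-codeword decoding under the generalized attribute distance; the codebook `A` and the helper `decode` are hypothetical.

```python
import numpy as np


def decode(pred, A):
    """Index of the class codeword closest to pred in generalized attribute distance."""
    dists = np.sum(A != pred, axis=1)
    return int(np.argmin(dists))


A = np.array([
    [1, 0, 1, 0, 1, 0],  # class 0
    [0, 1, 0, 1, 0, 1],  # class 1
    [1, 1, 1, 1, 0, 0],  # class 2
])  # minimum attribute distance tau = 3

true_class = 0

one_error = A[true_class].copy()
one_error[3] ^= 1   # one attribute classifier is wrong: distance 1 < tau / 2

two_errors = one_error.copy()
two_errors[1] ^= 1  # two are wrong: distance 2 > tau / 2

label_one = decode(one_error, A)   # class 0, the error is corrected
label_two = decode(two_errors, A)  # class 2, the sample is misclassified
```

With one wrong attribute the prediction is still decoded to the true class, while the misclassified case has a distance of 2 to the true codeword, which exceeds $\tau/2=1.5$ as Proposition 1 requires.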

Based on the error-correcting ability of attributes established in Proposition 1, we can derive the generalization error bound for ZSL.

Theorem 1 (Generalization Error Bound of ZSL).

Given $N_{a}$ attribute classifiers, $f^{(1)},$ $f^{(2)},...,f^{(N_{a})}$ , trained on training set $D_{s}$ with label-attribute matrix $\mathbf{A}$ , the generalization error rate for the attribute-based ZSL model is upper bounded by

$$\frac{2N_{a}\bar{B}}{\tau},\qquad(11)$$

where $\bar{B}=\frac{1}{N_{a}}\sum_{m=1}^{N_{a}}B_{m}$ and $B_{m}$ is the upper bound of the prediction loss for the $m^{th}$ attribute classifier $f^{(m)}$ .

Proof.

According to Proposition 1, for any incorrectly classified test sample $x$ with label $y$ , the distance between the predicted attribute representation $f(x)$ and the true attribute representation $\mathbf{a}_{y}$ is greater than ${\tau}/{2}$ , i.e.,

[TABLE]

Let $k$ be the number of incorrect image classifications on the unseen test dataset $D_{u}=\{(x_{i},y_{i}),i=1,...,N_{u}\}$. Then we obtain:

[TABLE]

where $\bar{B}=\frac{1}{N_{a}}\sum_{m=1}^{N_{a}}B_{m}$ and $B_{m}$ is the upper bound of attribute prediction loss.

Hence, the generalization error rate ${k}/{N_{u}}$ is bounded by ${2N_{a}\bar{B}}/{\tau}$. ∎

Remark 1 (Generalization error bound is positively correlated to the average attribute prediction loss).

Theorem 1 shows that the generalization error bound of the attribute-based ZSL model depends on the number of attributes $N_{a}$, the minimum attribute distance $\tau$ and the average prediction loss $\bar{B}$ of all the attribute classifiers. According to Definitions 1 and 2, the minimum attribute distance $\tau$ is positively correlated with the number of attributes $N_{a}$, so the two largely offset each other in the ratio $N_{a}/\tau$. Therefore, the generalization error bound is mainly affected by the average prediction loss $\bar{B}$. Intuitively, inferior attributes with poor predictability cause a greater prediction loss $\bar{B}$; consequently, these attributes have a negative effect on ZSL performance and increase the generalization error rate.
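As a rough numerical illustration of Remark 1, the bound ${2N_{a}\bar{B}}/{\tau}$ can be evaluated before and after dropping a poorly predictable attribute. The per-attribute loss bounds below are made up, the function name is hypothetical, and $\tau$ is held fixed purely as a simplifying assumption (removing attributes can also change $\tau$).

```python
def zsl_error_bound(loss_bounds, tau):
    """Evaluate 2 * N_a * B_bar / tau, where B_bar is the mean of the per-attribute bounds."""
    n_a = len(loss_bounds)
    b_bar = sum(loss_bounds) / n_a
    return 2 * n_a * b_bar / tau


# Hypothetical loss bounds B_m; the last attribute is poorly predictable.
all_attrs = [0.02, 0.02, 0.03, 0.03, 0.30]
kept_attrs = all_attrs[:4]
tau = 2  # assumed unchanged for the sake of the illustration

bound_all = zsl_error_bound(all_attrs, tau)    # about 0.4
bound_kept = zsl_error_bound(kept_attrs, tau)  # about 0.1
```

Under these assumptions, removing the single inferior attribute lowers $\bar{B}$ from about 0.08 to about 0.025 and the bound from about 0.4 to about 0.1.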

4.2 Improvement of Generalization after Attribute Selection

The previous section showed that the generalization error bound of the ZSL model is affected by the average prediction loss $\bar{B}$. In this section, we use a PAC-style (Valiant, 1984) analysis to prove that attribute selection can reduce the average prediction loss $\bar{B}$, and consequently reduce the generalization error bound of ZSL.

Lemma 1 (PAC bound of ZSL (Palatucci et al., 2009)).

Given $N_{a}$ attribute classifiers, to obtain, with probability $(1-\delta)$, attribute classifiers that make at most $k_{a}$ incorrect attribute predictions, the PAC bound $D$ of the attribute-based ZSL model is:

[TABLE]

where $d$ is the dimension of the image features.

Remark 2 (The average attribute prediction loss is positively correlated to the PAC bound).

Here, $k_{a}/N_{a}$ is the tolerable prediction error rate of the attribute classifiers. By the definition of the average attribute prediction loss $\bar{B}$, a ZSL model with a smaller $\bar{B}$ can tolerate a greater $k_{a}/N_{a}$. From Lemma 1, the PAC bound $D$ is monotonically increasing with respect to $N_{a}/k_{a}$. Hence, the PAC bound $D$ decreases when $N_{a}/k_{a}$ decreases, and consequently the average prediction loss $\bar{B}$ decreases.

Lemma 2 (Test Error Bound (Vapnik, 2013)).

Suppose that the PAC bound of the attribute-based ZSL model is $D$. The probability that the test error stays within an upper bound is given by:

[TABLE]

where $N_{s}$ is the size of the training set, $0\leq\eta\leq 1$ , and $e_{ts}$ , $e_{tr}$ are the test error and the training error respectively.

Remark 3 (PAC bound is positively correlated to the test error bound).

From Lemma 2, the PAC bound affects the probabilistic upper bound on the test error. Specifically, to obtain a small test error with high probability, the PAC bound should be small. In other words, a model with a smaller PAC bound has a smaller test error bound.

Proposition 2 (Bound Change after Attribute Selection).

For the attribute-based ZSL model, attribute selection can decrease the generalization error bound.

Proof.

In attribute selection, the key attributes are selected by minimizing the loss function in Eq. (1) on the out-of-the-box data. Since the generated out-of-the-box data has a similar distribution to the test data, the test error of ZSL decreases after attribute selection, i.e. ZSLIAS has a smaller test error bound than the original ZSL model. Based on Remark 3, we can therefore infer that ZSLIAS has a smaller PAC bound. According to Remark 2, the average prediction loss $\bar{B}$ decreases after attribute selection. As a consequence, based on Remark 1, the generalization error bound of ZSLIAS is smaller than that of the original ZSL model. ∎

Proposition 2 indicates that the generalization error of the ZSL model decreases after adopting the proposed IAS. In other words, ZSLIAS has a smaller classification error rate than the original ZSL method when generalizing to the unseen test data.

5 IAS with Out-of-the-box Data

Motivated by the generalization bound analyses, we select the key attributes based on the out-of-the-box data. In this section, we first present the proposed iterative attribute selection model. Then, we introduce the attribute-guided generative model designed to generate the out-of-the-box data. Finally, we give the complexity analysis of IAS.

5.1 Iterative Attribute Selection Model

Inspired by the idea of iterative machine teaching (Liu et al., 2017), we propose a novel iterative attribute selection model that iteratively selects attributes based on the generated out-of-the-box data. Firstly, we generate the out-of-the-box data to mimic test classes by an attribute-based generative model. Then, the key attributes are selected in an iterative manner based on the out-of-the-box data. After obtaining the selected attributes, we can consider them as a more efficient semantic representation to improve the original ZSL model.

Given the generated out-of-the-box data $D_{g}=\{(x_{n},y_{n}),n=1,...,N_{g}\}$, we can combine the empirical risk in Eq. (1) with the attribute selection model. The loss function is then rewritten as follows:

[TABLE]

where $\mathbf{s}\in\{0,1\}^{N_{a}}$ is the indicator vector for the attribute selection, in which $s_{i}=1$ if the $i^{th}$ attribute is selected and $0$ otherwise. ${N_{a}}$ is the number of all the attributes.

Correspondingly, the mapping function $f$ in Eq. (2) and the compatibility function $F$ in Eq. (3) can be rewritten as follows:

[TABLE]

where $\circ$ is the element-wise product operator (Hadamard product) and $\mathbf{s}$ is the selection vector defined in Eq. (16).

To solve the optimization problem in Eq. (16), we need to specify the choice of the loss function $l\left(\cdot\right)$. The loss function in Eq. (16) for a single sample $(x_{n},y_{n})$ is expressed as follows (Xian et al., 2018):

[TABLE]

where $\mathcal{Y}_{g}$ is the label set of the generated out-of-the-box data, which is the same as $\mathcal{Y}_{u}$; $\triangle(y_{n};y)=0$ if $y_{n}=y$ and $1$ otherwise; and $r_{ny}\in[0,1]$ is the weight defined in specific ZSL methods.

Since the dimension of the optimal attribute subset (i.e. the $l_{0}$-norm of $\mathbf{s}$) is unknown in advance, finding the optimal $\mathbf{s}$ is an NP-complete (Garey et al., 1974) problem. Therefore, inspired by the idea of iterative machine teaching (Liu et al., 2017), we adopt a greedy algorithm (Cormen et al., 2009) to optimize the loss function in an iterative manner. Eq. (16) is updated during each iteration as follows:

[TABLE]

The constraints on $\mathbf{s}$ ensure that exactly one element of $\mathbf{s}^{t}$ changes (from 0 to 1) during each iteration, which means that only one attribute is selected each time. $\mathbf{s}^{0}$ is the initial vector of all 0’s.

Correspondingly, the loss function in Eq. (20) for single sample $(x_{n},y_{n})$ gets updated during each iteration as follows:

[TABLE]

Here $l^{t+1}$ is subject to the same constraints as Eq. (20).

To minimize the loss function in Eq. (20), we alternately optimize $\mathbf{W}^{t+1}$ and $\mathbf{s}^{t+1}$, optimizing one variable while fixing the other. In each iteration, we first optimize $\mathbf{W}^{t+1}$ via the gradient descent algorithm (Burges et al., 2005). The gradient of Eq. (20) is calculated as follows:

[TABLE]

where

[TABLE]

where $\alpha$ is the regularization parameter.

After updating $\mathbf{W}^{t+1}$, we traverse all the elements of $\mathbf{s}^{t}$ that are equal to $0$ and set each of them to 1 in turn, which yields one candidate vector per unselected attribute. Then $\mathbf{s}^{t+1}$ is set to the candidate that achieves the minimal loss of Eq. (20):

[TABLE]

When iterations end and $\mathbf{s}$ is obtained, we can easily get the subset of key attributes by selecting the attributes corresponding to the elements equal to 1 in the selection vector $\mathbf{s}$ .

The procedure of the proposed IAS model is given in Algorithm 1.
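The greedy update of $\mathbf{s}$ can be sketched in Python as follows. This is a simplified illustration rather than the authors' Algorithm 1: it assumes a bilinear compatibility score of the form $\theta(x)^{\top}\mathbf{W}(\mathbf{s}\circ\varphi(y))$, keeps $\mathbf{W}$ fixed instead of re-optimizing it by gradient descent in each iteration, and uses a plain multi-class hinge loss as a stand-in for the method-specific loss. All function names, the stopping parameter `n_select` and the toy data are hypothetical.

```python
import numpy as np


def compat_scores(X, W, Phi, s):
    """Scores x^T W (s * phi(y)) for every sample (row of X) and class (row of Phi)."""
    return X @ W @ (Phi * s).T


def hinge_loss(X, y, W, Phi, s):
    """Mean multi-class hinge loss (stand-in for the loss of the underlying ZSL method)."""
    F = compat_scores(X, W, Phi, s)
    n = np.arange(len(y))
    margins = 1.0 + F - F[n, y][:, None]  # margin of 1 for every wrong class
    margins[n, y] = 0.0                   # no penalty for the true class
    return float(np.mean(np.max(np.maximum(margins, 0.0), axis=1)))


def greedy_select(X, y, W, Phi, n_select):
    """Add one attribute per iteration: the one whose inclusion gives the lowest loss."""
    s = np.zeros(Phi.shape[1])
    history = []
    for _ in range(n_select):
        best_i, best_loss = None, np.inf
        for i in np.flatnonzero(s == 0):  # try each unselected attribute in turn
            cand = s.copy()
            cand[i] = 1.0
            loss = hinge_loss(X, y, W, Phi, cand)
            if loss < best_loss:
                best_i, best_loss = int(i), loss
        s[best_i] = 1.0
        history.append((best_i, best_loss))
    return s, history


# Toy data: 2 classes, 3 attributes; attribute 2 is shared and not discriminative.
Phi = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
X = Phi.copy()        # one "generated" sample per class, features equal to attributes
y = np.array([0, 1])
W = np.eye(3)

s, history = greedy_select(X, y, W, Phi, n_select=2)
print(s)        # [1. 1. 0.]
print(history)  # [(0, 0.5), (1, 0.0)]
```

On this toy problem the two discriminative attributes are selected and the shared attribute is left out. A fuller version would update $\mathbf{W}$ before scoring the candidates in each iteration, as described above.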

5.2 Generation of Out-of-the-box Data

To select the discriminative attributes for test classes, we should ideally perform attribute selection on the test data. Since the training data and the test data are located in different boxes bounded by the attributes, we adopt an attribute-based generative model (Bucher et al., 2017) to generate out-of-the-box data that mimics the test classes. Compared with ZSLAS, the key attributes selected by IAS based on the out-of-the-box data generalize more effectively to the test data.

Conditional variational autoencoder (CVAE) (Sohn et al., 2015) is a conditional generative model in which the latent codes and generated data are both conditioned on some extra information. In this work, we propose the attribute-based variational autoencoder (AVAE), a special version of CVAE with tailor-made attributes, to generate the out-of-the-box data.

VAE (Kingma et al., 2013) is a directed graphical model with certain types of latent variables. The generative process of VAE is as follows: a set of latent codes $z$ is generated from the prior distribution $p(z)$, and the data $x$ is generated by the generative distribution $p(x|z)$ conditioned on $z$, i.e. $z\sim p(z)$, $x\sim p(x|z)$. The empirical objective of VAE is expressed as follows (Sohn et al., 2015):

[TABLE]

where $z^{(l)}=g(x,\epsilon^{(l)})$ , $\epsilon^{(l)}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ . $q(z|x)$ is the recognition distribution which is reparameterized with a deterministic and differentiable function $g(\cdot,\cdot)$ (Sohn et al., 2015) . $\mathrm{KL}$ denotes the Kullback-Leibler divergence (Kullback, 1987) between the incorporated distributions. $L$ is the number of samples.

Combining with the condition, i.e. the attribute representation of labels, the empirical objective of the AVAE is defined as follows:

[TABLE]

where $z^{(l)}=g(x,\varphi(y),\epsilon^{(l)})$ , $\varphi\left(y\right)$ is the attribute representation of label $y$ .

In the encoding stage, for each training data point $x^{(i)}$, we estimate $q(z^{(i)}|x^{(i)},\varphi(y^{(i)}))=Q(z)$ using the encoder. In the decoding stage, given the concatenation of $\tilde{z}$ sampled from $Q(z)$ and the attribute representation $\varphi(y_{u})$, the decoder generates a new sample $x_{g}$ with the same attribute representation as the unseen class, $\varphi(y_{u})$.

The procedure of AVAE is illustrated in Figure 3. At training time, the attribute representation (of training classes) whose image is being fed in is provided to the encoder and decoder. To generate an image of a particular attribute representation (of test classes), we can just feed this attribute vector along with a random point in the latent space sampled from a standard normal distribution. The system no longer relies on the latent space to encode what object you are dealing with. Instead, the latent space encodes attribute information. Since the attribute representations of test classes are fed into the decoder at generating stage, the generated out-of-the-box data $D_{g}$ has a similar distribution to the test data.
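The generating stage described above can be sketched in a few lines of NumPy. The decoder here is a hypothetical stand-in (a fixed random linear map) for the trained decoder network, and the dimensions, function names and attribute vector are illustrative assumptions, not values from the paper.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def generate_for_class(decoder, attr, n_samples, latent_dim, rng):
    """Draw z ~ N(0, I), concatenate it with the class attribute vector, and decode."""
    z = rng.standard_normal((n_samples, latent_dim))
    cond = np.tile(attr, (n_samples, 1))
    return decoder(np.concatenate([z, cond], axis=1))


rng = np.random.default_rng(0)
latent_dim, n_attr, feat_dim = 4, 6, 8

# Hypothetical stand-in for a trained decoder: a fixed random linear map.
W_dec = rng.standard_normal((latent_dim + n_attr, feat_dim))


def decoder(h):
    return h @ W_dec


# Attribute representation of one unseen class (illustrative values).
attr_unseen = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

X_g = generate_for_class(decoder, attr_unseen, n_samples=5,
                         latent_dim=latent_dim, rng=rng)
print(X_g.shape)  # (5, 8)
```

`reparameterize` corresponds to the function $g(\cdot,\cdot)$ used by the encoder during training, while `generate_for_class` only needs the attribute vector of an unseen class and no test images.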

5.3 Complexity Analysis

Suppose that there are $N_{u}$ unseen samples belonging to $L$ test classes, and the number of all the attributes is $N_{a}$. The complexity of the original ZSL model is $\mathcal{O}_{\mathrm{ZSL}}\sim\mathcal{O}(N_{u}N_{a}L^{2})$. For the proposed ZSLIAS, the complexity of the training stage is $\mathcal{O}_{\mathrm{ZSLIAS}}\sim N_{a}(N_{a}+1)/2\cdot\mathcal{O}_{\mathrm{ZSL}}$, i.e. $\mathcal{O}(N_{u}N_{a}^{3}L^{2})$, and the complexity of the test stage is equal to $\mathcal{O}_{\mathrm{ZSL}}$, i.e. $\mathcal{O}(N_{u}N_{a}L^{2})$.
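The factor $N_{a}(N_{a}+1)/2$ is consistent with counting the candidate evaluations of a greedy search that adds one attribute per iteration until all attributes are selected. The short check below assumes exactly that reading; the function name is hypothetical.

```python
def candidate_evaluations(n_attr):
    """Count candidate subsets scored when one attribute is added per iteration."""
    total = 0
    for t in range(n_attr):
        total += n_attr - t  # iteration t still has n_attr - t unselected attributes
    return total


n_a = 85  # number of AwA attributes
print(candidate_evaluations(n_a))  # 3655
print(n_a * (n_a + 1) // 2)        # 3655
```

Each candidate evaluation costs one pass of the base ZSL model, which gives the $\mathcal{O}(N_{u}N_{a}^{3}L^{2})$ training complexity stated above.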

6 Experiments

To evaluate the performance of the proposed iterative attribute selection model, extensive experiments are conducted on four standard datasets with ZSL setting. In this section, we first compare the proposed approach with the state-of-the-art, and then give detailed analyses.

6.1 Experimental Settings

6.1.1 Dataset

We conduct experiments on four standard ZSL datasets: (1) Animals with Attributes (AwA) (Lampert et al., 2013), (2) attribute-Pascal-Yahoo (aPY) (Farhadi et al., 2009), (3) Caltech-UCSD Birds-200-2011 (CUB) (Wah et al., 2011), and (4) SUN Attribute Database (SUN) (Patterson et al., 2012). The overall statistics of these datasets are summarized in Table 2.

6.1.2 Dataset Split

Zero-shot learning assumes that training classes and test classes are disjoint. Actually, ImageNet, the dataset exploited to extract image features via deep neural networks, may include some test classes. Therefore, Xian et al. (2018) proposed a new dataset split (PS) ensuring that none of the test classes appears in the dataset used to train the extractor model. In this paper, we evaluate the proposed model using both splits, i.e., the original standard split (SS) and the proposed split (PS).

6.1.3 Image Feature

Deep neural network features are extracted for the experiments. Image features are extracted from the entire images for the AwA, CUB and SUN datasets, and from the bounding boxes mentioned in Farhadi et al. (2009) for the aPY dataset. The original ResNet-101 (He et al., 2016) pre-trained on ImageNet with 1K classes is used to calculate 2048-dimensional top-layer pooling units as image features.

6.1.4 Attribute Representation

Attributes are used as the semantic representation to transfer information from training classes to test classes. We use 85, 64, 312 and 102-dimensional continuous value attributes for AwA, aPY, CUB and SUN datasets, respectively.

6.1.5 Evaluation protocol

Unified dataset splits shown in Table 2 are used for all the compared methods to obtain fair comparison results. Since the datasets are not well balanced with respect to the number of images per class (Xian et al., 2018), we use the mean class accuracy, i.e. per-class averaged top-1 accuracy, as the evaluation criterion. Mean class accuracy is calculated as follows:

[TABLE]

where $L$ is the number of test classes, $\mathcal{Y}_{u}$ is the set comprised of all the test labels.
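The per-class averaged top-1 accuracy can be computed as in the following Python sketch. The function name and the toy labels are illustrative; the point is that each class contributes equally regardless of how many samples it has.

```python
import numpy as np


def mean_class_accuracy(y_true, y_pred):
    """Average of the per-class top-1 accuracies over the classes present in y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = [
        np.mean(y_pred[y_true == c] == c)  # accuracy restricted to class c
        for c in np.unique(y_true)
    ]
    return float(np.mean(per_class))


# Imbalanced toy example: 4 samples of class 0 and 2 samples of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

print(mean_class_accuracy(y_true, y_pred))  # 0.625
```

Here the per-class accuracies are 0.75 and 0.5, giving 0.625, whereas the plain per-sample accuracy would be 4/6.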

6.2 Comparison with the State-of-the-Art

To evaluate the effectiveness of the proposed iterative attribute selection model, we modify several recent ZSL baselines with the proposed IAS and compare them with the state-of-the-art.

We modify seven representative ZSL baselines to evaluate the IAS model, including three popular ZSL baselines (i.e. DAP (Lampert et al., 2013), LatEm (Xian et al., 2016) and SAE (Kodirov et al., 2017)) and four recent ZSL baselines (i.e. MFMR (Xu et al., 2017), GANZrl (Tong et al., 2018), fVG (Xian et al., 2019) and LLAE (Li et al., 2019)).

The improvement achieved on these ZSL baselines is summarized in Table 3. It can be observed that IAS significantly improves the performance of attribute-based ZSL methods. Specifically, the mean accuracies of these ZSL methods on the four datasets (i.e. AwA, aPY, CUB and SUN) increase by $11.09\%$, $15.97\%$, $9.10\%$ and $5.11\%$, respectively ($10.29\%$ on average) after using IAS. For DAP on the AwA and aPY datasets and LatEm on the AwA dataset, IAS improves the accuracy by more than $20\%$. Interestingly, SAE performs poorly on the aPY and CUB datasets, while its accuracy rises to an acceptable level (from $8.33\%$ to $38.53\%$, and from $24.65\%$ to $42.85\%$, respectively) with IAS. Even though the state-of-the-art baselines already perform well, IAS still improves them to some extent ($5.48\%$, $3.24\%$, $2.80\%$ and $3.64\%$ on average for MFMR, GANZrl, fVG and LLAE, respectively). These results demonstrate that the proposed iterative attribute selection model is sound and can effectively improve existing attribute-based ZSL methods. They also support the necessity and effectiveness of attribute selection for ZSL tasks.

As a similar work to ours, ZSLAS selects attributes based on the distributive entropy and the predictability of attributes. We therefore compare the improvements of IAS and ZSLAS on DAP and LatEm, respectively. In Table 3, it can be observed that ZSLAS improves existing ZSL methods, while IAS improves them by a larger margin ($2.15\%$ vs $10.61\%$ on average). Compared to ZSLAS, the advantages of ZSLIAS can be interpreted in two aspects. Firstly, ZSLIAS selects attributes in an iterative manner, hence it can select a better subset of key attributes than ZSLAS, which selects all attributes at once. Secondly, ZSLAS is conducted based on the training data, while ZSLIAS is conducted based on the out-of-the-box data, which has a similar distribution to the test data. Therefore, the attributes selected by ZSLIAS are more applicable and discriminative for test data. Experimental results demonstrate the significant superiority of the proposed IAS model over previous attribute selection models.

6.3 Detailed Analysis

In order to further understand the promising performance, we analyze the following experimental results in detail.

6.3.1 Evaluation on the Out-of-the-box Data

In the first experiment, we evaluate the out-of-the-box data generated by a tailor-made attribute-based deep generative model. Figure 4 shows the distribution of the out-of-the-box data and the real test data sampled from AwA dataset using t-SNE. Note that the out-of-the-box data in Figure 4(b) is generated only based on the attribute representation of unseen classes, and without extra information of any test images. It can be observed that the generated out-of-the-box data can capture a similar distribution to the real test data, which guarantees that the selected attributes can be effectively generalized to test data.

We also quantitatively evaluate the out-of-the-box data by calculating various pairwise distances between three distributions, i.e. the generated out-of-the-box data ($\mathcal{X}_{g}$), the unseen test data ($\mathcal{X}_{u}$) and the seen training data ($\mathcal{X}_{s}$). Table 4 shows the distribution distances measured by Wasserstein Distance (Vallender, 1974), KL Divergence (Kullback, 1987), Hellinger Distance (Beran, 1977) and Bhattacharyya Distance (Kailath, 1967), respectively. The distance between $\mathcal{X}_{g}$ and $\mathcal{X}_{u}$ is much smaller than the distance between $\mathcal{X}_{u}$ and $\mathcal{X}_{s}$, which means that the generated out-of-the-box data is much closer in distribution to the unseen test data than the seen training data is. Therefore, attributes selected based on the out-of-the-box data are more discriminative for test data than attributes selected based on training data.
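For reference, three of these distances can be written down for discrete distributions as in the sketch below. How the distances were estimated on the image features is not specified here, so the histogram setting, the toy probability vectors and the function names are assumptions made for illustration; the Wasserstein distance is omitted to keep the example dependency-free.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p * q))."""
    return float(-np.log(np.sum(np.sqrt(p * q))))


def hellinger_distance(p, q):
    """Hellinger distance, bounded in [0, 1]."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


# Toy histograms over three bins (illustrative values only).
p_u = np.array([0.50, 0.30, 0.20])  # unseen test data
p_g = np.array([0.45, 0.35, 0.20])  # generated out-of-the-box data
p_s = np.array([0.10, 0.20, 0.70])  # seen training data

for name, dist in [("KL", kl_divergence),
                   ("Bhattacharyya", bhattacharyya_distance),
                   ("Hellinger", hellinger_distance)]:
    print(name, dist(p_g, p_u), dist(p_s, p_u))
```

In this toy setting every measure assigns a smaller value to the pair of generated and unseen histograms than to the pair of seen and unseen histograms, which is the qualitative pattern reported in Table 4.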

We illustrate some generated images of unseen classes (i.e. panda and seal) and annotate them with the corresponding attribute representations, as shown in Figure 5. Numbers in black indicate the attribute representations of the labels of real test images. Numbers in red and green are the correct and the incorrect attribute values of generated images, respectively. We can see that the generated images have attribute representations similar to those of the test images. Therefore, the tailor-made attribute-based deep generative model can generate out-of-the-box data that captures a similar distribution to the unseen data.

6.3.2 Effectiveness of IAS

In the second experiment, we compare the performance of three ZSL methods (i.e. DAP, LatEm and SAE) after using IAS on four datasets, respectively. The accuracies with respect to the number of selected attributes are shown in Figure 6. On the AwA, aPY and SUN datasets, the performance of these three ZSL methods increases sharply when the number of selected attributes grows from $0$ to about $20\%$, and then reaches the peak. These results suggest that only about a quarter of the attributes are key attributes that are necessary and effective for classifying test objects. Figures 6(b) and 6(f) show an interesting result: SAE performs poorly on the aPY dataset with both SS and PS (the accuracy is less than $10\%$), while its performance is acceptable after using IAS (the accuracy is about $40\%$). These results demonstrate the effectiveness and robustness of IAS for ZSL tasks.

Furthermore, we modify DAP by using all the attributes ($\#84$), the selected attributes ($\#20$) and the remaining attributes ($\#64$) after attribute selection, respectively. The resulting confusion matrices of these three variants, evaluated on the AwA dataset with the proposed split setting, are illustrated in Figure 7. The numbers in the diagonal area (yellow patches) of the confusion matrices indicate the classification accuracy per class. IAS significantly improves DAP performance on most of the test classes, and the accuracies on some classes, such as horse, seal and giraffe, nearly double after using IAS. Even for objects that DAP can hardly recognize, like dolphin (the accuracy of DAP is $1.6\%$), we obtain an acceptable performance after using IAS (the accuracy of DAPIAS is $72.7\%$). The original DAP performs better than IAS only on the blue whale class. This is because the original DAP classifies most of the marine creatures (such as blue whale, walrus and dolphin) as blue whale, which increases the classification accuracy for that class while also increasing its false positive rate. More importantly, the confusion matrix of DAPIAS contains less noise (i.e. smaller numbers in the off-diagonal regions (white patches) of the confusion matrix) than that of DAP, which suggests that DAPIAS has less prediction uncertainty. In other words, adopting IAS can improve the robustness of attribute-based ZSL methods.

In Figure 7, the accuracy of using the selected attributes ($71.88\%$ on average) is significantly higher than the accuracy of using all the attributes ($46.23\%$ on average), while the accuracy of using the remaining attributes ($31.32\%$ on average) is very poor. These results suggest that the selected attributes are the key attributes for discriminating test data. The remaining attributes are of little use and can even have a negative impact on the ZSL system. Therefore, not all the attributes are effective for ZSL tasks, and we should select the key attributes to improve performance.

6.3.3 Interpretability of Selected Attributes

In the third experiment, we present the visualization results of attribute selection. As shown in Figure 6, ZSL methods obtain the best performance when selecting about $20\%$ of the attributes. Therefore, we illustrate the top $20\%$ key attributes selected by DAP, LatEm and SAE on the four datasets in Figure 8. The three rows in each figure are DAP, LatEm and SAE from top to bottom, and yellow bars indicate the attributes selected by the corresponding methods. We can see that the attribute subsets selected by different ZSL methods largely coincide for the same dataset, which demonstrates that the selected attributes are the key attributes for discriminating test data. Specifically, we enumerate the key attributes selected by the three ZSL methods on the AwA dataset in Table 5. Attributes in boldface are selected by all three ZSL methods, and attributes in italics are selected by any two of the three methods. It can be observed that 13 attributes ($65\%$) are selected by all three ZSL methods. The three attribute subsets selected by diverse ZSL models are very similar, which is further evidence that IAS is reasonable and useful for zero-shot classification.

7 Conclusion

We present a novel and effective iterative attribute selection model to improve existing attribute-based ZSL methods. In most of the previous ZSL works, all the attributes are assumed to be effective and treated equally. However, we notice that attributes have different predictability and discriminability for diverse objects. Motivated by this observation, we propose to select the key attributes to build the ZSL model. Since training classes and test classes are disjoint in ZSL tasks, we introduce the out-of-the-box data to mimic test data and guide the process of attribute selection. The out-of-the-box data generated by a tailor-made attribute-based deep generative model has a similar distribution to the test data. Hence, the attributes selected by IAS based on the out-of-the-box data can be effectively generalized to the test data. To evaluate the effectiveness of IAS, we conduct extensive experiments on four standard ZSL datasets. Experimental results demonstrate that IAS can effectively select the key attributes for ZSL tasks and significantly improve state-of-the-art ZSL methods.

In this work, we select the same attributes for all the unseen test classes. Obviously, this is not the globally optimal way to select attributes for diverse categories. In the future, we will consider a tailor-made attribute selection model that can select a dedicated subset of key attributes for each test class.

Acknowledgments

This work is supported in part by ARC under Grant LP150100671 and Grant DP180100106, in part by NSFC under Grant 61373063 and Grant 61872188, in part by the Project of MIIT under Grant E0310/1112/02-1, in part by the Collaborative Innovation Center of IoT Technology and Intelligent Systems of Minjiang University under Grant IIC1701, and in part by China Scholarship Council.

Bibliography

Airola, A., & Pahikkala, T. (2017). Fast Kronecker product kernel methods via generalized vec trick. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3374–3387.
Akata, Z., Perronnin, F., Harchaoui, Z., & Schmid, C. (2015). Label-embedding for image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7), 1425–1438.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, 5(3), 445–463.
Bucher, M., Herbin, S., & Jurie, F. (2017). Generating visual representations for zero-shot classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2666–2673).
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., & Hullender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine learning (ICML-05) (pp. 89–96).
Cheng, Y., Qiao, X., Wang, X., & Yu, Q. (2017). Random forest classifier for zero-shot learning based on relative attribute. IEEE Transactions on Neural Networks and Learning Systems, 29(5), 1662–1674.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to Algorithms. MIT Press.
Dietterich, T. G., & Bakiri, G. (1994). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
