Multi-view Vector-valued Manifold Regularization for Multi-label Image Classification

Yong Luo; Dacheng Tao; Chang Xu; Chao Xu; Hong Liu; Yonggang Wen

arXiv:1904.03921·stat.ML·April 9, 2019


TL;DR

This paper introduces a novel multi-view vector-valued manifold regularization method that effectively integrates multiple features and label relationships for improved multi-label image classification.

Contribution

It proposes MV³MR, a new framework that leverages matrix-valued kernels to exploit feature complementarity and label structure in multi-label image classification.

Findings

01

MV³MR outperforms existing methods on PASCAL VOC'07 and MIR Flickr datasets.

02

The method effectively captures the intrinsic local geometry of multi-view data.

03

Experimental results demonstrate significant accuracy improvements.



TABLE I: Performance evaluation on the two datasets

	VOC mAP $↑$ vs. #{labeled samples}			MIR mAP $↑$ vs. #{labeled samples}
Methods	100	200	500	100	200	500	Ranks
SVM_CAT	0.241 $\pm$ 0.011 (7)	0.288 $\pm$ 0.013 (7)	0.371 $\pm$ 0.007 (7)	0.281 $\pm$ 0.009 (7)	0.306 $\pm$ 0.007 (7)	0.352 $\pm$ 0.008 (7)	7
SVM_UNI	0.347 $\pm$ 0.018 (4.5)	0.424 $\pm$ 0.014 (5)	0.529 $\pm$ 0.006 (5)	0.302 $\pm$ 0.011 (5)	0.336 $\pm$ 0.013 (6)	0.400 $\pm$ 0.009 (6)	5.25
MLCS [18]	0.332 $\pm$ 0.017 (6)	0.412 $\pm$ 0.016 (6)	0.525 $\pm$ 0.007 (6)	0.289 $\pm$ 0.010 (6)	0.342 $\pm$ 0.011 (5)	0.424 $\pm$ 0.010 (5)	5.67
KLS_CCA [12]	0.347 $\pm$ 0.019 (4.5)	0.432 $\pm$ 0.014 (4)	0.536 $\pm$ 0.007 (4)	0.321 $\pm$ 0.009 (3.5)	0.369 $\pm$ 0.017 (2)	0.445 $\pm$ 0.009 (2.5)	3.42
MV³LSVM	0.412 $\pm$ 0.025 (1)	0.476 $\pm$ 0.015 (1)	0.555 $\pm$ 0.006 (1)	0.332 $\pm$ 0.013 (1)	0.376 $\pm$ 0.017 (1)	0.449 $\pm$ 0.008 (1)	1
SimpleMKL [16]	0.381 $\pm$ 0.024 (3)	0.453 $\pm$ 0.020 (3)	0.538 $\pm$ 0.011 (3)	0.321 $\pm$ 0.014 (3.5)	0.365 $\pm$ 0.017 (4)	0.444 $\pm$ 0.011 (2.5)	3.17
LpMKL [17]	0.391 $\pm$ 0.024 (2)	0.462 $\pm$ 0.012 (2)	0.540 $\pm$ 0.006 (2)	0.327 $\pm$ 0.010 (2)	0.367 $\pm$ 0.014 (3)	0.436 $\pm$ 0.008 (4)	2.5
	VOC mAUC $↑$ vs. #{labeled samples}			MIR mAUC $↑$ vs. #{labeled samples}
Methods	100	200	500	100	200	500	Ranks
SVM_CAT	0.744 $\pm$ 0.013 (7)	0.785 $\pm$ 0.006 (7)	0.832 $\pm$ 0.003 (7)	0.722 $\pm$ 0.008 (4)	0.745 $\pm$ 0.004 (6)	0.783 $\pm$ 0.004 (6)	6.17
SVM_UNI	0.783 $\pm$ 0.008 (3)	0.824 $\pm$ 0.009 (2.5)	0.870 $\pm$ 0.003 (2.5)	0.718 $\pm$ 0.011 (5)	0.742 $\pm$ 0.011 (7)	0.782 $\pm$ 0.006 (7)	4.5
MLCS [18]	0.773 $\pm$ 0.010 (5)	0.819 $\pm$ 0.010 (6)	0.869 $\pm$ 0.004 (4)	0.701 $\pm$ 0.012 (7)	0.749 $\pm$ 0.010 (5)	0.805 $\pm$ 0.005 (1.5)	4.42
KLS_CCA [12]	0.781 $\pm$ 0.009 (4)	0.824 $\pm$ 0.008 (2.5)	0.866 $\pm$ 0.003 (5)	0.737 $\pm$ 0.009 (2)	0.769 $\pm$ 0.010 (1.5)	0.805 $\pm$ 0.005 (1.5)	2.75
MV³LSVM	0.801 $\pm$ 0.011 (1)	0.835 $\pm$ 0.011 (1)	0.875 $\pm$ 0.004 (1)	0.741 $\pm$ 0.014 (1)	0.769 $\pm$ 0.012 (1.5)	0.802 $\pm$ 0.005 (3.5)	1.5
SimpleMKL [16]	0.769 $\pm$ 0.017 (6)	0.822 $\pm$ 0.013 (4.5)	0.870 $\pm$ 0.006 (2.5)	0.717 $\pm$ 0.013 (6)	0.753 $\pm$ 0.010 (4)	0.802 $\pm$ 0.005 (3.5)	4.42
LpMKL [17]	0.786 $\pm$ 0.008 (2)	0.822 $\pm$ 0.008 (4.5)	0.862 $\pm$ 0.005 (6)	0.732 $\pm$ 0.010 (3)	0.756 $\pm$ 0.010 (3)	0.795 $\pm$ 0.007 (5)	3.92
	VOC RL $↓$ vs. #{labeled samples}			MIR RL $↓$ vs. #{labeled samples}
Methods	100	200	500	100	200	500	Ranks
SVM_CAT	0.220 $\pm$ 0.008 (7)	0.183 $\pm$ 0.006 (7)	0.142 $\pm$ 0.003 (7)	0.165 $\pm$ 0.005 (3)	0.146 $\pm$ 0.004 (5)	0.126 $\pm$ 0.002 (5)	5.67
SVM_UNI	0.178 $\pm$ 0.008 (2)	0.143 $\pm$ 0.006 (3)	0.106 $\pm$ 0.003 (1.5)	0.549 $\pm$ 0.040 (7)	0.437 $\pm$ 0.022 (7)	0.177 $\pm$ 0.011 (7)	4.58
MLCS [18]	0.195 $\pm$ 0.007 (5)	0.155 $\pm$ 0.007 (5)	0.112 $\pm$ 0.004 (4)	0.173 $\pm$ 0.006 (5)	0.145 $\pm$ 0.005 (4)	0.115 $\pm$ 0.003 (2.5)	4.25
KLS_CCA [12]	0.183 $\pm$ 0.008 (3)	0.149 $\pm$ 0.006 (4)	0.122 $\pm$ 0.005 (5)	0.168 $\pm$ 0.013 (4)	0.143 $\pm$ 0.005 (3)	0.121 $\pm$ 0.004 (4)	3.83
MV³LSVM	0.170 $\pm$ 0.007 (1)	0.140 $\pm$ 0.007 (1)	0.108 $\pm$ 0.003 (3)	0.150 $\pm$ 0.006 (1)	0.130 $\pm$ 0.007 (1)	0.111 $\pm$ 0.003 (1)	1.33
SimpleMKL [16]	0.214 $\pm$ 0.017 (6)	0.142 $\pm$ 0.009 (2)	0.106 $\pm$ 0.004 (1.5)	0.155 $\pm$ 0.005 (2)	0.136 $\pm$ 0.006 (2)	0.115 $\pm$ 0.003 (2.5)	2.67
LpMKL [17]	0.186 $\pm$ 0.011 (4)	0.164 $\pm$ 0.010 (6)	0.137 $\pm$ 0.007 (6)	0.199 $\pm$ 0.014 (6)	0.181 $\pm$ 0.007 (6)	0.141 $\pm$ 0.004 (6)	5.67
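
The "Ranks" column appears to be the mean of the six per-setting ranks shown in parentheses, with tied methods sharing a fractional rank. The sketch below checks that reading against a few rows of the mAP block; the helper name `average_rank` is hypothetical, and the rank values are copied from the table above.

```python
from statistics import mean

# Per-setting ranks copied from the parenthesized values in the mAP block
# of Table I (VOC 100/200/500 labeled samples, then MIR 100/200/500).
map_ranks = {
    "SVM_UNI": [4.5, 5, 5, 5, 6, 6],
    "KLS_CCA": [4.5, 4, 4, 3.5, 2, 2.5],
    "MV3LSVM": [1, 1, 1, 1, 1, 1],
    "SimpleMKL": [3, 3, 3, 3.5, 4, 2.5],
}


def average_rank(ranks):
    """Mean rank over the six settings, rounded to two decimals."""
    return round(mean(ranks), 2)


summary = {method: average_rank(r) for method, r in map_ranks.items()}
```

Under this assumption the computed values agree with the last column of the mAP block for the rows listed.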

Equations

\underset{f\in\mathcal{H}_{k}}{\mathrm{argmin}}\ \frac{1}{l}\sum_{i=1}^{l}L(f,x_{i},y_{i})+\gamma_{A}\|f\|_{k}^{2}+\gamma_{I}\|f\|_{I}^{2}

\underset{f\in\mathcal{H}_{K}}{\mathrm{argmin}}\ \frac{1}{l}\sum_{i=1}^{l}L(f,x_{i},y_{i})+\gamma_{A}\|f\|_{K}^{2}+\gamma_{I}\langle\mathbf{f},\mathcal{M}\mathbf{f}\rangle_{\mathcal{Y}^{u+l}}

\langle(y_{1},\dots,y_{u+l}),(w_{1},\dots,w_{u+l})\rangle_{\mathcal{Y}^{u+l}}=\sum_{i=1}^{u+l}\langle y_{i},w_{i}\rangle_{\mathcal{Y}}

K(x_{i},x_{j})=k(x_{i},x_{j})\big(\gamma_{O}\mathcal{L}_{out}^{\dagger}+(1-\gamma_{O})I_{n}\big)

-\frac{1}{l\gamma_{A}}\big(J_{l}^{N}G^{k}+l\gamma_{I}\mathcal{L}G^{k}\big)AQ-A+\frac{1}{l\gamma_{A}}Y=0,

k(x,x^{\prime})=\sum_{v=1}^{V}\beta_{v}k_{v}(x,x^{\prime})

\underset{f\in\mathcal{H}_{K}}{\mathrm{argmin}}\ \frac{1}{l}\sum_{i=1}^{l}L(f,x_{i},y_{i})+\gamma_{A}\|f\|_{K}^{2}+\gamma_{I}\langle\mathbf{f},\mathcal{M}\mathbf{f}\rangle_{\mathcal{Y}^{u+l}}+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ \sum_{v=1}^{V}\beta_{v}=1,\ \beta_{v}\geq 0,\ \sum_{v=1}^{V}\theta_{v}=1,\ \theta_{v}\geq 0,\ v=1,\dots,V,

f^{*}(x)=\sum_{i=1}^{u+l}K(x,x_{i})a_{i},

\underset{f\in\mathcal{H}_{K},\beta,\theta}{\mathrm{argmin}}\ \frac{1}{nl}\sum_{i=1}^{l}\sum_{j=1}^{n}\big(1-y_{ij}f_{j}(x_{i})\big)_{+}+\gamma_{A}\|f\|_{K}^{2}+\gamma_{I}\langle\mathbf{f},\mathcal{M}\mathbf{f}\rangle_{\mathcal{Y}^{u+l}}+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ \sum_{v=1}^{V}\beta_{v}=1,\ \beta_{v}\geq 0,\ \text{and}\ \sum_{v=1}^{V}\theta_{v}=1,\ \theta_{v}\geq 0,\ \forall v,

\underset{\mathbf{a},b,\xi,\beta,\theta}{\mathrm{argmin}}\ \frac{1}{nl}\sum_{i=1}^{l}\sum_{j=1}^{n}\xi_{ij}+\gamma_{A}\mathbf{a}^{T}G\mathbf{a}+\gamma_{I}\mathbf{a}^{T}G\mathcal{M}G\mathbf{a}+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ y_{ij}\Big(\sum_{z=1}^{l+u}K_{j}(x_{i},x_{z})a_{z}+b_{j}\Big)\geq 1-\xi_{ij},\ \xi_{ij}\geq 0,\ \forall i,j,\ \sum_{v=1}^{V}\beta_{v}=1,\ \beta_{v}\geq 0,\ \text{and}\ \sum_{v=1}^{V}\theta_{v}=1,\ \theta_{v}\geq 0,\ \forall v,

\min F(\beta,\theta)\quad\mathrm{s.t.}\ \sum_{v=1}^{V}\beta_{v}=1,\ \beta_{v}\geq 0,\ \text{and}\ \sum_{v=1}^{V}\theta_{v}=1,\ \theta_{v}\geq 0,\ \forall v.

\underset{\mathbf{a},b,\xi}{\mathrm{argmin}}\ \frac{1}{nl}\sum_{i=1}^{l}\sum_{j=1}^{n}\xi_{ij}+\gamma_{A}\mathbf{a}^{T}G\mathbf{a}+\gamma_{I}\mathbf{a}^{T}G\mathcal{M}G\mathbf{a}+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ y_{ij}\Big(\sum_{z=1}^{l+u}K_{j}(x_{i},x_{z})a_{z}+b_{j}\Big)\geq 1-\xi_{ij},\ \xi_{ij}\geq 0,\ i=1,\dots,l,\ j=1,\dots,n,

\begin{split}W(\mathbf{a},\xi,b,\mu,\eta)&=\frac{1}{nl}\sum_{i=1}^{l}\sum_{j=1}^{n}\xi_{ij}+\frac{1}{2}\mathbf{a}^{T}(2\gamma_{A}G+2\gamma_{I}G\mathcal{M}G)\mathbf{a}-\sum_{i=1}^{l}\sum_{j=1}^{n}\eta_{ij}\xi_{ij}\\ &\quad-\sum_{i=1}^{l}\sum_{j=1}^{n}\mu_{ij}\Big(y_{ij}\big(\sum_{z=1}^{l+u}K_{j}(x_{i},x_{z})a_{z}+b_{j}\big)-1+\xi_{ij}\Big).\end{split}

\frac{\partial W}{\partial b_{j}}=0\Rightarrow\sum_{i=1}^{l}\mu_{ij}y_{ij}=0,\ j=1,\dots,n,\qquad\frac{\partial W}{\partial\xi_{ij}}=0\Rightarrow\frac{1}{nl}-\mu_{ij}-\eta_{ij}=0\Rightarrow 0\leq\mu_{ij}\leq\frac{1}{nl}.

\begin{split}W^{R}(\mathbf{a},\mu)&=\frac{1}{2}\mathbf{a}^{T}\big(2\gamma_{A}G+2\gamma_{I}G\mathcal{M}G\big)\mathbf{a}-\mathbf{a}^{T}GJ^{T}Y_{d}\mu+\mu^{T}\mathbf{1},\\ \mathrm{s.t.}\ &\sum_{i=1}^{l}\mu_{ij}y_{ij}=0,\ j=1,\ldots,n,\\ &0\leq\mu_{ij}\leq\frac{1}{nl},\ i=1,\ldots,l,\ j=1,\ldots,n,\end{split}

\mathbf{a}^{*}=(2\gamma_{A}I+2\gamma_{I}\mathcal{M}G)^{-1}J^{T}Y_{d}\mu^{*}.

\mu^{*}=\underset{\mu\in\mathbb{R}^{nl}}{\mathrm{argmax}}\ \mu^{T}\mathbf{1}-\frac{1}{2}\mu^{T}S\mu\quad\mathrm{s.t.}\ \sum_{i=1}^{l}\mu_{ij}y_{ij}=0,\ j=1,\dots,n,\ 0\leq\mu_{ij}\leq\frac{1}{nl},\ i=1,\dots,l,\ j=1,\dots,n,

W(\beta,\theta)=W^{R}(\mathbf{a}^{*},\mu^{*})+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ \sum_{v=1}^{V}\beta_{v}=1,\ \beta_{v}\geq 0;\ \sum_{v=1}^{V}\theta_{v}=1,\ \theta_{v}\geq 0,\ \forall v.

W(\beta)=\beta^{T}H\beta+\gamma_{B}\|\beta\|_{2}^{2}-h^{T}\beta\quad\mathrm{s.t.}\ \sum_{v}\beta_{v}=1,\ \beta_{v}\geq 0,\ v=1,\dots,V,

\beta_{i}^{*}=\frac{2\gamma_{B}(\beta_{i}+\beta_{j})+(h_{i}-h_{j})+2t_{ij}}{2(H_{ii}-H_{ji}-H_{ij}+H_{jj})+4\gamma_{B}},\quad\beta_{j}^{*}=\beta_{i}+\beta_{j}-\beta_{i}^{*},

\begin{cases}\beta_{i}^{*}=0,\ \beta_{j}^{*}=\beta_{i}+\beta_{j},&\text{if}\ 2\gamma_{B}(\beta_{i}+\beta_{j})+(h_{i}-h_{j})+2t_{ij}\leq 0,\\ \beta_{j}^{*}=0,\ \beta_{i}^{*}=\beta_{i}+\beta_{j},&\text{if}\ 2\gamma_{B}(\beta_{i}+\beta_{j})+(h_{j}-h_{i})+2t_{ji}\leq 0.\end{cases}

W(\theta)=s^{T}\theta+\gamma_{C}\|\theta\|_{2}^{2}\quad\mathrm{s.t.}\ \sum_{v}\theta_{v}=1,\ \theta_{v}\geq 0,\ v=1,\dots,V,

\begin{cases}\theta_{i}^{*}=0,\ \theta_{j}^{*}=\theta_{i}+\theta_{j},&\text{if}\ 2\gamma_{C}(\theta_{i}+\theta_{j})+(s_{j}-s_{i})\leq 0,\\ \theta_{j}^{*}=0,\ \theta_{i}^{*}=\theta_{i}+\theta_{j},&\text{if}\ 2\gamma_{C}(\theta_{i}+\theta_{j})+(s_{i}-s_{j})\leq 0,\\ \theta_{i}^{*}=\frac{2\gamma_{C}(\theta_{i}+\theta_{j})+(s_{j}-s_{i})}{4\gamma_{C}},\ \theta_{j}^{*}=\theta_{i}+\theta_{j}-\theta_{i}^{*},&\text{else}.\end{cases}
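
The pairwise update for $\theta$ above can be read as coordinate descent on $W(\theta)$: two weights are changed at a time while their sum is held fixed, so the simplex constraint is preserved. The following is a minimal sketch under that reading; the function names and the toy vector `s` are hypothetical and are not taken from the paper.

```python
def pairwise_theta_update(theta, s, i, j, gamma_c):
    """Update theta[i] and theta[j] in place, keeping their sum fixed."""
    total = theta[i] + theta[j]
    if 2 * gamma_c * total + (s[j] - s[i]) <= 0:
        # Unconstrained minimiser would make theta[i] negative: clip to 0.
        theta[i], theta[j] = 0.0, total
    elif 2 * gamma_c * total + (s[i] - s[j]) <= 0:
        # Unconstrained minimiser would make theta[j] negative: clip to 0.
        theta[i], theta[j] = total, 0.0
    else:
        theta[i] = (2 * gamma_c * total + (s[j] - s[i])) / (4 * gamma_c)
        theta[j] = total - theta[i]
    return theta


def objective(theta, s, gamma_c):
    """W(theta) = s^T theta + gamma_C * ||theta||_2^2."""
    linear = sum(si * ti for si, ti in zip(s, theta))
    return linear + gamma_c * sum(t * t for t in theta)


theta = [1 / 3, 1 / 3, 1 / 3]   # uniform start on the simplex
s = [0.9, 0.1, 0.5]             # toy linear costs, one per view
gamma_c = 1.0
before = objective(theta, s, gamma_c)
for _ in range(20):             # a few sweeps over all pairs
    for i in range(3):
        for j in range(i + 1, 3):
            pairwise_theta_update(theta, s, i, j, gamma_c)
after = objective(theta, s, gamma_c)
```

In this toy run the view with the smallest cost `s[v]` ends up with the largest weight, and the objective does not increase.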

k(x_{i},x_{j})=\exp\big(-\lambda^{-1}d(x_{i},x_{j})\big),

AP=\frac{\sum_{k}P(k)}{\#\{\text{positive samples}\}},

AP=\frac{1}{11}\sum_{r}P(r),

RL(f,x_{i})=\frac{1}{|Y_{i}|(P-|Y_{i}|)}\Big|\big\{(y_{1},y_{2})\mid f(x_{i},y_{1})\leq f(x_{i},y_{2}),\ y_{1}\in Y_{i},\ y_{2}\notin Y_{i}\big\}\Big|,
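
The ranking loss above counts, for one sample, the fraction of (relevant, irrelevant) label pairs in which the relevant label does not score strictly higher. A minimal sketch follows, assuming the scores $f(x_{i},y)$ are given as a plain list indexed by label; the function name `ranking_loss` is hypothetical.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs ranked in the wrong order.

    scores   -- predicted scores f(x_i, y), one entry per label (P entries)
    relevant -- set of label indices in Y_i
    """
    irrelevant = [y for y in range(len(scores)) if y not in relevant]
    if not relevant or not irrelevant:
        raise ValueError("need at least one relevant and one irrelevant label")
    wrong = sum(
        1
        for y1 in relevant
        for y2 in irrelevant
        if scores[y1] <= scores[y2]   # relevant label not ranked strictly above
    )
    return wrong / (len(relevant) * len(irrelevant))


scores = [0.9, 0.2, 0.6, 0.4]
loss_one_error = ranking_loss(scores, {0, 3})   # label 3 scores below label 2
loss_perfect = ranking_loss(scores, {0, 2})     # both relevant labels on top
```

Lower is better, which matches the $↓$ marker on the RL block of Table I.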

W_{ij} =



Full text

Multi-view Vector-valued Manifold Regularization for Multi-label Image Classification

Yong Luo, Dacheng Tao, Chang Xu, Chao Xu, Hong Liu, Yonggang Wen

Y. Luo, C. Xu, and C. Xu are with the Key Laboratory of Machine Perception (Ministry of Education), School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China. D. Tao is with the Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Jones Street, Ultimo, NSW 2007, Sydney, Australia. H. Liu is with the Engineering Lab on Intelligent Perception for Internet of Things, Shenzhen Graduate School, Peking University, China. (Email: [email protected]) Y. Wen is with the Division of Networks and Distributed Systems, School of Computer Engineering, Nanyang Technological University, Singapore. (Email: [email protected])

©2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Abstract

In computer vision, image datasets used for classification are naturally associated with multiple labels and comprised of multiple views, because each image may contain several objects (e.g. pedestrian, bicycle and tree) and is properly characterized by multiple visual features (e.g. color, texture and shape). Currently available tools ignore either the label relationship or the view complementarity. Motivated by the success of the vector-valued function that constructs matrix-valued kernels to explore the multi-label structure in the output space, we introduce multi-view vector-valued manifold regularization (MV3MR) to integrate multiple features. MV3MR exploits the complementary property of different features and discovers the intrinsic local geometry of the compact support shared by different features under the theme of manifold regularization. We conducted extensive experiments on two challenging but popular datasets, PASCAL VOC’07 (VOC) and MIR Flickr (MIR), and validated the effectiveness of the proposed MV3MR for image classification.

Index Terms:

Image classification, semi-supervised, multi-label, multi-view, manifold

I Introduction

A natural image can be summarized by several keywords (or labels). To conduct image classification by directly using binary classification methods [1, 2], it is necessary to assume that labels are independent, although most labels appearing in one image are related to one another. Examples are given in Fig. 1, where A1-A3 show a “person” riding a “motorbike”, B1-B3 indicate that “sea” usually co-occurs with “sky”, and C1-C3 show some “clouds” in the “sky”. This multi-label nature makes image classification intrinsically different from simple binary classification.

Moreover, different labels cannot be properly characterized by a single feature representation. For example, the color information (e.g. color histogram), shape cue (encoded in SIFT [3]) and global structure (e.g. GIST [4]) can effectively represent natural substances (e.g. sky, cloud and plant life), man-made objects (e.g. aeroplane, motorbike and TV-monitor) and scenes (e.g. seaside and indoor), respectively, but no single one of them can effectively describe all these concepts at once. Each visual feature encodes a particular property of images and characterizes a particular concept (label), so we treat each feature representation as a particular view for characterizing images. Fig. 1 (a)-(c) indicate that the SIFT representation is effective in describing a motorbike and GIST can capture the global structure of a person on the motorbike. Fig. 1 (d)-(f) show that GIST performs well in recognizing seaside scenes, while the color information can be used as a complementary aid for recognizing the blue sea water. From Fig. 1 (g)-(i) we can see that RGB usually represents cloud well and GIST is helpful when RGB fails. For example, the RGB representations of C1 and C3 are not very similar, but their GIST distance (0.22) is very small due to the sky scene structure. This multi-view nature distinguishes image classification from single-view tasks, such as texture segmentation [5] and face recognition [6].

The vector-valued function [7] has recently been introduced to resolve multi-label classification [8] and has been demonstrated to be effective in semantic scene annotation. This method naturally incorporates the label-dependencies into the classification model by first computing the graph Laplacian [9] of the output similarity graph, and then using this graph to construct a vector-valued kernel. This model is superior to most of the existing multi-label learning methods [10, 11, 12] because it naturally considers the label correlations and efficiently outputs all the predicted labels at one time.

Although the vector-valued function is effective for general multi-label classification tasks, it cannot directly handle image classification problems in which images are represented by multi-view features. A popular solution is to concatenate all the features into a long vector. However, this concatenation strategy not only ignores the physical interpretations of different features, but is also prone to over-fitting when training samples are limited.

We thus introduce multi-kernel learning (MKL) to the vector-valued function and present a multi-view vector-valued manifold regularization (MV3MR) framework for handling the multi-view features in multi-label image classification. MV3MR associates each view with a particular kernel, assigns a higher weight to the view/kernel carrying more discriminative information, and explores the complementary nature of different views.

In particular, MV3MR assembles the multi-view information through a large number of unlabeled images to discover the intrinsic geometry embedded in the high dimensional ambient space of the compact support of the marginal distribution. The local geometry, approximated by the adjacency graphs induced from multiple kernels of all the corresponding views, is more reliable than that approximated by the adjacency graph induced from a particular kernel of any corresponding view. In this way, MV3MR essentially improves the vector-valued function for multi-label image classification.

Because the hinge loss is more suitable for classification than the least squares loss [13], we derive an SVM (support vector machine) formulation of MV3MR which results in a multi-view vector-valued Laplacian SVM (MV3LSVM). We carefully design the MV3LSVM algorithm so that it determines the set of kernel weights in the learning process of the vector-valued function.

We thoroughly evaluate the proposed MV3LSVM algorithm on two challenging datasets, PASCAL VOC’07 [14] and MIR Flickr [15], by comparing it with a popular MKL algorithm [16], a recently proposed MKL method [17], and competitive multi-label learning algorithms for image classification, such as multi-label compressed sensing [18], canonical correlation analysis [12] and vector-valued manifold regularization [8], in terms of mean average precision (mAP), mean area under curve (mAUC) and Hamming loss (HL). The experimental results suggest the effectiveness of MV3LSVM.

The rest of the paper is organized as follows. Section II summarizes the recent work in multi-label learning, multi-kernel learning and image classification. In Section III, we introduce manifold regularization and its vector-valued generalization. We depict the proposed MV3MR framework and its SVM formulation in Section IV. Extensive experiments are presented in Section V and we conclude this paper in Section VI.

II Related Work

II-A Multi-label learning

Multi-label classification has received intensive attention in recent years. Some methods extend traditional multi-class algorithms to cope with the multi-label problem. AdaBoost.MH [19] adds the label value to the feature vector and then applies AdaBoost on weak classifiers. A ranking algorithm is presented in [20] by adopting the ranking loss as the cost function in SVM. ML-KNN [21] is an extension of the $k$ -nearest neighbor (KNN) algorithm to deal with multi-label data and canonical correlation analysis (CCA) has also recently been extended to the multi-label case by formulating it as a least-squares problem [12].

Other works concentrate on preprocessing the data so that standard binary or multi-class techniques can be utilized. For example, multiple labels of a sample belong to a subset of the whole label set and we can view this subset as a new class [22]. This may lead to a large number of classes and a more common strategy is to learn a binary classifier for each label [1, 2]. Considering that the labels are often sparse, a compressed sensing method is proposed for multi-label prediction [18].

Various approaches have been proposed to improve prediction accuracy by exploiting label correlations [23, 24, 11, 8]. Sun et al. [23] proposed the construction of a hypergraph to exploit the label dependencies. In [24], a common subspace is assumed to be shared among all labels, and the correlation information contained in different labels can be captured by learning this low-dimensional subspace. A max-margin method is proposed in [11], where the prior knowledge of the label correlations is incorporated explicitly in the multi-label classification model.

None of the approaches mentioned above consider the features to be used; however, an image with multiple labels usually indicates that it contains multiple objects. As far as we know, there is no single kind of feature that can describe a variety of objects very well. Therefore, how to combine different features is a critical issue in multi-label image classification and we consider MKL for this purpose in this paper.

II-B MKL: Multi-kernel learning

Classical kernel methods are usually based on a single kernel [25]. MKL [26], in which a kernel-based classifier and a convex combination of the kernels are learned simultaneously, has attracted much attention. Lanckriet et al. [26] introduced MKL for binary classification and solved it with semi-definite programming (SDP) techniques. The MKL problem was further developed by Sonnenburg et al. [27], who presented it as a semi-infinite linear program (SILP). In [16], MKL is reformulated by using a weighted L2-norm regularization to replace the mixed-norm regularization and adding an L1-norm constraint on the kernel weights. All of these MKL formulations are based on SVM and are not naturally designed for multi-label classification. The proposed MV3MR framework extends MKL to handle the multi-label problem and model label inter-dependencies.

II-C Image classification

Image classification has been widely used in many computer vision-related applications such as image retrieval and web content browsing. In recent years, more than a dozen methods have been proposed and representative works can be grouped into three categories.

$\bullet$ Single-view learning for image classification: This category contains many recent image classification schemes, e.g. dictionary learning [28] and spatial pyramid matching [29]. For example, Labusch et al. [30] proposed to integrate sparse-coding and local maximum operation to extract local features for handwritten digit recognition. In [31], a non-linear coding scheme was introduced for local descriptors such as SIFT. Yang et al. [32] explored the local co-occurrences of visual words over the spatial pyramid.

$\bullet$ Multi-view learning for image classification: Schemes in this category utilize the features from different views (or multi-view features) to boost image classification performance. In this paper, the term “views” refers to the different features or attributes used to depict the objects to be classified. Note that in some other vision and graphics applications, “views” means different spatial viewpoints [33, 34, 35]. A semi-supervised boosting algorithm is proposed in [36], in which images measured by different views are used to construct a prior and formulate a regularization term. Guillaumin et al. [2] combined 15 visual representations (e.g. SIFT, GIST and HSV) with the tag feature for semi-supervised image classification. The combination of visual and textual information has also been used for clustering [37] and web page classification [38].

$\bullet$ Multi-label learning for image classification: This category is motivated by the success of multi-label learning and has demonstrated promising image classification performance. For example, Bucak et al. [39] proposed a ranking based algorithm to tackle the multi-label problem with incompletely labeled data by introducing a group lasso regularizer in optimization. Unlike traditional multi-label methods that always consider positive label correlations, a novel approach is presented in [40] to make use of the negative relationship of categories.

Although it has been widely acknowledged that both multi-view representation and label inter-dependencies are important for multi-label image classification, most of the existing approaches do not take both of them into consideration. Most existing multi-view approaches assume that different views (features) contribute equally to label prediction. In contrast to these approaches, the proposed MV3MR naturally explores both the complementary property of multi-view features and the correlations of different labels under the manifold regularization scheme.

III Manifold Regularization and Vector-valued Generalization

This section briefly introduces the manifold regularization framework [9] and its vector-valued generalization [8]. Given a set of $l$ labeled examples $D_{l}=\{(x_{i},y_{i})\}_{i=1}^{l}$ and a relatively large set of $u$ unlabeled examples $D_{u}=\{x_{i}\}_{i=l+1}^{N}$ with $N=l+u$, we consider a non-parametric estimation of a vector-valued function $f:\mathcal{X}\mapsto\mathcal{Y}$, where $\mathcal{Y}=\mathbb{R}^{n}$ and $n$ is the number of labels. This setting includes $\mathcal{Y}=\mathbb{R}$ as a special case for regression and classification.

III-A Manifold regularization

Manifold learning has been widely used for capturing the local geometry [41] and conducting low-dimensional embedding [42, 43]. In manifold regularization, the data manifold is characterized by a nearest neighbor graph $\mathcal{W}$ , which explores the geometric structure of the compact support of the marginal distribution. The Laplacian $\mathcal{L}$ of $\mathcal{W}$ and the prediction $\mathbf{f}=[f(x_{1}),\ldots,f(x_{N})]$ are then formulated as a smoothness constraint $\|f\|_{I}^{2}=\mathbf{f}^{T}\mathcal{L}\mathbf{f}$ , where $\mathcal{L}=\mathcal{D}-\mathcal{W}$ and the diagonal matrix $\mathcal{D}$ is given by $\mathcal{D}_{ii}=\sum_{j=1}^{N}\mathcal{W}_{ij}$ . The manifold regularization framework minimizes the regularized loss

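$$\underset{f\in\mathcal{H}_{k}}{\mathrm{argmin}}\ \frac{1}{l}\sum_{i=1}^{l}L(f,x_{i},y_{i})+\gamma_{A}\|f\|_{k}^{2}+\gamma_{I}\|f\|_{I}^{2}\tag{1}$$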

where $L$ is a predefined loss function; $k$ is the standard scalar-valued kernel, i.e. $k:\mathcal{X}\times\mathcal{X}\mapsto\mathbb{R}$ and $\mathcal{H}_{k}$ is the associated reproducing kernel Hilbert space (RKHS). Here, $\gamma_{A}$ and $\gamma_{I}$ are trade-off parameters to control the complexities of $f$ in the ambient space and the compact support of the marginal distribution. The Representer theorem [9] ensures the solution of problem (1) takes the form $f^{*}(x)=\sum_{i=1}^{N}\alpha_{i}k(x,x_{i})$ , where $\alpha_{i}\in\mathbb{R}$ is the coefficient. Since a pair of close samples means that the corresponding conditional distributions are similar, the manifold regularization $\|f\|_{I}^{2}$ helps the function learning.
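
To make the smoothness term $\|f\|_{I}^{2}=\mathbf{f}^{T}\mathcal{L}\mathbf{f}$ concrete, the following sketch builds a nearest neighbor graph on a toy one-dimensional dataset and compares the penalty for a smooth and a rough prediction vector. It assumes binary edge weights and Euclidean distance, which the text does not specify; the function names are hypothetical.

```python
import numpy as np


def knn_adjacency(X, k):
    """Symmetric 0/1 k-nearest-neighbour adjacency matrix (Euclidean distance)."""
    n = X.shape[0]
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)           # a point is not its own neighbour
    W = np.zeros((n, n))
    nearest = np.argsort(dists, axis=1)[:, :k]
    for i in range(n):
        W[i, nearest[i]] = 1.0
    return np.maximum(W, W.T)                 # symmetrise the graph


def graph_laplacian(W):
    """Unnormalised Laplacian L = D - W with D_ii = sum_j W_ij."""
    return np.diag(W.sum(axis=1)) - W


# Two well-separated clusters on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
L = graph_laplacian(knn_adjacency(X, k=2))

f_smooth = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # constant on each cluster
f_rough = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])    # flips between neighbours

penalty_smooth = f_smooth @ L @ f_smooth
penalty_rough = f_rough @ L @ f_rough
```

Because $\mathbf{f}^{T}\mathcal{L}\mathbf{f}$ sums $(f_{i}-f_{j})^{2}$ over the graph edges, the prediction that is constant within each cluster incurs no penalty, while the one that changes sign between neighbors is penalized.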

III-B Vector-valued manifold regularization

In the vector-valued RKHS, where a kernel function $K$ is defined and the corresponding $\mathcal{Y}$ -valued RKHS is denoted by $\mathcal{H}_{K}$ , the optimization problem of the vector-valued manifold regularization (VVMR) is given by

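$$\underset{f\in\mathcal{H}_{K}}{\mathrm{argmin}}\ \frac{1}{l}\sum_{i=1}^{l}L(f,x_{i},y_{i})+\gamma_{A}\|f\|_{K}^{2}+\gamma_{I}\langle\mathbf{f},\mathcal{M}\mathbf{f}\rangle_{\mathcal{Y}^{u+l}}\tag{2}$$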

where $\mathcal{Y}^{u+l}$ is the $u+l$ -direct product of $\mathcal{Y}$ and the inner product takes the form

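$$\langle(y_{1},\dots,y_{u+l}),(w_{1},\dots,w_{u+l})\rangle_{\mathcal{Y}^{u+l}}=\sum_{i=1}^{u+l}\langle y_{i},w_{i}\rangle_{\mathcal{Y}}\tag{3}$$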

The function prediction is $\mathbf{f}=[f(x_{1}),\ldots,f(x_{u+l})]\in\mathcal{Y}^{u+l}$. The matrix $\mathcal{M}$ is a symmetric positive operator that satisfies $\langle\mathbf{y},\mathcal{M}\mathbf{y}\rangle\geq 0$ for all $\mathbf{y}\in\mathcal{Y}^{u+l}$ and is chosen to be $\mathcal{L}\otimes I_{n}$. Here, $\mathcal{L}$ is the graph Laplacian, $I_{n}$ is the $n\times n$ identity matrix and $\otimes$ denotes the Kronecker (tensor) matrix product. For $\mathcal{Y}=\mathbb{R}^{n}$, an entry $K(x_{i},x_{j})$ of the $n\times n$ vector-valued kernel matrix is defined by

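$$K(x_{i},x_{j})=k(x_{i},x_{j})\big(\gamma_{O}\mathcal{L}_{out}^{\dagger}+(1-\gamma_{O})I_{n}\big)\tag{4}$$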

where $k(\cdot,\cdot)$ is a scalar-valued kernel, $\gamma_{O}\in[0,1]$ is a parameter. Here, $\mathcal{L}_{out}^{\dagger}$ is the pseudo-inverse of the output labels’ graph Laplacian. The graph can be estimated by viewing each label as a vertex and using the nearest neighbors method. The representation of the $j$ ’th label is the $j$ ’th column in the label matrix $Y\in\mathbb{R}^{N\times n}$ , in which, $Y_{ij}=1$ if the $j$ ’th label is manually assigned to the $i$ ’th sample, and $-1$ otherwise. For the unlabeled samples, $Y_{ij}=0$ .
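
As a hedged illustration of how the matrix-valued kernel could be assembled, the sketch below builds a nearest-neighbor graph over the columns of a small label matrix, takes the pseudo-inverse of its Laplacian, and forms one $n\times n$ block $K(x_{i},x_{j})$. Binary edge weights, Euclidean distance between label columns, the toy label matrix and the function names are all assumptions made for this example.

```python
import numpy as np


def label_graph_laplacian(Y, k):
    """Laplacian of a k-NN graph whose vertices are the columns (labels) of Y."""
    cols = Y.T                                               # one row per label
    dists = np.linalg.norm(cols[:, None, :] - cols[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    n = cols.shape[0]
    W = np.zeros((n, n))
    nearest = np.argsort(dists, axis=1)[:, :k]
    for j in range(n):
        W[j, nearest[j]] = 1.0
    W = np.maximum(W, W.T)                                   # symmetrise
    return np.diag(W.sum(axis=1)) - W


def matrix_valued_kernel(k_ij, L_out, gamma_o):
    """Block K(x_i, x_j) = k(x_i, x_j) * (gamma_O * pinv(L_out) + (1 - gamma_O) * I_n)."""
    n = L_out.shape[0]
    Q = gamma_o * np.linalg.pinv(L_out) + (1.0 - gamma_o) * np.eye(n)
    return k_ij * Q


# Toy label matrix: 4 labeled samples, 3 labels, entries in {+1, -1}.
Y = np.array([
    [1.0, 1.0, -1.0],
    [1.0, 1.0, -1.0],
    [-1.0, -1.0, 1.0],
    [-1.0, 1.0, 1.0],
])
L_out = label_graph_laplacian(Y, k=1)
K_block = matrix_valued_kernel(0.5, L_out, gamma_o=0.5)
K_independent = matrix_valued_kernel(0.5, L_out, gamma_o=0.0)
```

With $\gamma_{O}=0$ the block reduces to $k(x_{i},x_{j})I_{n}$, which treats the labels independently; larger $\gamma_{O}$ couples the outputs through the label graph.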

It has been proved in [8] that the solution of the minimization problem (2) takes the form $f^{*}(x)=\sum_{i=1}^{N}K(x,x_{i})a_{i}$ . By choosing the Regularized Least Squares (RLS) loss $L(f,x_{i},y_{i})=(f(x_{i})-y_{i})^{2}$ , we can estimate the column vector $\mathbf{a}=\{a_{1},\ldots,a_{u+l}\}\in\mathbb{R}^{n(u+l)}$ with each $a_{i}\in\mathcal{Y}$ by solving a Sylvester equation:

[TABLE]

where $a=vec(A^{T})$ , $Q=\gamma_{O}\mathcal{L}_{out}^{\dagger}+(1-\gamma_{O})I_{n}$ , and $J_{l}^{N}$ is a diagonal matrix whose first $l$ diagonal entries are $1$ and whose remaining entries are $0$ . Here, $G^{k}$ is the Gram matrix of the scalar-valued kernel $k$ over the labeled and unlabeled data. We refer to [8] for a detailed description of the vector-valued Laplacian RLS.
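As a rough illustration of how $Q$ and the Kronecker structure fit together, the sketch below builds $Q=\gamma_{O}\mathcal{L}_{out}^{\dagger}+(1-\gamma_{O})I_{n}$ from a toy label graph and forms a vector-valued Gram matrix as $G^{k}\otimes Q$. The function names, the three-label chain graph and the Gaussian toy Gram matrix are assumptions made for illustration only; they are not the authors' implementation.

```python
import numpy as np

def output_matrix(L_out, gamma_o):
    """Q = gamma_O * pinv(L_out) + (1 - gamma_O) * I_n."""
    n = L_out.shape[0]
    return gamma_o * np.linalg.pinv(L_out) + (1.0 - gamma_o) * np.eye(n)

def vector_valued_gram(G_k, Q):
    """Kronecker product: block (i, j) of the result equals k(x_i, x_j) * Q."""
    return np.kron(G_k, Q)

# Toy label graph (assumption): n = 3 labels connected in a chain 0-1-2.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L_out = np.diag(W.sum(axis=1)) - W

# Toy scalar-valued Gram matrix over N = 4 points (Gaussian kernel, assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G_k = np.exp(-sq_dist)

Q = output_matrix(L_out, gamma_o=0.5)
G = vector_valued_gram(G_k, Q)   # shape (N * n, N * n) = (12, 12)
```

Setting `gamma_o=0` makes $Q$ the identity, in which case the labels are treated independently.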

IV MV3MR: Multi-view Vector-valued Manifold Regularization

To handle multi-view multi-label image classification, we generalize VVMR and present multi-view vector-valued manifold regularization (MV3MR). In contrast to [2], which assumes that different views contribute equally to the classification, MV3MR assumes that different views contribute to the classification differently and learns the combination coefficients to integrate different views.

Fig. 2 gives an illustrative example which suggests that different views contribute to the classification differently, and that learning the combination coefficients to integrate different views benefits the classification. Given five images from two classes, namely three cars of different colors (silvery white, blue and red) and two different sky images, the optimal Gram matrix $G_{opt}^{k}$ is shown on the right side for separating these images into two classes. On the left, there are four Gram matrices, which are two single Gram matrices $G_{1}^{k}$ , $G_{2}^{k}$ obtained from two different views, and their mean $G_{avg}^{k}$ , as well as their linear combination $G_{comb}^{k}$ with the learned coefficients. The figure indicates that $G_{comb}^{k}$ is closer to the optimal Gram matrix $G_{opt}^{k}$ than $G_{avg}^{k}$ .

Given a small number of labeled samples and a relatively large number of unlabeled samples, MV3MR first computes an output similarity graph by using the label information of the labeled samples. The Laplacian of the label graph is incorporated in the scalar-valued Gram matrix $G_{v}^{k}$ over labeled and unlabeled data to enforce label correlations on each view, and the vector-valued Gram matrices $G_{v}=G_{v}^{k}\otimes Q,v=1,\ldots,V$ can be obtained. Meanwhile, we also compute the vector-valued graph Laplacians $\mathcal{M}_{v},v=1,\ldots,V$ by using the features of the input data from different views. Then MV3MR learns the kernel combination coefficient $\beta_{v}$ for $G_{v}$ as well as the graph weight $\theta_{v}$ for $\mathcal{M}_{v}$ by alternating optimization. Finally, the combined Gram matrix $G$ together with the regularization on the combined manifold $M$ is used for classification. Fig. 3 summarizes the above procedure. Technical details are given below.

IV-A Rationality

Let $V$ be the number of views and $v$ be the view index. On the feature space of each view, we define the corresponding positive definite scalar-valued kernel $k_{v}$ , which is associated with an RKHS $\mathcal{H}_{k_{v}}$ . It follows from the functional framework [16] that by introducing a non-negative coefficient $\beta_{v}$ , the Hilbert space $\mathcal{H}_{k_{v}}^{\prime}=\{f|f\in\mathcal{H}_{k_{v}}:\frac{\|f\|_{\mathcal{H}_{k_{v}}}}{\beta_{v}}<\infty\}$ is an RKHS with kernel $k(x,x^{\prime})=\beta_{v}k_{v}(x,x^{\prime})$ . If we define $\mathcal{H}_{k}$ as the direct sum of the space $\mathcal{H}_{k_{v}}^{\prime}$ , i.e. $\mathcal{H}_{k}=\oplus_{v=1}^{V}\mathcal{H}_{k_{v}}^{\prime}$ , then $\mathcal{H}_{k}$ is an RKHS associated with the kernel

[TABLE]

Thus, any function in $\mathcal{H}_{k}$ is a sum of functions belonging to $\mathcal{H}_{k_{v}}$ . The vector-valued kernel $K(x,x^{\prime})=k(x,x^{\prime})\otimes Q=\sum_{v=1}^{V}\beta_{v}K_{v}(x,x^{\prime})$ , where we have used the bilinearity of the Kronecker product. Each $K_{v}(x,x^{\prime})=k_{v}(x,x^{\prime})\otimes Q$ corresponds to an RKHS according to the study of RKHS for the vector-valued functions [8]. Thus, the kernel $K$ is associated with an RKHS $\mathcal{H}_{K}$ . This functional framework motivates the MV3MR framework. We will jointly learn the linear combination coefficients $\{\beta_{v}\}$ to integrate kernels for characterizing different views and the classifier coefficients $\{a_{i}\}$ in a single optimization problem. Moreover, to effectively utilize the unlabeled data, we construct graph Laplacians for different views and learn to combine all of them.
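The bilinearity argument can be checked numerically. The following sketch, using randomly generated stand-in matrices (all values are hypothetical), confirms that combining the scalar-valued Gram matrices first and then applying $\otimes Q$ gives the same result as combining the per-view vector-valued Gram matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
V, N, n = 3, 5, 2

# Hypothetical per-view scalar Gram matrices (PSD by construction).
grams = []
for _ in range(V):
    A = rng.normal(size=(N, N))
    grams.append(A @ A.T)

B = rng.normal(size=(n, n))
Q = B @ B.T                        # any symmetric PSD n x n matrix
beta = np.array([0.2, 0.5, 0.3])   # non-negative weights summing to one

# Combine the scalar kernels first, then lift with Q ...
lhs = np.kron(sum(b * g for b, g in zip(beta, grams)), Q)
# ... or lift each view and combine the vector-valued Gram matrices.
rhs = sum(b * np.kron(g, Q) for b, g in zip(beta, grams))
```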

IV-B Problem formulation

Under the multi-view setting and the theme of manifold regularization, we propose to learn the vector-valued function $f$ by linearly combining the kernels and graphs from different views. The optimization problem is given by

[TABLE]

where $\beta=[\beta_{1},\ldots,\beta_{V}]^{T}$ and $\theta=[\theta_{1},\ldots,\theta_{V}]^{T}$ . Both $\gamma_{B}>0$ and $\gamma_{C}>0$ are trade-off parameters. The decision function takes the form $f(x)+b=\sum_{v}f^{v}(x)+b$ and belongs to an RKHS $\mathcal{H}_{K}$ associated with the kernel $K(x,x^{\prime})=\sum_{v}\beta_{v}K_{v}(x,x^{\prime})$ . We define $\mathcal{M}=\sum_{v}\theta_{v}\mathcal{M}_{v}$ , where each $\mathcal{M}_{v}$ is a vector-valued graph Laplacian constructed on $\mathcal{H}_{K_{v}}$ . It can be demonstrated that $\mathcal{M}$ is still a graph Laplacian.

Lemma 1

$\mathcal{M}\in S_{Nn}^{+}$ is a vector-valued graph Laplacian.

The notation $S_{n}^{+}$ denotes a set of $n\times n$ symmetric positive semi-definite matrices and we will use $S_{n}^{*}$ to denote a set of positive definite matrices. Then we have the following version of the Representer Theorem.

Theorem 1

For fixed sets of $\{\beta_{v}\}$ and $\{\theta_{v}\}$ , the minimizer of problem (6) admits an expansion

[TABLE]

where $a_{i}\in\mathcal{Y},1\leq i\leq N=u+l$ are some vectors to be estimated and $K(x,x_{i})=\sum_{v=1}^{V}\beta_{v}K_{v}(x,x_{i})$ . The proofs of Lemma 1 and Theorem 1 are detailed in the appendix.

The hinge loss $L(f,x_{i},y_{i})=(1-y_{i}f(x_{i}))_{+}$ is more suitable for classification than least squares loss since the hinge loss results in a better convergence rate and usually higher classification accuracy; we refer to [13] for a comparison of different popular loss functions. We adopt the hinge loss in MV3MR and derive MV3LSVM as follows.

IV-C Multi-view vector-valued Laplacian SVM

Under the SVM formulation, the minimization problem of MV3MR is

[TABLE]

An unregularized bias $b_{j}$ is often added to the solution $f_{j}(x)=\sum_{i=1}^{N}K_{j}(x,x_{i})a_{i}$ in the SVM formulation. By substituting (7) into the above formulation, we obtain the following primal problem:

[TABLE]

where $G=\sum_{v=1}^{V}\beta_{v}G_{v}$ is the combined vector-valued Gram matrix over the labeled and unlabeled samples defined on kernel $K$ , $\mathcal{M}=\sum_{v=1}^{V}\theta_{v}\mathcal{M}_{v}$ is the integrated vector-valued graph Laplacian. Here, $K_{j}(\cdot,\cdot)$ is the $j$ th row of the vector-valued kernel $K$ . We have three variables, i.e., $\mathbf{a}$ , $\beta$ and $\theta$ , to be optimized in (9). To solve this problem, we consider the following constrained optimization problem:

[TABLE]

where $F(\beta,\theta)$ equals

[TABLE]

Here, $G$ and $\mathcal{M}$ take the form as in (9). We can omit the terms $\gamma_{B}\|\beta\|_{2}^{2}$ and $\gamma_{C}\|\theta\|_{2}^{2}$ in (11) since $\beta$ and $\theta$ are fixed. By introducing the Lagrange multipliers $\mu_{ij}$ and $\eta_{ij}$ in (11), we have

[TABLE]

By taking the partial derivative w.r.t. $\xi_{ij}$ , $b_{j}$ , and setting them to be zero, we obtain

[TABLE]

A reduced Lagrangian can be obtained by substituting the above equalities back into (12), which leads to

[TABLE]

where $J=[I\ 0]\in\mathbb{R}^{(nl)\times(nl+nu)}$ and $I$ is an $nl\times nl$ identity matrix. Here, $\mu=\{\mu_{1},\ldots,\mu_{l}\}\in\mathbb{R}^{nl}$ is a column vector with each $\mu_{i}=[\mu_{i1},\ldots,\mu_{in}]^{T}$ , $Y_{d}=\mathrm{diag}(y_{11},\ldots,y_{1n},\ldots,\ldots,y_{l1},\ldots,y_{ln})$ and $\mathbf{1}$ is an all ones column vector. Taking the partial derivative of $W^{R}$ w.r.t. $\mathbf{a}$ and letting it be zero leads to:

[TABLE]

Substituting it back into (13) we get:

[TABLE]

where the matrix $S=Y_{d}JG(2\gamma_{A}I+2\gamma_{I}\mathcal{M}G)^{-1}J^{T}Y_{d}$ . As before, $G=\sum_{v=1}^{V}\beta_{v}G_{v}$ is the combined Gram matrix and $\mathcal{M}=\sum_{v=1}^{V}\theta_{v}\mathcal{M}_{v}$ is the integrated graph Laplacian. Because strong duality holds, the objective value of problem (11) is also the objective value of (13), which is $W^{R}(\mathbf{a}^{*},\mu^{*})$ . Therefore, we can rewrite (10) as
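For concreteness, the matrix $S$ can be assembled as in the sketch below. It follows the expression $S=Y_{d}JG(2\gamma_{A}I+2\gamma_{I}\mathcal{M}G)^{-1}J^{T}Y_{d}$ and solves a linear system instead of forming the inverse explicitly. The helper name `dual_matrix` and the toy $G$, $\mathcal{M}$ and labels are assumptions for illustration.

```python
import numpy as np

def dual_matrix(G, M, y_labeled, gamma_a, gamma_i):
    """S = Y_d J G (2 gamma_A I + 2 gamma_I M G)^{-1} J^T Y_d."""
    Nn = G.shape[0]
    nl = y_labeled.size
    J = np.hstack([np.eye(nl), np.zeros((nl, Nn - nl))])   # J = [I 0]
    Y_d = np.diag(y_labeled)
    A = 2.0 * gamma_a * np.eye(Nn) + 2.0 * gamma_i * (M @ G)
    # Solve A X = J^T Y_d rather than computing A^{-1}.
    X = np.linalg.solve(A, J.T @ Y_d)
    return Y_d @ J @ G @ X

# Toy problem (assumption): N = 6 points, n = 2 labels, l = 3 labeled points.
rng = np.random.default_rng(0)
N, n, l = 6, 2, 3
B = rng.normal(size=(N, N))
G = np.kron(B @ B.T + 1e-3 * np.eye(N), np.eye(n))   # positive definite Gram
W = np.ones((N, N)) - np.eye(N)                       # complete graph
L = np.diag(W.sum(axis=1)) - W
M = np.kron(L, np.eye(n))                             # vector-valued Laplacian
y = rng.choice([-1.0, 1.0], size=n * l)

S = dual_matrix(G, M, y, gamma_a=1e-2, gamma_i=1e-2)  # shape (n*l, n*l)
```

When $G$ is positive definite, $G(2\gamma_{A}I+2\gamma_{I}\mathcal{M}G)^{-1}=(2\gamma_{A}G^{-1}+2\gamma_{I}\mathcal{M})^{-1}$, so $S$ is symmetric positive semi-definite and can be passed to a standard quadratic programming or SVM solver.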

[TABLE]

For fixed $\theta$ , the above problem can be rewritten with respect to $\beta$ as

[TABLE]

where $h=[h_{1},\ldots,h_{V}]^{T}$ with each $h_{v}=(\mathbf{a}^{*})^{T}G_{v}J^{T}Y_{d}\mu^{*}-\gamma_{A}(\mathbf{a}^{*})^{T}G_{v}\mathbf{a}^{*}$ and $H$ is a $V\times V$ matrix with the entry $H_{ij}=\gamma_{I}(\mathbf{a}^{*})^{T}G_{i}\mathcal{M}G_{j}\mathbf{a}^{*}$ . One could simply set the derivative of $W(\beta)$ to zero, obtain $\beta=(H+H^{T}+2\gamma_{B}I)^{-1}h$ , and then project the computed $\beta$ onto the simplex to satisfy the summation and non-negativity constraints. However, such an approach lacks convergence guarantees and may lead to numerical problems. A coordinate descent algorithm is therefore used to solve (17). In each iteration round of the coordinate descent procedure, two elements $\beta_{i}$ and $\beta_{j}$ are selected to be updated while the others are fixed. By using the Lagrangian of problem (17) and considering that $\beta_{i}+\beta_{j}$ will not change due to the constraint $\sum_{v=1}^{V}\beta_{v}=1$ , we have the following solution for updating $\beta_{i}$ and $\beta_{j}$

[TABLE]

where $t_{ij}=(H_{ii}-H_{ji}-H_{ij}+H_{jj})\beta_{i}-\sum_{k}(H_{ik}-H_{jk})\beta_{k}$ . The obtained $\beta_{i}^{*}$ or $\beta_{j}^{*}$ may violate the constraint $\beta_{v}\geq 0$ . Thus, we set

[TABLE]

From the solution (18), we can see that the update criterion tends to assign a larger value $\beta_{i}$ to a view with a larger $h_{i}$ and a smaller $H_{ii}$ . This is reasonable because $h_{i}=(\mathbf{a}^{*})^{T}G_{i}J^{T}Y_{d}\mu^{*}-\gamma_{A}(\mathbf{a}^{*})^{T}G_{i}\mathbf{a}^{*}$ and $H_{ii}=\gamma_{I}(\mathbf{a}^{*})^{T}G_{i}\mathcal{M}G_{i}\mathbf{a}^{*}$ measure the discriminative ability and the performance of the $i$ ’th view. Let $(\mathbf{a}_{i}^{*},\mu_{i}^{*})$ be the solution for the optimization problem of the $i$ ’th view, which is $W^{R}(\mathbf{a},\mu)$ with $G=G_{i}$ . If all the solutions are the same, i.e., $(\mathbf{a}_{1}^{*},\mu_{1}^{*})=\ldots=(\mathbf{a}_{V}^{*},\mu_{V}^{*})=(\mathbf{a}^{*},\mu^{*})$ , then the objective value $W^{R}(\mathbf{a}_{i}^{*},\mu_{i}^{*})$ of a discriminative view tends to be smaller than that of a non-discriminative view (we assume that all Gram matrices have been normalized). A smaller $W^{R}(\mathbf{a}_{i}^{*},\mu_{i}^{*})$ corresponds to a larger $h_{i}$ and a smaller $H_{ii}$ , and thus our algorithm prefers discriminative views. However, the solutions $(\mathbf{a}_{1}^{*},\mu_{1}^{*}),\ldots,(\mathbf{a}_{V}^{*},\mu_{V}^{*})$ may not be exactly the same as $(\mathbf{a}^{*},\mu^{*})$ . Thus the learned $\beta_{i}$ is in general, but not strictly, consistent with the performance of the $i$ ’th single view. We can see this in the experiments.
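The pairwise update can be sketched in code as follows. This is a minimal illustration that assumes the objective has the form $W(\beta)=\beta^{T}H\beta+\gamma_{B}\|\beta\|_{2}^{2}-h^{T}\beta$, which is inferred from the stationarity condition $\beta=(H+H^{T}+2\gamma_{B}I)^{-1}h$ stated above. The update is derived directly from that assumed objective rather than copied from (18), and the names `objective` and `pairwise_coordinate_descent` as well as the toy $H$ and $h$ are hypothetical.

```python
import numpy as np

def objective(beta, H, h, gamma_b):
    # Assumed form of W(beta); see the lead-in paragraph.
    return beta @ H @ beta + gamma_b * beta @ beta - h @ beta

def pairwise_coordinate_descent(H, h, gamma_b, beta0, n_sweeps=100):
    """Minimize the assumed W(beta) over the simplex by updating pairs."""
    beta = np.asarray(beta0, dtype=float).copy()
    V = len(beta)
    A = 0.5 * (H + H.T) + gamma_b * np.eye(V)   # symmetric quadratic part
    for _ in range(n_sweeps):
        for i in range(V):
            for j in range(i + 1, V):
                grad = 2.0 * A @ beta - h
                curv = 2.0 * (A[i, i] - 2.0 * A[i, j] + A[j, j])
                if curv <= 0.0:
                    continue
                # Best shift along e_i - e_j, which keeps beta_i + beta_j fixed.
                delta = -(grad[i] - grad[j]) / curv
                # Clip so that both coordinates remain non-negative.
                delta = min(max(delta, -beta[i]), beta[j])
                beta[i] += delta
                beta[j] -= delta
    return beta

# Toy data (assumption): V = 3 views, H built as a scaled Gram matrix.
rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 5))
H = 0.1 * Z @ Z.T
h = rng.normal(size=3)
beta0 = np.full(3, 1.0 / 3.0)
beta = pairwise_coordinate_descent(H, h, gamma_b=0.1, beta0=beta0)
```

Because each pairwise step is an exact one-dimensional minimization followed by clipping, the iterate stays on the simplex and the objective never increases.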

For fixed $\beta$ , the problem (16) can be simplified as

[TABLE]

where $s=[s_{1},\ldots,s_{V}]^{T}$ with each $s_{v}=\gamma_{I}(\mathbf{a}^{*})^{T}G\mathcal{M}_{v}G\mathbf{a}^{*}$ . Similarly, the solution of (19) can be obtained by coordinate descent, and the criterion for updating $\theta_{i}$ and $\theta_{j}$ in an iteration round is given by

[TABLE]

We now summarize the learning procedure of the proposed multi-view vector-valued Laplacian SVM (MV3LSVM) in Algorithm 1.

The stopping criterion for terminating the algorithm can be the difference of the objective value, $W^{R}(\mathbf{a},\mu)+\gamma_{B}\|\beta\|_{2}^{2}+\gamma_{C}\|\theta\|_{2}^{2}$ , between two consecutive steps. Alternatively, we can stop the iterations when the variations of $\beta$ and $\theta$ are both smaller than a pre-defined threshold. Our implementation is based on the difference of the objective value, i.e., the iteration stops if the value $|O_{k}-O_{k-1}|/|O_{k}-O_{0}|$ is smaller than a predefined threshold, where $O_{k}$ is the objective value of the $k$ th iteration step.
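The relative-change test on the objective value can be written as a small helper. The function name `should_stop`, the default tolerance and the handling of a zero denominator are assumptions, since the threshold value is not specified here.

```python
def should_stop(objective_values, tol=1e-3):
    """Return True when |O_k - O_{k-1}| / |O_k - O_0| < tol.

    objective_values holds [O_0, O_1, ..., O_k] in iteration order.
    """
    if len(objective_values) < 2:
        return False
    o_0 = objective_values[0]
    o_prev = objective_values[-2]
    o_k = objective_values[-1]
    denom = abs(o_k - o_0)
    if denom == 0.0:
        # Assumption: no change since initialization is treated as converged.
        return True
    return abs(o_k - o_prev) / denom < tol
```

Note that at $k=1$ the ratio equals one, so the test cannot fire before the second iteration.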

IV-D Convergence analysis

In this section, we discuss the convergence of the proposed MV3LSVM algorithm. We first prove the convexity of problems (11), (17) and (19) as follows.

Proof:

The Hessian matrix of the objective function of (11) is $H_{e}(\mathbf{a})=\gamma_{A}G+\gamma_{I}G\mathcal{M}G$ . The Gram matrix $G\in S_{Nn}^{+}$ and we assume that $G$ is positive definite in this paper (to enforce this property a small ridge is added to the diagonal of $G$ ). The second term is positive semi-definite since $x^{T}G\mathcal{M}Gx=z^{T}\mathcal{M}z\geq 0$ for any $x$ and $z=Gx$ . Here, we have used the property of the graph Laplacian $\mathcal{M}\in S_{Nn}^{+}$ . Then $H_{e}(\mathbf{a})\in S_{Nn}^{*}$ for $\gamma_{A}>0$ and problem (11) is strictly convex.

For the problem (17), the Hessian matrix is $H_{e}(\beta)=H+\gamma_{B}I$ . The matrix $H$ is symmetric since the element $H_{ij}=H_{ij}^{T}=\gamma_{I}\mathbf{a}^{T}G_{j}\mathcal{M}G_{i}\mathbf{a}=H_{ji}$ . In addition, the Cholesky decomposition $\mathcal{M}=P^{T}P$ exists since $\mathcal{M}\in S_{Nn}^{+}$ . Letting $z_{i}=PG_{i}\mathbf{a}$ , we have $H_{ij}=\gamma_{I}z_{i}^{T}z_{j}$ , so $H$ is a scaled Gram matrix of the vectors $z_{1},\ldots,z_{V}$ . Thus, $H\in S_{V}^{+}$ and $H_{e}(\beta)\in S_{V}^{*}$ for $\gamma_{B}>0$ . This means that (17) is also strictly convex.
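The positive semi-definiteness of $H$ is easy to confirm numerically. The sketch below builds $H_{ij}=\gamma_{I}\mathbf{a}^{T}G_{i}\mathcal{M}G_{j}\mathbf{a}$ from random stand-in matrices (the function name `build_H` and all toy inputs are assumptions) and checks symmetry and the sign of the smallest eigenvalue.

```python
import numpy as np

def build_H(a, grams, M, gamma_i):
    """H_ij = gamma_I * a^T G_i M G_j a for symmetric Gram matrices G_v."""
    Ga = np.stack([G @ a for G in grams])   # row v holds G_v a
    return gamma_i * Ga @ M @ Ga.T

rng = np.random.default_rng(0)
Nn, V = 8, 3
grams = []
for _ in range(V):
    B = rng.normal(size=(Nn, Nn))
    grams.append(B @ B.T)                   # symmetric PSD stand-in for G_v
C = rng.normal(size=(Nn, Nn))
M = C.T @ C                                 # PSD stand-in for the Laplacian
a = rng.normal(size=Nn)

H = build_H(a, grams, M, gamma_i=0.1)
min_eig = np.linalg.eigvalsh(H).min()       # non-negative up to round-off
```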

Finally, it is straightforward to verify that the problem (19) is strictly convex for $\gamma_{C}>0$ . This completes the proof. ∎

Now we discuss the convergence of our algorithm. Let the objective function of problem (9) be $R(\mathbf{a},b,\xi,\beta,\theta)$ and its value at the start of the $k$ th iteration be $R(\mathbf{a}^{k},b^{k},\xi^{k},\beta^{k},\theta^{k})$ . Since the problem (11) is convex, we have $R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k},\theta^{k})\leq R(\mathbf{a}^{k},b^{k},\xi^{k},\beta^{k},\theta^{k})$ . We suppose that problem (9) is exactly solved, which means that the duality gap is zero. Then $R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k},\theta^{k})=W(\beta^{k},\theta^{k})$ . For fixed $\theta^{k}$ , we obtain the convex problem (17), thus we have $R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k+1},\theta^{k})\leq R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k},\theta^{k})$ . Similarly, due to the convexity of problem (19), we have $R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k+1},\theta^{k+1})\leq R(\mathbf{a}^{k+1},b^{k+1},\xi^{k+1},\beta^{k+1},\theta^{k})$ . Therefore, the objective value is non-increasing over the iterations. Since the slack, norm and regularization terms of the objective are all non-negative, the objective is also bounded below, and the convergence of our algorithm is guaranteed.

IV-E Complexity analysis

For the proposed MV3LSVM, the complexity is dominated by the time cost of computing $a^{*}$ in each iteration, where the computation of the matrix $S$ in (15) involves an inversion and several multiplications of $nN\times nN$ matrices, and the time complexity is $O(n^{2.8}N^{2.8})$ using the Strassen algorithm [44]. Problem (15) can be solved using a standard SVM solver with the time complexity $O(n^{2.3}l^{2.3})$ according to the sequential minimal optimization (SMO) [45]. The computations of $\beta$ and $\theta$ are quite efficient since their dimensionality is $V$ , which is usually very small (e.g., $V=7$ in our experiments). Suppose the number of iterations is $k$ ; then the total cost of MV3LSVM is $O(k(n^{2.8}N^{2.8}+n^{2.3}l^{2.3}))$ . Since $l<N$ , the time cost is $O(kn^{2.8}N^{2.8})$ , which is $k$ times that of the case in which no combination coefficients ( $\beta$ and $\theta$ ) are learned. The experimental results in Section V-B show that $k$ is very small, since our algorithm needs only a few iterations (around five) to converge. There is a trade-off between the time complexity and the classification accuracy: if only a limited number of unlabeled samples are selected to construct the input graph Laplacians, i.e., $N=u+l$ is small, then the time complexity can be reduced with an acceptable sacrifice in performance. In our experiments, we obtain satisfactory accuracy by setting $N=1000$ , and the time cost is acceptable.

V Experiment

We validate the effectiveness of MV3LSVM on two challenging datasets, PASCAL VOC’07 (VOC) [14] and MIR Flickr (MIR) [15]. The VOC dataset contains about 10,000 images labeled with 20 categories. The MIR dataset consists of 25,000 images of 38 concepts. For the PASCAL VOC’07 dataset [14], we use the standard train/test partition [14], which splits 9,963 images into a training set of 5,011 images and a test set of 4,952 images. For the MIR Flickr dataset [15], images are randomly split into equally sized training and test sets. For both datasets, we randomly select twenty percent of the test images for validation and the rest for testing. The parameters of all the algorithms compared in our experiments are tuned by using the validation set. This means that the parameters corresponding to the best performance in the validation set are used for the transductive inference and inductive test. From the training examples, 10 random choices of $l\in\{100,200,500\}$ labeled samples are used in our experiments.

We use several visual views and the tag feature according to [2]. The visual views include SIFT features [3], local hue histograms [46], global GIST descriptor [4] and some color histograms (RGB, HSV and LAB). The local descriptors (SIFT and hue) are computed densely on the multi-scale grid and quantized using k-means, which results in a visual word histogram for each image. Together with the tag feature, this gives 7 different representations in total. We pre-compute a scalar-valued Gram matrix for each view and normalize it to unit trace. For the visual representations, the kernel is defined by

[TABLE]

where $d(x_{i},x_{j})$ denotes the distance between $x_{i}$ and $x_{j}$ , $\lambda=\mathrm{max}_{i,j}d(x_{i},x_{j})$ . Following [2], we choose the $L1$ distance for the color histogram representations (RGB, HSV and LAB), and $L2$ for GIST and $\chi^{2}$ for the visual word histograms (SIFT and hue). For the tag features, a linear kernel $k(x_{i},x_{j})=x_{i}^{T}x_{j}$ is constructed.
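The per-view Gram matrices could be pre-computed along the following lines. The sketch assumes the exponential form $k(x_{i},x_{j})=\exp(-d(x_{i},x_{j})/\lambda)$ with $\lambda=\max_{i,j}d(x_{i},x_{j})$ and a $\chi^{2}$ distance with a factor of $1/2$; both are assumptions made for illustration, as are the function names and the random toy histograms.

```python
import numpy as np

def pairwise_distance(X, kind):
    """Pairwise L1, L2 or chi-square distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    if kind == "l1":
        return np.abs(diff).sum(-1)
    if kind == "l2":
        return np.sqrt((diff ** 2).sum(-1))
    if kind == "chi2":
        total = X[:, None, :] + X[None, :, :]
        safe = np.where(total > 0, total, 1.0)   # empty bins contribute zero
        return 0.5 * ((diff ** 2) / safe).sum(-1)
    raise ValueError(f"unknown distance: {kind}")

def exp_kernel(D):
    """Assumed kernel form exp(-d / lambda) with lambda = max distance."""
    return np.exp(-D / D.max())

def unit_trace(G):
    """Scale a Gram matrix so that its trace equals one."""
    return G / np.trace(G)

# Toy histograms (assumption): 5 images, 8 bins, rows sum to one.
rng = np.random.default_rng(0)
hist = rng.random((5, 8))
hist /= hist.sum(axis=1, keepdims=True)

grams = {kind: unit_trace(exp_kernel(pairwise_distance(hist, kind)))
         for kind in ("l1", "l2", "chi2")}

# Linear kernel for hypothetical binary tag vectors.
tags = (rng.random((5, 6)) > 0.5).astype(float)
tags[:, 0] = 1.0                                  # ensure a non-zero trace
gram_tags = unit_trace(tags @ tags.T)
```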

V-A Evaluation metrics

We use three kinds of evaluation criteria. The average precision (AP) and area under ROC curves (AUC) are utilized to evaluate the ranking performance under each label. We also use the ranking loss (RL) to study the performance of label set prediction for each instance.

$\bullet$ Average Precision (AP) evaluates the fraction of samples ranked above a particular positive sample [47]. For each label, there is a ranked sequence of samples returned by the classifier. A good classifier will rank most of the positive samples higher than the negative ones. The traditional AP is defined as

[TABLE]

where $k$ is a rank index of a positive sample and $\mathrm{P}(k)$ is the precision at the cut-off $k$ . In this paper, we choose to use the computing method as in the PASCAL VOC [14] challenge evaluation, i.e.

[TABLE]

where $\mathrm{P}(r)$ is the maximum precision over all recalls larger than $r\in\{0,0.1,0.2,\ldots,1.0\}$ . A larger value means a higher performance. In this paper, the mean AP, i.e. mAP over all labels, is reported to save space.

$\bullet$ Area Under ROC Curves (AUC) evaluates the probability that a positive sample will be ranked higher than a negative one by a classifier [48]. It is computed from an ROC curve, which depicts relative trade-offs between true positive (benefits) and false positive (costs). The AUC of a realistic classifier should be larger than 0.5. We refer to [48] for a detailed description. A larger value means a higher performance. Similar to AP, the mean AUC, i.e. mAUC over all labels, is reported.

$\bullet$ Ranking Loss (RL) evaluates the fraction of label pairs that are incorrectly ranked [19, 21]. Given a sample $x_{i}$ and its label set $Y_{i}$ , a successful classifier $f(x,y)$ should have larger value for $y\in Y_{i}$ than those $y\not\in Y_{i}$ . Then the ranking loss for the $i$ th sample is defined as:

[TABLE]

where $P$ is the total number of labels and $|\cdot|$ denotes the cardinality of a set. The smaller the value, the higher the performance. The mean value over all samples is computed for evaluation.
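The three criteria can be sketched in plain Python as below. The AP function follows the 11-point interpolation of the PASCAL VOC'07 protocol (taking the maximum precision at recall at least $r$), the AUC is computed as the fraction of correctly ordered positive-negative pairs, and the ranking loss uses the common normalization $|Y_{i}|(P-|Y_{i}|)$. Function names, tie handling and the toy scores are assumptions rather than the evaluation code used in the experiments.

```python
def voc07_ap(scores, labels):
    """11-point interpolated AP; labels equal to 1 are positives.

    Assumes at least one positive sample.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(1 for y in labels if y == 1)
    tp = 0
    precisions, recalls = [], []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
        precisions.append(tp / rank)
        recalls.append(tp / n_pos)
    ap = 0.0
    for t in range(11):
        r = t / 10.0
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += (max(candidates) if candidates else 0.0) / 11.0
    return ap

def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def ranking_loss(label_scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs that are mis-ordered."""
    P = len(label_scores)
    irrelevant = [j for j in range(P) if j not in relevant]
    if not relevant or not irrelevant:
        return 0.0
    bad = sum(1 for p in relevant for q in irrelevant
              if label_scores[p] <= label_scores[q])
    return bad / (len(relevant) * len(irrelevant))

# Toy example: four samples scored for one label.
scores = [0.9, 0.4, 0.6, 0.2]
labels = [1, 1, 0, 0]
ap_value = voc07_ap(scores, labels)    # equals 28/33 for this ranking
auc_value = auc(scores, labels)        # equals 0.75 for this ranking
rl_value = ranking_loss([0.9, 0.5, 0.1], {0, 2})   # equals 0.5
```

Averaging `voc07_ap` and `auc` over labels gives mAP and mAUC, and averaging `ranking_loss` over samples gives RL.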

V-B Performance enhancement with multi-view learning

It has been shown in [8] that VVMR performs well for transductive semi-supervised multi-label classification and can provide a high-quality out-of-sample generalization. The proposed MV3MR framework is a multi-view generalization of VVMR that incorporates the advantage from MKL for handling multi-view data. Therefore, we first evaluate the effectiveness of learning the view combination weights using the proposed multi-view learning algorithm for transductive semi-supervised multi-label classification. An out-of-sample evaluation will be presented in the next subsection. The experimental setup of the two compared methods is given as follows.

$\bullet$ VVLSVM: vector-valued Laplacian SVM, which is an SVM implementation of the vector-valued manifold regularization framework that exploits both the geometry of the input data and the label correlations. We do not use the vector-valued Laplacian RLS presented in [8] for comparison because the hinge loss is more suitable for classification. The parameters $\gamma_{A}$ and $\gamma_{I}$ in (2) are both optimized over the set $\{10^{i}|i=-8,-7,\ldots,-2,-1\}$ . We set the parameter $\gamma_{O}$ in (3) to 1.0 since it has been demonstrated empirically in [8] that with a larger $\gamma_{O}$ , the performance will usually be better. The means of the multiple Gram matrices and of the input graph Laplacians are pre-computed for experiments. The numbers of nearest neighbors for constructing the input and output graph Laplacians are tuned on the sets $\{10,20,\ldots,100\}$ and $\{2,4,\ldots,20\}$ respectively.

$\bullet$ MV3LSVM: an SVM implementation of the proposed MV3MR framework that combines multiple views by constructing kernels for all views and learning their weights. We tune the parameters $\gamma_{A}$ and $\gamma_{I}$ as in VVLSVM and $\gamma_{O}$ is set to 1.0. The additional parameters $\gamma_{B}$ and $\gamma_{C}$ are optimized over $\{10^{i}|i=-8,-7,\ldots,-2,-1\}$ . We first learn only the kernel combination coefficients $\beta$ and set the graph weights $\theta$ to be uniform (MV3LSVM1 in Fig. 4). Then we learn both $\beta$ and $\theta$ in MV3LSVM2. We use $20$ and $6$ nearest neighbor graphs to construct the input and output normalized graph Laplacians respectively for the VOC dataset, while $30$ and $8$ nearest neighbor graphs are used in the experiments on MIR. We set these hyperparameters to be the same as those in VVLSVM and no further optimization was attempted.

The experimental results on the two datasets are shown in Fig. 4. We can see that learning the combination weights using our algorithm is always superior to simply using the uniform weights for different views. We also find that when the number of labeled samples increases, the improvement becomes small. This is because the multi-view learning actually helps to approximate the underlying data distribution. This approximation can be steadily improved with the increase of the number of labeled samples, and thus the significance of the multi-view learning to the approximation gradually decreases. Besides, we observe that $\beta$ has more influence on the final performance overall.

We show the behavior of the objective values as the iteration number increases in Fig. 5. From the figure, we can see that only a few iterations (about five) are necessary to obtain a satisfactory solution. Thus the time cost is only a small multiple of that of the VVLSVM algorithm, and this extra cost is justified by the performance enhancement.

Finally, our algorithm is not sensitive to different initializations, as shown in Fig. 6. In particular, we run our algorithm with 10 random choices of $\beta$ and $\theta$ . We show the performance in terms of mAP, mAUC and RL on the two datasets in Fig. 6. It can be observed that the performance curves do not vary a lot with different initializations.

V-C Out-of-sample generalization

The second set of experiments evaluates the out-of-sample extension quality of the MV3MR framework, using the SVM implementation. Fig. 7 compares the transductive performance to the inductive performance when using $l=200$ labeled samples. We show a scatter plot of the AP scores for each label on the two datasets by using 10 random choices of labeled data. We can see that our algorithm generalizes well from the unlabeled set to the unseen set. The MV3MR framework inherits a strong natural out-of-sample generalization ability that many semi-supervised multi-label methods do not naturally have [8]. Besides, most graph-based semi-supervised learning algorithms are transductive and additional induction schemes are necessary to handle new points [49].

V-D Analysis of the combination coefficients in multi-view learning

In the following, we present empirical analyses of the multi-view learning procedure. In Fig. 8, we select $l=200$ and present the view combination coefficients $\beta$ and $\theta$ learned by MV3LSVM, together with the mAP by using VVLSVM for each view. From the results, we find that the trends of the kernel and graph weights are both consistent with the corresponding mAP in general, i.e., the views with a higher classification performance tend to be assigned larger weights, taking the DenseSIFT visual view (the 2nd view) and the tag (the last view) for example. However, a larger weight may sometimes be assigned to a less discriminative view; for example, the weight of HSV (the 4th view) is larger than the weight of DenseSIFT (the 2nd view). This is mainly because the coefficient $\mathbf{a}$ is not optimal for every single view, in which only $G_{v}$ and $\mathcal{M}_{v}$ are utilized. The learned $\mathbf{a}$ is the minimizer of the optimization problem (8) with the combined Gram matrix $G$ and integrated graph Laplacian $\mathcal{M}$ , which means that the learned vector-valued function is smooth along the combined RKHS and the integrated manifold. In this way, the proposed algorithm effectively exploits the complementary property of different views.

V-E Comparisons with multi-label and multi-kernel learning algorithms

Our last set of experiments is to compare MV3LSVM with several competitive multi-label methods as well as a well-known MKL algorithm in predicting the unknown labels of the unlabeled data. The out-of-sample generalization ability of our method has been verified in our second set of experiments.

We specifically compare MV3LSVM with the following methods on the challenging VOC and MIR datasets:

$\bullet$ SVM_CAT: concatenating the features of each view and then running standard SVM. The parameter $C$ is tuned on the set $\{10^{i}|i=-1,0,\ldots,7,8\}$ . The time complexity is $O(nl^{2.3})$ [45].

$\bullet$ SVM_UNI: combining different kernels with uniform weights and then running standard SVM. The parameter $C$ is tuned on the set $\{10^{i}|i=-1,0,\ldots,7,8\}$ . The time complexity is $O(nl^{2.3})$ [45].

$\bullet$ MLCS [18]: a multi-label compressed sensing algorithm that takes advantage of the sparsity of the labels. We choose the label compression ratio to be 1.0 since the number of labels $n$ is not very large. The mean of the multiple kernels from different views is pre-computed for experiments. Suppose the length of the compressed label vector (for each sample) is $r\leq n$ . Then the training cost is $O(nl^{3})$ if we choose the regression algorithm to be the least squares [50], and the reconstruction complexity is $O(l(n^{3}+rn^{2}))$ if the least angle regression (LARS) algorithm [51] is utilized. Considering that $r\leq n\leq l$ in this paper, the time complexity of MLCS is $O(nl^{3})$ .

$\bullet$ KLS_CCA [12]: a least-squares formulation of the kernelized canonical correlation analysis for multi-label classification. The ridge parameter is chosen from the candidate set $\{0,10^{i}|i=-3,-2,\ldots,2,3\}$ . The mean of multiple Gram matrices is pre-computed to run the algorithm. According to the discussion presented in [12], the time complexity is $O(n^{2}l+kn(3l+5d+2dl))$ , where $d$ is the feature dimensionality and $k$ is the number of iterations.

$\bullet$ SimpleMKL [16]: a popular SVM-based MKL algorithm that determines the combination of multiple kernels by a reduced gradient descent algorithm. The penalty factor $C$ is tuned on the set $\{10^{i}|i=-1,0,\ldots,7,8\}$ . We apply SimpleMKL to multi-label classification by learning a binary classifier for each label. According to Algorithm 1 presented in [16], there is an outer loop for updating the kernel weights, as well as an inner loop to determine the maximal admissible step size in the reduced gradient descent. Suppose the numbers of outer and inner iterations are $k_{1}$ and $k_{2}$ respectively; then the time complexity of SimpleMKL is approximately $O(nk_{1}k_{2}l^{2.3})$ , where we have ignored the time cost of the SVM solver in the inner loop since it has warm start and can be very fast [16].

$\bullet$ LpMKL [17]: a recently proposed MKL algorithm that extends MKL to the $l_{p}$ -norm with $p\geq 1$ . The penalty factor $C$ is tuned on the set $\{10^{i}|i=-1,0,\ldots,7,8\}$ and we choose the norm $p$ from the set $\{1,8/7,4/3,2,4,8,16,\infty\}$ . According to Algorithm 1 presented in [17], the time complexity is $O(nkl^{2.3})$ since the kernel combination coefficients can be computed analytically, where $k$ is the number of iterations.

The performance of the compared methods on the VOC dataset and MIR dataset is reported in Table I. The values in the last column of Table I are average ranks. From the results, we first observe that the performance keeps improving as the number of labeled samples increases. Second, the performance of the SimpleMKL algorithm, which learns the kernel weights for SVM, can be inferior to the multi-label algorithms with the mean kernel in many cases. MV3LSVM is superior to multi-view (SimpleMKL and LpMKL) and multi-label algorithms in general and consistently outperforms other methods in terms of mAP. The average rank of our algorithm is smaller than that of all the other methods in terms of all the three criteria. According to the Friedman test [52], the statistics $F_{F}$ of mAP, mAUC and RL are $56.05$ , $3.03$ , and $5.69$ respectively. All of them are larger than the critical value $F(6,30)=2.42$ , so we reject the null hypothesis (that the compared algorithms perform equally well). In particular, by comparing with SimpleMKL, we obtain a significant $8.1\%$ mAP improvement on VOC when using 100 labeled samples. The level of improvement drops when more labeled samples are available, for the same reason described in our first set of experiments.
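For readers who wish to reproduce this kind of test, the $F_{F}$ statistic can be computed from the average ranks with the Iman-Davenport correction of the Friedman statistic. The sketch below assumes $k=7$ compared algorithms and $N=6$ settings (two datasets times three values of $l$), which matches the degrees of freedom $(6,30)$ quoted above. The average ranks used here are hypothetical placeholders, not the values of Table I.

```python
def friedman_ff(avg_ranks, n_settings):
    """Friedman chi-square and its Iman-Davenport F correction.

    avg_ranks: average rank of each of the k algorithms over N settings.
    Returns (chi2_F, F_F, (df1, df2)).
    """
    k = len(avg_ranks)
    N = n_settings
    sum_sq = sum(r * r for r in avg_ranks)
    chi2 = 12.0 * N / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff, (k - 1, (k - 1) * (N - 1))

# Hypothetical average ranks for 7 algorithms (they sum to k(k+1)/2 = 28).
ranks = [1.5, 2.5, 3.0, 4.0, 4.5, 6.0, 6.5]
chi2, ff, dof = friedman_ff(ranks, n_settings=6)
reject_null = ff > 2.42   # critical value F(6, 30) quoted in the text
```

Note that the correction is undefined when the rankings agree perfectly across all settings, because the denominator $N(k-1)-\chi^{2}_{F}$ then vanishes.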

VI Conclusion and Discussion

Most of the existing works on multi-label image classification use only a single feature representation, and the multiple feature methods usually assume that a single label is assigned to an image. However, an image is usually associated with multiple labels and different kinds of features are necessary to describe the image properly. Therefore, we have developed multi-view vector-valued manifold regularization (MV3MR) for multi-label image classification in which images are naturally characterized by multiple views. MV3MR combines different kinds of features in the learning process of the vector-valued function for multi-label classification. We also derived an SVM formulation of MV3MR, which results in MV3LSVM. The new algorithm effectively exploits the label correlations and learns the view weights to integrate the consistency and complementary properties of different views. We evaluate the proposed algorithm in terms of three popular criteria, i.e. mAP, mAUC and RL. Extensive experiments on two challenging datasets, PASCAL VOC’07 and MIR Flickr, show that the support vector machine based implementation under MV3MR outperforms the traditional multi-label algorithms as well as a well-known multiple kernel learning method. Furthermore, our method provides a strategy for learning from multiple views in multi-label classification and can be extended to other multi-label algorithms.

Appendix A PROOF OF LEMMA 1

Proof:

The matrix $\mathcal{M}=\sum_{v}\theta_{v}\mathcal{M}_{v}=\sum_{v}\theta_{v}(\mathcal{L}_{v}\otimes I_{n})=\mathcal{L}\otimes I_{n}$, where $\mathcal{L}=\sum_{v}\theta_{v}\mathcal{L}_{v}$ is a convex combination of the scalar-valued graph Laplacians constructed from the different views. We have $\mathcal{L}\in S_{N}^{+}$ because each $\mathcal{L}_{v}\in S_{N}^{+}$ and the weights $\theta_{v}$ are nonnegative, and thus $\mathcal{M}\in S_{Nn}^{+}$ because the Kronecker product preserves positive (semi)definiteness. Here, $\mathcal{L}=\sum_{v}\theta_{v}(\mathcal{D}_{v}-\mathcal{W}_{v})$ can be computed from the following adjacency graph:

[TABLE]

where $N(x)$ denotes the set of the $k$-nearest neighbors of $x$ and $\mathcal{W}_{vij}$ is the similarity between the $i$th and $j$th points in the $v$th view. Thus $\mathcal{L}$ is a graph Laplacian and $\mathcal{M}$ is the corresponding vector-valued graph Laplacian. ∎
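The claim of Lemma 1 can be checked numerically on toy data. The sketch below builds one $k$-nearest-neighbor graph Laplacian per view, forms their convex combination $\mathcal{L}$, and verifies that $\mathcal{M}=\mathcal{L}\otimes I_{n}$ is symmetric with no negative eigenvalues. The Gaussian similarity, the symmetrization rule (an edge exists if either point is among the other's neighbors), the random features, and the helper name `knn_laplacian` are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def knn_laplacian(X, k=3, sigma=1.0):
    """Unnormalized graph Laplacian D - W of a symmetrized k-NN graph.

    Assumption: Gaussian similarity exp(-d^2 / (2 sigma^2)) on k-NN edges.
    """
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    masked = d2.copy()
    np.fill_diagonal(masked, np.inf)                      # exclude self-matches
    W = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(masked[i])[:k]                  # k nearest neighbors of x_i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                # keep edge if either side is a neighbor
    return np.diag(W.sum(axis=1)) - W


rng = np.random.default_rng(0)
N, n = 12, 3                                   # N samples, n labels (toy sizes)
views = [rng.normal(size=(N, 5)),              # hypothetical view 1 features
         rng.normal(size=(N, 8))]              # hypothetical view 2 features
theta = np.array([0.3, 0.7])                   # convex weights: nonnegative, sum to 1

L = sum(t * knn_laplacian(X) for t, X in zip(theta, views))
M = np.kron(L, np.eye(n))                      # vector-valued graph Laplacian
min_eig = np.linalg.eigvalsh(M).min()          # should be >= 0 up to round-off
```

Because each row of $\mathcal{L}$ sums to zero, the smallest eigenvalue is zero rather than strictly positive, so the check is for nonnegativity up to floating-point error.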

Appendix B PROOF OF THE REPRESENTER THEOREM

Proof:

As presented in Section IV-A, there is an RKHS $\mathcal{H}_{K}$ associated with the vector-valued kernel $K$. In the manifold regularization framework, the probability distribution is assumed to be supported on a manifold $M$. We denote by $S=\{\sum_{i}K(x_{i},\cdot)a_{i}|x_{i}\in M,a_{i}\in\mathcal{Y}\}$ the linear space spanned by the kernels centered at the points on $M$. Any function $f\in\mathcal{H}_{K}$ can be decomposed as $f=f_{\parallel}+f_{\perp}$, with $f_{\parallel}\in S$ and $f_{\perp}\in S^{\perp}$. It has been proved in Lemma 1 that $\mathcal{M}$ is a graph Laplacian. Thus we can use $\mathcal{M}$ to induce an intrinsic norm $\|\cdot\|_{I}$, which satisfies $\|f\|_{I}=\|g\|_{I}$ for any $f,g\in\mathcal{H}_{K}$ with $(f-g)|_{M}\equiv 0$. By the reproducing property, $f_{\perp}$ vanishes on $M$ [9]. This means that for any $x\in M$ we have $f(x)=f_{\parallel}(x)$, and hence $\|f\|_{I}=\|f_{\parallel}\|_{I}$. Moreover, $\|f\|_{K}^{2}=\|f_{\parallel}\|_{K}^{2}+\|f_{\perp}\|_{K}^{2}\geq\|f_{\parallel}\|_{K}^{2}$, and thus we conclude that the minimizer of problem (6) must lie in $S$ for fixed $\beta$ and $\theta$. Furthermore, because the geometry of $M$ is approximated by the Laplacian of the graph constructed from the labeled and unlabeled samples, we have $S=\{\sum_{i=1}^{l+u}K(x_{i},\cdot)a_{i}|a_{i}\in\mathcal{Y}\}$. This completes the proof. ∎
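The two facts used in the proof, that $f_{\perp}$ vanishes on $M$ and that the RKHS norm splits across the decomposition, can be spelled out as follows. This is a sketch under stated assumptions: it uses the standard reproducing property of a vector-valued RKHS, $\langle f(x),a\rangle_{\mathcal{Y}}=\langle f,K(x,\cdot)a\rangle_{\mathcal{H}_{K}}$, and the inner-product notation $\langle\cdot,\cdot\rangle_{\mathcal{Y}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{H}_{K}}$, which are not introduced in this section.

```latex
\begin{align*}
\langle f_{\perp}(x), a\rangle_{\mathcal{Y}}
  &= \langle f_{\perp}, K(x,\cdot)a\rangle_{\mathcal{H}_{K}} = 0
  && \text{for all } x\in M,\ a\in\mathcal{Y},
     \text{ since } K(x,\cdot)a\in S \text{ and } f_{\perp}\in S^{\perp},\\
\|f\|_{K}^{2}
  &= \|f_{\parallel}\|_{K}^{2}
     + 2\langle f_{\parallel}, f_{\perp}\rangle_{\mathcal{H}_{K}}
     + \|f_{\perp}\|_{K}^{2}
   = \|f_{\parallel}\|_{K}^{2} + \|f_{\perp}\|_{K}^{2}
  && \text{since } \langle f_{\parallel}, f_{\perp}\rangle_{\mathcal{H}_{K}} = 0.
\end{align*}
```

The first line gives $f_{\perp}(x)=0$ for every $x\in M$. Assuming, as is usual in manifold regularization, that the loss and the intrinsic term depend on $f$ only through its values on $M$, replacing $f$ by $f_{\parallel}$ leaves those terms unchanged and cannot increase $\|f\|_{K}^{2}$, which is why the minimizer lies in $S$.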

Bibliography

[1] M. Boutell, J. Luo, X. Shen, and C. Brown, “Learning multi-label scene classification,” Pattern Recognition, vol. 37, no. 9, pp. 1757–1771, 2004.
[2] M. Guillaumin, J. Verbeek, and C. Schmid, “Multimodal semi-supervised learning for image classification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 902–909.
[3] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[4] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[5] E. Çesmeli and D. Wang, “Texture segmentation using Gaussian-Markov random fields and neural oscillator networks,” IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 394–404, 2001.
[6] D. Masip and J. Vitrià, “Shared feature extraction for nearest neighbor face recognition,” IEEE Transactions on Neural Networks, vol. 19, no. 4, pp. 586–595, 2008.
[7] C. Micchelli and M. Pontil, “On learning vector-valued functions,” Neural Computation, vol. 17, no. 1, pp. 177–204, 2005.
[8] H. Minh and V. Sindhwani, “Vector-valued manifold regularization,” in Proceedings of the 28th International Conference on Machine Learning, 2011, pp. 57–64.