EE-AE: An Exclusivity Enhanced Unsupervised Feature Learning Approach

Jingcai Guo; Song Guo

arXiv:1904.00172·cs.LG·April 2, 2019


TL;DR

This paper introduces EE-AE, an innovative unsupervised autoencoder method that incorporates an exclusivity concept to enhance feature learning, addressing overfitting and improving robustness and discriminability of data representations.

Contribution

It is the first to integrate the exclusivity concept into autoencoder-based unsupervised feature learning, along with improvements to stacked AE structures for better layer connection.

Findings

1. Achieves the highest accuracy among the compared methods on COIL100 (94.03% average) and on the MNIST $\frac{1}{6}$-set (up to 95.23%).

2. Effectively reduces overfitting in autoencoder training.

3. Enhances robustness and discriminability of learned features.

Tables

Table 1: Accuracy (%) for COIL100 by number of objects

Method	20	40	60	80	100	Average
PCA [16]	90.41	89.27	87.26	85.80	84.17	87.37
KPCA [17]	90.12	88.06	85.75	83.97	82.52	86.08
NPE [18]	91.76	89.59	87.27	85.94	85.03	87.91
SCC [19]	91.49	89.39	85.91	83.91	82.91	86.72
SDNMF [20]	89.91	87.88	82.94	81.44	78.60	84.15
Denoise-AE [6]	91.75	90.52	88.58	86.35	85.17	88.47
Sparse-AE [8]	92.26	91.17	88.59	86.63	85.55	88.83
Graph-AE [9]	93.37	91.33	89.11	86.67	85.96	89.28
SSA-AE [11]	96.97	94.04	91.87	90.42	88.78	92.41
EE-AE (ours)	98.39	95.20	93.09	92.58	90.90	94.03

Table 2: Accuracy (%) for MNIST

Method	$\eta$	$1$-set	$\frac{1}{6}$-set
SAE [6]	-	98.60	90.37
SDAE [6]	-	98.72	91.04
EE-AE (ours)	0.0	98.43	93.56
EE-AE (ours)	0.2	98.59	94.33
EE-AE (ours)	0.4	98.67	94.89
EE-AE (ours)	0.6	98.67	95.23
EE-AE (ours)	0.8	98.81	94.61
EE-AE (ours)	1.0	98.73	94.57

Equations

$$ h = a_{e}(w_{e} \cdot x + b_{e}), \tag{1} $$

$$ \hat{x} = a_{d}(w_{d} \cdot h + b_{d}), \tag{2} $$

$$ L_{a} = \left\| x - \hat{x} \right\|_{2}^{2}. \tag{3} $$

$$ \lim S(x_{j}, \bar{C^{{}^{\prime}}}) \to \gamma, \tag{4} $$

$$ \lim S(x_{j}, \bar{D}) \to \delta, \tag{5} $$

$$ L_{a} = \sum_{i=1}^{n} \left\| x_{i} - \hat{x}_{i} \right\|_{2}^{2}. \tag{6} $$

$$ \Omega(\varepsilon): \begin{cases} \varepsilon_{i} & \text{if } \varepsilon_{i} \geq 0 \\ 0 & \text{if } \varepsilon_{i} < 0 \end{cases}, \tag{7} $$

$$ L_{h}^{(1)} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - h_{i}] \cdot h_{i}}{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - h_{i}] \, \| h_{i} \|} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - f_{e}(x_{i})] \cdot f_{e}(x_{i})}{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - f_{e}(x_{i})] \, \| f_{e}(x_{i}) \|}. \tag{8} $$

$$ L_{h}^{(2)} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{D^{(i)}}) - h_{i}] \cdot h_{i}}{\Omega[f_{e}(\bar{D^{(i)}}) - h_{i}] \, \| h_{i} \|} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{D^{(i)}}) - f_{e}(x_{i})] \cdot f_{e}(x_{i})}{\Omega[f_{e}(\bar{D^{(i)}}) - f_{e}(x_{i})] \, \| f_{e}(x_{i}) \|}. \tag{9} $$

$$ L_{h} = L_{h}^{(1)} + (1 - L_{h}^{(2)}). \tag{10} $$

$$ L = L_{a} + \lambda [L_{h}^{(1)} + (1 - L_{h}^{(2)})], \tag{11} $$

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Anomaly Detection Techniques and Applications

Methods: Autoencoders

Full text

Abstract

Unsupervised learning has become increasingly important in recent years. As one of its key components, the autoencoder (AE) aims to learn a latent feature representation of data that is more robust and discriminative. However, most AE based methods focus only on reconstruction within the encoder-decoder phase, which ignores the inherent relations of data, i.e., statistical and geometrical dependence, and easily causes overfitting. To deal with this issue, we propose an Exclusivity Enhanced (EE) unsupervised feature learning approach to improve the conventional AE. To the best of our knowledge, our research is the first to utilize such an exclusivity concept to cooperate with feature extraction within the AE. Moreover, in this paper we also make some improvements to the stacked AE structure, especially for the connection of different layers from the decoders, which can be regarded as a weight initialization trial. The experimental results show that our proposed approach achieves remarkable performance compared with other related methods.

**Index Terms:** Unsupervised learning, Exclusivity, Feature learning, Autoencoder.

1 Introduction and Related Work

With the development of deep learning techniques, more and more applications adopt deep neural networks (DNNs) to handle multiple tasks and achieve successive state-of-the-art results [1, 2]. However, most such DNNs fall under supervised learning [3], which places high demands on labeled data. Unlabeled data, on the other hand, is much easier to collect, and how to utilize it is the subject of unsupervised learning [4]. A natural idea is to combine the two: we apply DNNs in an unsupervised fashion, aiming to obtain a more robust and discriminative feature representation without a rich supply of labeled data.

As one key component of unsupervised learning, the autoencoder (AE) is a widely used and promising method that aims to learn a latent feature representation of data [5]. It is flexible to implement and can easily be stacked to form a deep structure, which endows a model with the ability to perform complex information extraction and representation [6]. The AE has an encoder-decoder structure. A raw data example $x$ is first projected by the encoder to a latent feature space, which may be more compressed, sparser, or have any other property we need:

$$ h = a_{e}(w_{e} \cdot x + b_{e}), \tag{1} $$

where $w_{e}$ is the encoder weight, $b_{e}$ is a bias, and $a_{e}$ is a nonlinear activation function that lets the network model complex problems. Next, the decoder projects the latent feature representation $h$ back to the original feature space, yielding a reconstructed data example $\hat{x}$:

$$ \hat{x} = a_{d}(w_{d} \cdot h + b_{d}), \tag{2} $$

where $w_{d}$ is the decoder weight, $b_{d}$ is a bias, and $a_{d}$ is a nonlinear activation function. The loss between $\hat{x}$ and $x$ is calculated as:

$$ L_{a} = \left\| x - \hat{x} \right\|_{2}^{2}. \tag{3} $$

The AE learns to minimize Eq. (3), which forces $h$ to retain the most expressive information in the data.
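The encoder, decoder, and reconstruction loss described above can be illustrated with a minimal sketch. It assumes a single fully connected layer on each side, a sigmoid for both $a_{e}$ and $a_{d}$, and randomly initialized weights; none of these choices come from the paper, and the helper names (`affine`, `encode`, `decode`, `random_matrix`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return weights . vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def encode(x, w_e, b_e):
    # h = a_e(w_e . x + b_e), with a sigmoid assumed for a_e
    return [sigmoid(z) for z in affine(w_e, b_e, x)]


def decode(h, w_d, b_d):
    # x_hat = a_d(w_d . h + b_d), with a sigmoid assumed for a_d
    return [sigmoid(z) for z in affine(w_d, b_d, h)]


def reconstruction_loss(x, x_hat):
    # L_a = ||x - x_hat||_2^2
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = [0.2, 0.9, 0.4, 0.7]                       # toy 4-dimensional example
w_e, b_e = random_matrix(2, 4, rng), [0.0, 0.0]
w_d, b_d = random_matrix(4, 2, rng), [0.0] * 4

h = encode(x, w_e, b_e)                        # 2-dimensional latent code
x_hat = decode(h, w_d, b_d)                    # reconstruction in input space
loss = reconstruction_loss(x, x_hat)
```

Training would adjust `w_e`, `b_e`, `w_d`, and `b_d` to reduce `loss`; the sketch only shows a single forward pass.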

However, if the AE is endowed with too much capacity, the encoder and decoder may simply co-adapt and eventually converge to an identity function [7]. In recent years, several regularization approaches have been proposed to address this issue, such as the Denoising AE [6], Sparse AE [8], Graph AE [9], Winner-take-all AE [10], and Similarity-aware AE [11]. Despite these efforts, most AE based methods focus only on reconstruction and may ignore some inherent relations of the data, i.e., statistical and geometrical dependence. We argue that this is an obstacle to learning a more robust and discriminative feature representation. Let us use handwritten digits as an illustration. In Fig. 1, starting from (a), we first obtain the global mean feature representation of the whole set. We then randomly choose an example, here a digit 4, as shown in (b). Next, we exclude (b) from the global feature representation and obtain (c). We can observe that the resulting feature representation (c) is radically different from the chosen digit 4. In an ideal state, flattening these two representations would give two orthogonal vectors. Although the ideal state is not fully attainable, this exclusivity intuitively exists. Taking the exclusivity into account, we can treat each example separately in its heterogeneous and homologous cases, and discover more latent information from the raw data.

This insight inspires us to propose a novel regularization technique to boost unsupervised feature learning within the AE. In this paper, we propose a novel AE based unsupervised feature learning approach called Exclusivity Enhanced Autoencoder (EE-AE), which introduces two exclusivity constraints that better handle the statistical and geometrical dependence of data. Moreover, we also improve the stacked AE structure, especially the connection of different layers on the decoder part, which makes full use of the layer differences and cooperation within the AE. Experiments on two benchmark datasets show that the proposed EE-AE outperforms other existing representative methods.

2 Exclusivity Enhanced Autoencoder

2.1 Exclusivity Concept

Data has its own inherent relations, i.e., statistical and geometrical dependence. In unsupervised learning we do not take labels into account, so what matters is how to extract useful information from the data itself. Suppose we have $n$ examples in total: $C=\left\{x_{1},x_{2},\cdots,x_{n}\right\}$, where $C$ is the global set. Let $x_{j}$ be a random example. Excluding $x_{j}$ from $C$ leaves the rest of the global set, $C^{{}^{\prime}}=\left\{x_{1},x_{2},\cdots,x_{n}\right\}\setminus\left\{x_{j}\right\}$. To compare $C^{{}^{\prime}}$ with $x_{j}$, we would normally need to compare $n-1$ pairs. However, one-to-one comparison is not necessary in this case. Instead, we compare $x_{j}$ with $C^{{}^{\prime}}$ as a whole by replacing $C^{{}^{\prime}}$ with its mean feature representation $\bar{C^{{}^{\prime}}}$. The similarity $S$ between $x_{j}$ and $\bar{C^{{}^{\prime}}}$ is expected to have a lower bound $\gamma$:

$$ \lim S(x_{j}, \bar{C^{{}^{\prime}}}) \to \gamma, \tag{4} $$

where $S$ is a similarity function and $\gamma\rightarrow 0$ under cosine measurement. On the other hand, the rest of the global set $C^{{}^{\prime}}$ still contains a certain number of examples that belong to the same class as $x_{j}$. We can therefore select $m$ examples from $C^{{}^{\prime}}$ to obtain $D=\left\{x_{1}^{m},x_{2}^{m},\cdots,x_{m}^{m}\right\}\subseteq C^{{}^{\prime}}$, where $D$ contains the $m$ $\left(m\ll n\right)$ examples that are most similar to $x_{j}$. We likewise replace $D$ with its mean feature representation $\bar{D}$. The similarity $S$ between $x_{j}$ and $\bar{D}$ is expected to have an upper bound $\delta$:

$$ \lim S(x_{j}, \bar{D}) \to \delta, \tag{5} $$

where $\delta\rightarrow 1$ under cosine measurement.

Statistically speaking, the variation between $x_{j}$ and $\bar{C^{{}^{\prime}}}$ should be large enough, and the variation between $x_{j}$ and $\bar{D}$ should be small enough. Geometrically speaking, $x_{j}$ and $\bar{C^{{}^{\prime}}}$ should be far enough apart, and $x_{j}$ and $\bar{D}$ should be close enough. We use $\bar{C^{{}^{\prime}}}$ to define the heterogeneous case (i.e., different classes) for example $x_{j}$, and $\bar{D}$ to define the homologous case (i.e., same class). The ideal state is not fully attainable, since $C^{{}^{\prime}}$ still contains a certain number of examples whose class is the same as that of $x_{j}$. Their impact is limited from the global perspective, however, so the exclusivity intuitively exists.
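As a rough illustration of the two reference points, the sketch below builds $\bar{C^{{}^{\prime}}}$ and $\bar{D}$ for one example of a small toy set and compares their cosine similarity to $x_{j}$. The toy data, the use of cosine similarity to pick the $m$ most similar examples, and the function names are assumptions made for this sketch, not details confirmed by the paper.

```python
import math


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def exclusivity_references(data, j, m):
    """Return (mean of C', mean of D) for example j.

    C' is the data set without x_j; D holds the m members of C' that are
    most similar to x_j (cosine similarity is an assumption of this sketch).
    """
    x_j = data[j]
    rest = [x for i, x in enumerate(data) if i != j]
    c_bar = mean_vector(rest)
    nearest = sorted(rest, key=lambda x: cosine(x_j, x), reverse=True)[:m]
    d_bar = mean_vector(nearest)
    return c_bar, d_bar


# Hypothetical toy set: three loose clusters of three examples each.
data = [
    [1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.1], [0.1, 0.9, 0.0], [0.0, 1.0, 0.0],
    [0.0, 0.1, 1.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0],
]

c_bar, d_bar = exclusivity_references(data, j=0, m=2)
s_hetero = cosine(data[0], c_bar)   # similarity to the heterogeneous case
s_homo = cosine(data[0], d_bar)     # similarity to the homologous case
```

On this toy set `s_homo` is larger than `s_hetero`, which matches the intended ordering. Because the toy features are non-negative and `rest` still contains same-cluster examples, `s_hetero` does not reach 0, in line with the remark that the ideal state is not fully attainable.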

2.2 Exclusivity Enhanced Autoencoder

The conventional AE focuses on reconstruction in the encoder-decoder phase. For a dataset of $n$ examples, similar to Eq. (3), the objective can be written as:

$$ L_{a} = \sum_{i=1}^{n} \left\| x_{i} - \hat{x}_{i} \right\|_{2}^{2}. \tag{6} $$

This objective ignores some inherent relations of the data and easily causes overfitting. Inspired by the concept of exclusivity, we find the heterogeneous case $\bar{C^{{}^{\prime}(i)}}$ and the homologous case $\bar{D^{(i)}}$ for each example $x_{i}$ in the original feature space. We then turn to the latent feature space and force $h_{i}$ to move away from $f_{e}(\bar{C^{{}^{\prime}(i)}})$ and close to $f_{e}(\bar{D^{(i)}})$, respectively, where $f_{e}(\cdot)$ is the encoder that projects an element to the latent feature space.

To this end, we first introduce an auxiliary function $\Omega(\cdot)$ that activates a vector $\varepsilon$ dimension-wise:

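$$ \Omega(\varepsilon): \begin{cases} \varepsilon_{i} & \text{if } \varepsilon_{i} \geq 0 \\ 0 & \text{if } \varepsilon_{i} < 0 \end{cases}, \tag{7} $$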

where each dimension $\varepsilon_{i}$ of the vector $\varepsilon$ is activated separately. Then, to force the latent feature representations of the data away from their heterogeneous case, we introduce a new constraint $L_{h}^{(1)}$ to maximize the variation between examples and their heterogeneous case:

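$$ L_{h}^{(1)} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - h_{i}] \cdot h_{i}}{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - h_{i}] \, \| h_{i} \|} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - f_{e}(x_{i})] \cdot f_{e}(x_{i})}{\Omega[f_{e}(\bar{C^{{}^{\prime}(i)}}) - f_{e}(x_{i})] \, \| f_{e}(x_{i}) \|}. \tag{8} $$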

We can further minimize the variation between examples and their homologous case by considering another constraint $L_{h}^{(2)}$ :

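$$ L_{h}^{(2)} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{D^{(i)}}) - h_{i}] \cdot h_{i}}{\Omega[f_{e}(\bar{D^{(i)}}) - h_{i}] \, \| h_{i} \|} = \sum_{i=1}^{n} \frac{\Omega[f_{e}(\bar{D^{(i)}}) - f_{e}(x_{i})] \cdot f_{e}(x_{i})}{\Omega[f_{e}(\bar{D^{(i)}}) - f_{e}(x_{i})] \, \| f_{e}(x_{i}) \|}. \tag{9} $$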

For these two functions, $L_{h}^{(1)}$ has a lower bound close to 0, and $L_{h}^{(2)}$ has an upper bound close to 1. To jointly optimize these two terms with the conventional AE objective, we subtract $L_{h}^{(2)}$ from 1 and combine the two terms:

$$ L_{h} = L_{h}^{(1)} + (1 - L_{h}^{(2)}). \tag{10} $$

Finally, the unified objective of our Exclusivity Enhanced Autoencoder (EE-AE) can be described as:

$$ L = L_{a} + \lambda [L_{h}^{(1)} + (1 - L_{h}^{(2)})], \tag{11} $$

where we jointly minimize this objective during training. $L_{a}$ is the conventional AE reconstruction loss, $L_{h}^{(1)}$ is the exclusivity constraint that maximizes the variation between examples and their heterogeneous case, and $(1-L_{h}^{(2)})$ is the exclusivity constraint that minimizes the variation between examples and their homologous case. $\lambda$ is a hyper-parameter that controls the balance between the conventional reconstruction loss and the added constraints.
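The way the three terms combine can be sketched in a few lines. This sketch reads each summand of $L_{h}^{(1)}$ and $L_{h}^{(2)}$ as the cosine similarity between $\Omega[\cdot]$ and $h_{i}$, which is an interpretation suggested by the cosine measurement and the stated bounds rather than something the text spells out. The latent codes and encoded references are passed in as plain lists, and all function names are hypothetical.

```python
import math


def omega(vec):
    # Dimension-wise activation: keep non-negative entries, zero the rest.
    return [v if v >= 0 else 0.0 for v in vec]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def exclusivity_term(refs, codes):
    """Sum over i of cos(Omega[ref_i - h_i], h_i).

    refs[i] stands for the encoded heterogeneous or homologous reference of
    example i; reading the ratio as a cosine is an assumption of this sketch.
    """
    total = 0.0
    for ref, h in zip(refs, codes):
        diff = omega([r - c for r, c in zip(ref, h)])
        total += cosine(diff, h)
    return total


def ee_ae_objective(l_a, hetero_refs, homo_refs, codes, lam):
    l_h1 = exclusivity_term(hetero_refs, codes)   # pushed towards 0
    l_h2 = exclusivity_term(homo_refs, codes)     # pushed towards 1
    return l_a + lam * (l_h1 + (1.0 - l_h2))


# Hypothetical single-example check with 2-dimensional latent codes.
codes = [[1.0, 0.0]]
hetero_refs = [[0.0, 1.0]]   # activated difference is orthogonal to the code
homo_refs = [[2.0, 0.0]]     # activated difference is aligned with the code
total = ee_ae_objective(0.5, hetero_refs, homo_refs, codes, lam=7.0)
```

In this idealized case both exclusivity terms vanish from the penalty and `total` reduces to the reconstruction loss. A training implementation would compute the same quantities on tensors with automatic differentiation.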

2.3 Stacked Autoencoders Structure

Conventional stacked autoencoders (AEs) apply the stacking to each single AE in turn [6, 12]. Each time, the latent feature representation $h$ of the current AE is fed as input to the next AE. This process is repeated a certain number of times to build the stacked structure.

Recently, Yosinski et al. [13] experimentally quantified the generality versus specificity of each layer in DNNs and showed that the feature representation of an example normally goes through three phases, specific $\rightarrow$ general $\rightarrow$ specific, from the input to the output layers, the deeper the more general. Inspired by this analysis, we propose to fully consider the connection of different layers from the decoders. Specifically, we stack not only the encoders but also the decoders. Then, in the final stacked network, we further fine-tune the whole network to achieve better cooperation between the different AE layers (Fig. 2). This can be regarded as a weight initialization trial. We constrain the weight variance ratio of layers as $1-\eta\leq\frac{\left\|W_{F}\right\|_{p.1}}{\left\|{W_{F}}^{\prime}\right\|_{p.1}}\leq 1+\eta$ , where $W_{F}$ and ${W_{F}}^{\prime}$ are the original and fine-tuned weights, respectively, $\left\|\cdot\right\|_{p.1}$ is the $l_{p}$ norm on a vector, and $\eta$ is a hyper-parameter that controls the weight variance ratio of layers.
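The ratio constraint on fine-tuned weights can be checked with a small helper. The sketch assumes the weights of a layer are flattened into a single vector and that $p=2$; the text does not fix $p$, and the function names are hypothetical.

```python
def lp_norm(values, p=2):
    """l_p norm of a flattened weight vector."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def ratio_within_bounds(w_original, w_finetuned, eta, p=2):
    """Check 1 - eta <= ||W_F|| / ||W_F'|| <= 1 + eta and return the ratio."""
    ratio = lp_norm(w_original, p) / lp_norm(w_finetuned, p)
    return (1.0 - eta) <= ratio <= (1.0 + eta), ratio


# Hypothetical flattened weights before and after fine-tuning.
w_original = [3.0, 4.0]     # l_2 norm 5
w_finetuned = [6.0, 8.0]    # l_2 norm 10

ok_loose, ratio = ratio_within_bounds(w_original, w_finetuned, eta=0.6)
ok_tight, _ = ratio_within_bounds(w_original, w_finetuned, eta=0.2)
```

Here the ratio is 0.5, so the constraint holds for `eta=0.6` and fails for `eta=0.2`. How a violated constraint is enforced during fine-tuning (for example by rescaling or by rejecting the update) is not specified in the text and is left open here.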

3 Experiments

We compare our proposed Exclusivity Enhanced Autoencoder (EE-AE) with several state-of-the-art methods on two widely used datasets, COIL100 [14] and MNIST [15]. The experimental results and analysis are discussed for each dataset in turn.

3.1 Results on COIL100

Experiment Setup: COIL100 contains 7200 color images of 100 objects. Images of each object are taken at pose intervals of 5 degrees, corresponding to 72 poses per object. We convert these images to grayscale and resize them to 32x32 pixels. We randomly select 10 images of each object to form the training set, and the remaining images form the testing set. For the training set, we also include the horizontally mirrored images. We use our proposed EE-AE for unsupervised feature learning. Recognition is performed by a nearest neighbor classifier whose inputs are the feature representations learned by EE-AE. This process is repeated 10 times and we report the average recognition results. We also split the dataset into subsets of 20, 40, 60, and 80 objects and compare the recognition accuracy on these subsets and on the whole set. We compare our proposed approach with 9 existing representative methods (Table 1). All competitors and our EE-AE are run under the same settings and tuned to achieve their best performance. For all AE based methods, we adopt the same network architecture and report results on the learned feature representations of the last stacked hidden embedding layer.
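The recognition step of this protocol can be sketched as a nearest neighbor classifier over feature vectors. The sketch assumes squared Euclidean distance, which the text does not specify, and uses made-up 2-dimensional features in place of the learned EE-AE representations.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_neighbor_predict(train_features, train_labels, query):
    """Return the label of the training feature closest to the query."""
    best = min(range(len(train_features)),
               key=lambda i: squared_distance(train_features[i], query))
    return train_labels[best]


def accuracy(train_features, train_labels, test_features, test_labels):
    """Recognition accuracy in percent."""
    correct = sum(
        nearest_neighbor_predict(train_features, train_labels, q) == y
        for q, y in zip(test_features, test_labels)
    )
    return 100.0 * correct / len(test_labels)


# Hypothetical features standing in for learned representations.
train_features = [[0.0, 0.0], [1.0, 1.0]]
train_labels = ["a", "b"]
test_features = [[0.1, 0.2], [0.9, 0.8], [0.6, 0.6]]
test_labels = ["a", "b", "a"]

acc = accuracy(train_features, train_labels, test_features, test_labels)
```

In the protocol described above, this evaluation would be repeated over 10 random train/test splits and the accuracies averaged.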

Results & Analysis: The results are shown in Table 1. Our proposed EE-AE outperforms all competitors on every subset (20, 40, 60, and 80 objects) and on the whole set (100 objects), with an average accuracy of 94.03% against 92.41% for the strongest competitor, SSA-AE. We also observe that the AE based methods generally perform better than the traditional methods, which highlights the merit of the AE for unsupervised feature learning. We further evaluate the sensitivity to three hyper-parameters: (1) $\lambda$, which controls the balance between the conventional reconstruction loss and the exclusivity constraints in Eq. (11); (2) the number $m$ of most similar examples used when finding the homologous case $\bar{D}$ for each example; and (3) the number of stacked AEs (denoted $s$ for simplicity). The evaluation is conducted on the 20-object subset of COIL100. When evaluating the sensitivity to one hyper-parameter, we fix the others at their best values. Fig. 3 shows the results. As $\lambda$ grows, the performance of our model first improves and then degrades; the optimal $\lambda$ is around 7. As $m$ grows, the performance improves and then becomes relatively stable; the optimal $m$ is around 6. For $s$, the optimal number is 3.

3.2 Results on MNIST

Experiment Setup: MNIST [15] contains 60,000 training images and 10,000 testing images. To better evaluate the learning ability of our proposed EE-AE, we consider two scenarios: (1) $1$-set: the whole set; (2) $\frac{1}{6}$-set: only 10,000 (out of 60,000) training images and 10,000 testing images. We train EE-AE and the classifier under each scenario. EE-AE is used for unsupervised feature learning. With the learned feature representations, we train an SVM classifier for recognition. For the SVM, we adopt the 1-versus-1 setting [21, 22] for multi-class classification. The purpose of our experiments on MNIST is to quantitatively evaluate the effectiveness of our model.
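Two mechanical parts of this setup, drawing the $\frac{1}{6}$-set and enumerating the binary problems of a 1-versus-1 SVM, can be sketched as follows. Uniform random sampling with a fixed seed is an assumption, since the text does not say how the 10,000 training images are chosen, and the helper names are hypothetical.

```python
import random
from itertools import combinations


def subsample_indices(n_total, n_keep, seed=0):
    """Pick n_keep distinct training indices out of n_total (assumed uniform)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), n_keep))


def one_vs_one_pairs(classes):
    """List every unordered class pair; 1-versus-1 trains one binary SVM per pair."""
    return list(combinations(classes, 2))


subset = subsample_indices(60000, 10000)   # indices of the 1/6-set training images
pairs = one_vs_one_pairs(range(10))        # the ten digit classes of MNIST
```

For ten classes this yields 45 binary classifiers, and a test image is typically assigned to the class that wins the most pairwise votes.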

Results & Analysis: The results are shown in Table 2. On the $1$-set scenario, our model performs comparably to the competitors. On the $\frac{1}{6}$-set, however, our model outperforms them by a clear margin (up to 95.23% at $\eta=0.6$, against 91.04% for SDAE). This shows that our proposed EE-AE has a significantly better ability to discover and extract useful information from data, even when the data volume is relatively small, which results in better generalization.

3.3 Environment and Implementation

We use PyTorch to implement our experiments. For the AEs, the encoder is built from convolution, batch normalization, ReLU, and max-pooling layers. The convolution layers have 3 settings: 16 channels, 3x3 kernel, stride=1, and padding=1; 32 channels, 3x3 kernel, stride=1, and padding=0; and 64 channels, 3x3 kernel, stride=1, and padding=0. All max-pooling layers use a 2x2 kernel. For the decoder, we apply de-convolutional layers, which exactly reverse the convolutions, to obtain up-sampled feature maps, together with batch normalization, ReLU, and max-unpooling layers. We append an embedding layer with 128-dimensional features on top of the hidden codes. Stochastic gradient descent (SGD) is used for optimization.
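To see what these settings imply for feature-map sizes, the sketch below applies the standard output-size formula for convolution and pooling. It assumes that each convolution is followed by one 2x2 max-pooling layer with stride 2 and that the input is a 32x32 grayscale image as in the COIL100 setup; the layer ordering is an assumption, not something the text confirms.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(size, conv_settings, pool_kernel=2):
    """Track (channels, height, width) after each conv + max-pool block."""
    shapes = []
    for channels, kernel, stride, padding in conv_settings:
        size = conv_out(size, kernel, stride, padding)     # convolution
        size = conv_out(size, pool_kernel, pool_kernel)    # 2x2 max-pooling
        shapes.append((channels, size, size))
    return shapes


# (channels, kernel, stride, padding) for the three convolution settings.
CONV_SETTINGS = [(16, 3, 1, 1), (32, 3, 1, 0), (64, 3, 1, 0)]

shapes = encoder_shapes(32, CONV_SETTINGS)
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions the spatial size shrinks from 32 to 16, then 7, then 2, so the final feature map has 64 x 2 x 2 = 256 values feeding the 128-dimensional embedding layer. A 28x28 MNIST input ends at the same 64 x 2 x 2 shape.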

4 Conclusion and Future Work

In this paper, we propose a novel autoencoder based unsupervised feature learning approach called Exclusivity Enhanced Autoencoder (EE-AE). We use exclusivity constraints to cooperate with feature extraction within the AE. Our model better handles the statistical and geometrical dependence of data and yields a more robust and discriminative feature representation. Experiments on COIL100 and MNIST verify the effectiveness of our model. In the future, we plan to investigate more efficient techniques and apply the exclusivity concept to larger datasets.

Bibliography

[1] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015.
[2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018, vol. 3, p. 6.
[3] David H Wolpert, “The status of supervised learning science circa 1994: the search for a consensus,” in The mathematics of generalization, pp. 1–10. CRC Press, 2018.
[4] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe, “Unsupervised learning of depth and ego-motion from video,” in CVPR, 2017, vol. 2, p. 7.
[5] Pierre Baldi, “Autoencoders, unsupervised learning, and deep architectures,” in Proceedings of ICML workshop on unsupervised and transfer learning, 2012, pp. 37–49.
[6] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[7] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep learning, vol. 1, MIT Press Cambridge, 2016.
[8] Jun Xu, Lei Xiang, Qingshan Liu, Hannah Gilmore, Jianzhong Wu, Jinghai Tang, and Anant Madabhushi, “Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images,” IEEE Transactions on Medical Imaging, vol. 35, no. 1, pp. 119–130, 2016.