Adversarial Variational Embedding for Robust Semi-supervised Learning

Xiang Zhang; Lina Yao; Feng Yuan

arXiv:1905.02361·cs.LG·May 9, 2019

TL;DR

This paper introduces a novel adversarial variational embedding framework that combines VAE and GAN to improve semi-supervised learning by producing exclusive latent codes and meaningful data generation.

Contribution

It proposes AVAE, a new framework that leverages VAE++ and GAN to enhance semi-supervised classification with more exclusive latent representations and better data generation control.

Findings

01

Outperforms state-of-the-art semi-supervised models on four real-world datasets.

02

Produces more exclusive and meaningful latent codes for classification.

03

Enhances the quality and control of generated data.

Tables (4)

Table 1. Overall comparison of semi-supervised classification accuracy (%) on activity recognition. All baselines and our approach use the same dataset and share the same experimental settings for each specific application.

Dataset	Rate (%)	Algorithm-related State-of-the-art				Application-related State-of-the-art				Ablation Study			Ours
		M2	AAE	LVAE	ADGM	(Chen et al., 2018)	(Lara et al., 2012)	(Guo et al., 2016)	(Zhang et al., 2018)	VAE ($\mu$)	VAE	VAE++	AVAE
Activity Recognition (PAMAP2)	20	64.83 $\pm$ 0.16	63.67 $\pm$ 0.23	69.82 $\pm$ 0.69	67.31 $\pm$ 0.45	72.31 $\pm$ 0.16	70.95 $\pm$ 0.08	67.31 $\pm$ 0.14	76.68 $\pm$ 0.31	58.43 $\pm$ 0.13	76.51 $\pm$ 0.53	78.12 $\pm$ 0.55	78.63 $\pm$ 0.38
	40	68.92 $\pm$ 0.23	76.83 $\pm$ 0.25	76.43 $\pm$ 0.19	78.21 $\pm$ 0.38	80.51 $\pm$ 0.21	75.38 $\pm$ 0.12	77.28 $\pm$ 0.21	80.15 $\pm$ 0.16	62.74 $\pm$ 0.12	78.78 $\pm$ 0.22	80.88 $\pm$ 0.38	81.37 $\pm$ 0.29
	60	72.35 $\pm$ 0.21	77.39 $\pm$ 0.19	78.69 $\pm$ 0.27	79.34 $\pm$ 0.29	80.29 $\pm$ 0.21	76.89 $\pm$ 0.05	79.69 $\pm$ 0.15	82.49 $\pm$ 0.33	67.85 $\pm$ 0.08	79.63 $\pm$ 0.29	81.94 $\pm$ 0.19	84.91 $\pm$ 0.17
	80	75.88 $\pm$ 0.35	78.28 $\pm$ 0.11	81.41 $\pm$ 0.23	80.38 $\pm$ 0.16	82.12 $\pm$ 0.16	79.95 $\pm$ 0.18	81.65 $\pm$ 0.09	83.56 $\pm$ 0.11	73.43 $\pm$ 0.06	81.75 $\pm$ 0.17	82.08 $\pm$ 0.26	85.56 $\pm$ 0.21
	100	77.59 $\pm$ 0.17	80.79 $\pm$ 0.14	84.39 $\pm$ 0.18	83.66 $\pm$ 0.16	83.64 $\pm$ 0.12	81.96 $\pm$ 0.11	82.38 $\pm$ 0.13	84.59 $\pm$ 0.24	76.85 $\pm$ 0.00	82.37 $\pm$ 0.25	83.29 $\pm$ 0.18	86.41 $\pm$ 0.06

Table 2. Overall comparison of semi-supervised classification accuracy (%) on neurological diagnosis.

Dataset	Rate (%)	Algorithm-related State-of-the-art				Application-related State-of-the-art				Ablation Study			Ours
		M2	AAE	LVAE	ADGM	(Ziyabari et al., 2017)	(Harati et al., 2015)	(Schirrmeister et al., 2017)	(Goodwin and Harabagiu, 2017)	VAE ($\mu$)	VAE	VAE++	AVAE
Neurological Diagnosis (TUH)	20	71.28 $\pm$ 0.16	80.13 $\pm$ 0.95	82.31 $\pm$ 0.19	86.32 $\pm$ 0.12	87.66 $\pm$ 0.23	86.38 $\pm$ 0.36	82.19 $\pm$ 0.24	86.33 $\pm$ 0.21	80.58 $\pm$ 0.69	86.37 $\pm$ 0.24	0.86 $\pm$ 0.53	93.69 $\pm$ 0.16
	40	75.32 $\pm$ 0.16	82.95 $\pm$ 0.26	84.38 $\pm$ 0.16	86.99 $\pm$ 0.05	89.25 $\pm$ 0.19	91.58 $\pm$ 0.35	84.21 $\pm$ 0.08	89.25 $\pm$ 0.34	81.35 $\pm$ 0.24	89.69 $\pm$ 0.27	91.28 $\pm$ 0.25	94.32 $\pm$ 0.28
	60	76.32 $\pm$ 0.29	86.21 $\pm$ 0.52	87.51 $\pm$ 0.26	87.65 $\pm$ 0.16	91.28 $\pm$ 0.37	92.58 $\pm$ 0.26	85.36 $\pm$ 0.32	90.38 $\pm$ 0.24	82.59 $\pm$ 0.63	90.58 $\pm$ 0.27	92.87 $\pm$ 0.31	95.21 $\pm$ 0.21
	80	79.65 $\pm$ 0.37	88.53 $\pm$ 0.28	89.56 $\pm$ 0.25	88.05 $\pm$ 0.12	92.59 $\pm$ 0.26	93.25 $\pm$ 0.31	85.16 $\pm$ 0.24	91.59 $\pm$ 0.16	83.21 $\pm$ 0.21	91.69 $\pm$ 0.35	93.96 $\pm$ 0.28	97.86 $\pm$ 0.26
	100	82.59 $\pm$ 0.31	89.58 $\pm$ 0.25	90.25 $\pm$ 0.21	88.65 $\pm$ 0.26	93.32 $\pm$ 0.18	94.29 $\pm$ 0.25	86.42 $\pm$ 0.26	92.4 $\pm$ 0.25	84.21 $\pm$ 0.65	92.38 $\pm$ 0.41	94.65 $\pm$ 0.24	98.13 $\pm$ 0.32

Table 3. Overall comparison of semi-supervised classification accuracy (%) on image classification.

Dataset	Rate (%)	Algorithm-related State-of-the-art				Application-related State-of-the-art				Ablation Study			Ours
		M2	AAE	LVAE	ADGM	(Odena, 2016)	(Springenberg, 2016)	(Weston et al., 2012)	(Miyato et al., 2018)	VAE ($\mu$)	VAE	VAE++	AVAE
Image Classification (MNIST)	20	93.22 $\pm$ 0.62	90.25 $\pm$ 0.25	93.25 $\pm$ 0.26	89.61 $\pm$ 0.27	95.23 $\pm$ 0.34	94.25 $\pm$ 0.13	94.58 $\pm$ 0.25	92.96 $\pm$ 0.28	91.58 $\pm$ 0.24	92.31 $\pm$ 0.53	93.59 $\pm$ 0.31	95.12 $\pm$ 0.19
	40	93.25 $\pm$ 0.34	93.21 $\pm$ 0.23	93.28 $\pm$ 0.46	91.58 $\pm$ 0.25	95.27 $\pm$ 0.53	95.56 $\pm$ 0.08	95.21 $\pm$ 0.26	93.21 $\pm$ 0.56	93.65 $\pm$ 0.21	94.21 $\pm$ 0.19	94.68 $\pm$ 0.28	96.43 $\pm$ 0.35
	60	96.24 $\pm$ 0.51	96.35 $\pm$ 0.27	95.34 $\pm$ 0.21	93.21 $\pm$ 0.34	96.38 $\pm$ 0.22	96.54 $\pm$ 0.08	96.48 $\pm$ 0.32	96.28 $\pm$ 0.57	94.89 $\pm$ 0.21	95.34 $\pm$ 0.14	96.42 $\pm$ 0.25	97.21 $\pm$ 0.21
	80	98.19 $\pm$ 0.25	95.32 $\pm$ 0.37	96.11 $\pm$ 0.52	95.01 $\pm$ 0.15	97.82 $\pm$ 0.11	97.21 $\pm$ 0.13	97.86 $\pm$ 0.34	97.63 $\pm$ 0.15	96.78 $\pm$ 0.25	97.63 $\pm$ 0.15	98.71 $\pm$ 0.16	99.79 $\pm$ 0.12
	100	98.65 $\pm$ 0.21	98.25 $\pm$ 0.61	96.35 $\pm$ 0.26	95.38 $\pm$ 0.82	99.21 $\pm$ 0.26	98.64 $\pm$ 0.27	99.06 $\pm$ 0.22	98.53 $\pm$ 0.17	97.41 $\pm$ 0.18	98.35 $\pm$ 0.09	99.67 $\pm$ 0.23	99.85 $\pm$ 0.11

Table 4. Overall comparison of semi-supervised classification accuracy (%) on recommender system.

Dataset	Rate (%)	Algorithm-related State-of-the-art				Application-related State-of-the-art				Ablation Study			Ours
		M2	AAE	LVAE	ADGM	(Pazzani and Billsus, 2007)	(Rendle, 2012)	(He and Chua, 2017)	(Chen et al., 2017)	VAE ($\mu$)	VAE	VAE++	AVAE
Recommender System (Yelp)	20	66.42 $\pm$ 0.17	58.27 $\pm$ 0.35	66.35 $\pm$ 0.36	54.27 $\pm$ 0.38	40.55 $\pm$ 0.27	47.58 $\pm$ 0.36	65.99 $\pm$ 0.62	66.21 $\pm$ 0.24	64.28 $\pm$ 0.12	64.39 $\pm$ 0.62	65.58 $\pm$ 0.37	70.19 $\pm$ 0.87
	40	69.36 $\pm$ 0.37	61.55 $\pm$ 0.62	68.16 $\pm$ 0.24	55.35 $\pm$ 0.26	40.28 $\pm$ 0.32	48.65 $\pm$ 0.27	67.53 $\pm$ 0.31	66.59 $\pm$ 0.29	64.37 $\pm$ 0.25	67.23 $\pm$ 0.95	71.05 $\pm$ 0.29	72.21 $\pm$ 0.35
	60	72.58 $\pm$ 0.19	62.15 $\pm$ 0.39	68.59 $\pm$ 0.93	57.63 $\pm$ 0.23	42.15 $\pm$ 0.16	50.95 $\pm$ 0.24	66.58 $\pm$ 0.29	67.95 $\pm$ 0.38	67.56 $\pm$ 0.35	69.58 $\pm$ 0.37	72.19 $\pm$ 0.62	75.34 $\pm$ 0.35
	80	72.39 $\pm$ 0.64	62.89 $\pm$ 0.62	74.28 $\pm$ 0.37	58.34 $\pm$ 0.15	43.21 $\pm$ 0.15	52.15 $\pm$ 0.38	67.65 $\pm$ 0.31	68.23 $\pm$ 0.15	69.25 $\pm$ 0.18	71.39 $\pm$ 0.56	73.21 $\pm$ 0.58	78.54 $\pm$ 0.38
	100	74.58 $\pm$ 0.62	63.51 $\pm$ 0.86	72.59 $\pm$ 0.36	59.58 $\pm$ 0.23	45.86 $\pm$ 0.22	54.10 $\pm$ 0.12	68.03 $\pm$ 0.17	70.61 $\pm$ 0.25	73.24 $\pm$ 0.68	73.28 $\pm$ 0.69	76.53 $\pm$ 0.28	79.38 $\pm$ 0.59

Equations

$$\bm{z}_{s}=\mu_{\bm{x}}+\sigma_{\bm{x}}*\bm{\varepsilon}$$

$$\bar{p}(\bm{z}_{s})=\mathcal{N}(\bm{z}_{s}\mid 0,\bm{I})$$

$$p_{\bm{\theta}_{en}}(\bm{z}_{I}\mid\bm{x})=f(\bm{z}_{I};\bm{x},\bm{\theta}_{en})$$

$$\bm{z}_{s}=\mu(\bm{z}_{I})+\sigma(\bm{z}_{I})*\bm{\varepsilon}$$

$$p_{\bm{\theta}_{de}}(\bm{x}^{\prime}\mid\bm{z}_{s})=f^{\prime}(\bm{x}^{\prime};\bm{z}_{s},\bm{\theta}_{de})$$

$$\mathcal{L}_{\scaleto{VAE}{3pt}}=-\mathbb{E}_{\bm{z}_{s}\sim p_{\bm{\theta}_{en}}(\bm{z}_{s}\mid\bm{x})}[\log p_{\bm{\theta}_{de}}(\bm{x}^{\prime}\mid\bm{z}_{s})]+KL(p_{\bm{\theta}_{en}}(\bm{z}_{s}\mid\bm{x})\,||\,\bar{p}(\bm{z}_{s}))$$

$$\bm{z}_{s}\leftarrow\mu(\bm{z}_{I}),\sigma(\bm{z}_{I}),\bm{\varepsilon}$$

$$\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}\mid\bm{z}_{\scaleto{GAN}{3pt}})=h(\bm{y}_{\scaleto{GAN}{3pt}};\bm{z}_{\scaleto{GAN}{3pt}},\bm{\varphi})$$

$$\mathcal{L}_{\scaleto{label}{5pt}}=-\mathbb{E}_{\bm{z}_{\scaleto{GAN}{3pt}},\bm{y}_{\scaleto{GAN}{3pt}}\sim p_{j}}[\log\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}\mid\bm{z}_{\scaleto{GAN}{3pt}},\bm{y}_{\scaleto{GAN}{3pt}}<K+1)]$$

$$\mathcal{L}_{\scaleto{unlabel}{5pt}}=-\mathbb{E}_{\bm{z}_{\scaleto{GAN}{3pt}}\sim p_{\bm{\theta}_{en}}(\bm{z}_{I}\mid\bm{x})}[\log(1-\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}=K+1\mid\bm{z}_{\scaleto{GAN}{3pt}}))]-\mathbb{E}_{\bm{z}_{\scaleto{GAN}{3pt}}\sim p_{\bm{\theta}_{en}}(\bm{z}_{s}\mid\bm{x})}[\log(\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}=K+1\mid\bm{z}_{\scaleto{GAN}{3pt}}))]$$

$$\mathcal{L}_{\scaleto{GAN}{3pt}}=w_{1}*flag*\mathcal{L}_{\scaleto{label}{5pt}}+w_{2}*(1-flag)*\mathcal{L}_{\scaleto{unlabel}{5pt}}$$

$$flag=\begin{cases}1&\text{labelled}\\0&\text{unlabelled}\end{cases}$$

Code & Models

Repositories

xiangzhang1015/Adversarial-Variational-Semi-supervised-Learning (TensorFlow, official)

Taxonomy

Methods: Convolution

Full text

Adversarial Variational Embedding for Robust Semi-supervised Learning

Xiang Zhang, Lina Yao, Feng Yuan

University of New South Wales, Sydney, Australia

(2019)

Abstract.

Semi-supervised learning leverages unlabelled data when labelled data is difficult or expensive to acquire. Deep generative models (e.g., the Variational Autoencoder (VAE)) and semi-supervised Generative Adversarial Networks (GANs) have recently shown promising performance in semi-supervised classification owing to their excellent discriminative representation ability. However, the latent code learned by the traditional VAE is not exclusive (repeatable) for a specific input sample, which limits its classification performance. In particular, the learned latent representation depends on a non-exclusive component which is stochastically sampled from the prior distribution. Moreover, semi-supervised GAN models generate data from a pre-defined distribution (e.g., Gaussian noise) which is independent of the input data distribution; this may obstruct convergence and makes it difficult to control the distribution of the generated data. To address these issues, we propose a novel Adversarial Variational Embedding (AVAE) framework for robust and effective semi-supervised learning that leverages the advantages of both GAN as a high-quality generative model and VAE as a posterior distribution learner. The proposed approach first produces an exclusive latent code with a model which we call VAE++ and, meanwhile, provides a meaningful prior distribution for the generator of the GAN. The proposed approach is evaluated on four different real-world applications, and we show that our method outperforms the state-of-the-art models, which confirms that the combination of VAE++ and GAN can provide significant improvements in semi-supervised classification.

Keywords: Variational Autoencoder, Generative Adversarial Networks, Representation Learning, Semi-supervised Classification

Published in: The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’19), August 4–8, 2019, Anchorage, AK, USA. DOI: 10.1145/3292500.3330966. ISBN: 978-1-4503-6201-6/19/08.

1. Introduction

Semi-supervised learning is one of the fundamental challenges in artificial intelligence; it considers the problem in which only a subset of the observations has corresponding class labels (Ghasedi Dizaji et al., 2018). This problem is of immense practical interest in a broad range of application scenarios, such as abnormal activity detection (Yao et al., 2016), neurological diagnosis (Peng et al., 2016), computer vision (Gong et al., 2016), and recommender systems (Yang et al., 2017). In these scenarios, it is easy to obtain abundant observations but expensive to gather the corresponding class labels. Among existing approaches, Variational Autoencoders (VAEs) (Kingma et al., 2014; Sønderby et al., 2016) have recently achieved state-of-the-art performance in semi-supervised learning.

VAE models provide a general framework for learning latent representations: a model is specified by a joint probability distribution both over the data and over latent random variables, and a representation can be found by considering the posterior on latent variables given specific data (Narayanaswamy et al., 2017). The learned representations can not only be used for generation but also for classification. For instance, VAE provides a latent feature representation of the input observations, where a separate classifier can be thereafter trained using these representations. The high quality of latent representations enables accurate classification, even with a limited number of labels. A number of studies have applied VAE in semi-supervised classification in the computer vision area (Kingma et al., 2014; Makhzani et al., 2015; Narayanaswamy et al., 2017).

1.1. Motivation

Why we propose VAE++. One major challenge faced by existing VAE-based semi-supervised methods is that the latent representations are stochastically sampled from the prior distribution instead of being directly rendered from the explicit observations. In particular, as shown in Figure 1(a), the learned latent representations $\bm{z}_{s}$ are randomly sampled from a multivariate Gaussian distribution (see Equation 1). Thus, for a specific sample, the corresponding latent representation is not exclusive (i.e., the representation is not repeatable across different runs), which makes it inappropriate for classification. To solve this problem, we propose a new latent variable $\bm{z}_{I}$ (see Figure 1(b)) which is learned directly from the input data. The exclusive latent code $\bm{z}_{I}$ is guaranteed to remain invariant for a specific input $\bm{x}$ across different runs. The modified VAE is called VAE++. In addition, although the learned expectation $\bm{\mu}$ is exclusive, it contains only part of the information in the input observations, which is not enough to represent them in a classification task (for the same reason, $\bm{\sigma}$ cannot be used as the exclusive code). The performance of $\bm{z}_{I}$, $\bm{z}_{s}$ and $\bm{\mu}$ is compared in Section 4.

Why VAE++ needs the semi-supervised GAN. In the proposed VAE++, it is necessary to reduce the information loss between the two latent representations $\bm{z}_{I}$ and $\bm{z}_{s}$ to guarantee the learned $\bm{z}_{I}$ is representative. The commonly used constraints between two distributions (e.g., Kullback-Leibler divergence) can only utilize the information of the observations but fail to exploit the information of labels. In this paper, we use a novel approach to take advantage of both unlabelled and labelled data by jointly training the VAE++ and a semi-supervised GAN.

Why semi-supervised GAN needs the VAE++. GAN-based approaches (Odena, 2016; Salimans et al., 2016) have recently shown promising results in semi-supervised learning. The semi-supervised GAN trains a generative model and a discriminator with inputs belonging to one of $K$ classes. Unlike the regular GAN, the semi-supervised GAN requires the discriminator to make a $(K+1)$-class prediction, with the extra class corresponding to the generated fake samples. In this way, the properties of the observations can be used to improve decision boundaries and allow for more accurate classification than using the labelled data alone. However, the generated samples are drawn from a pre-defined distribution (e.g., Gaussian noise) (Cao et al., 2018). Such pre-defined prior distributions are often independent of the input data distribution, which may obstruct convergence and cannot guarantee the distribution of the generated data. This drawback can be remedied by coupling the GAN with VAE++, which provides a meaningful prior distribution that represents the distribution of the input data.

We introduce a recipe for semi-supervised learning, a robust Adversarial Variational Embedding (AVAE) framework, which learns exclusive latent representations by combining VAE and semi-supervised GAN. To utilize the generative ability of GAN and the distribution-approximating power of VAE, the proposed approach employs GAN to encourage VAE to learn a more robust and informative latent code. We present the framework in the context of VAE, adding a new exclusive code in the latent space which is directly rendered from the data space. The generator in VAE++ also works as the generator of the GAN. Both the exclusive code (marked as real) and the generated representation (marked as fake) are fed into the discriminator in order to force them to have similar distributions (Mirza and Osindero, 2014).

1.2. Contribution

Although a small set of models combining VAE and GAN have been previously explored, they all focus on the generation perspective. To our knowledge, ours is among the first works to focus on classification by combining VAE and GAN. We make the following contributions:

•

We present a novel semi-supervised Adversarial Variational Embedding approach to harness the deep generative model and generative adversarial networks collectively under a trainable unified framework. The reproducible code and datasets are publicly available at https://github.com/xiangzhang1015/Adversarial-Variational-Semi-supervised-Learning.

•

We propose a new structure, VAE++, to automatically learn an exclusive latent code for accurate classification. A novel semi-supervised GAN, which exploits both the unlabelled data distribution and categorical information, is proposed to gear with the VAE++ in order to encourage the VAE++ to learn a more effective and robust exclusive code.

•

We evaluate the proposed approach on four real-world applications (activity recognition, neurological diagnosis, image classification, and recommender system). The results demonstrate that our approach outperforms all the compared state-of-the-art methods.

2. Related Work

A host of studies have applied VAE to semi-supervised learning (Kingma et al., 2014; Narayanaswamy et al., 2017; Sønderby et al., 2016; Maaløe et al., 2016). Kingma et al. (2014) explore semi-supervised learning with deep generative models by building two VAE-based deep generative models for latent representation extraction. Afterward, Narayanaswamy et al. (2017) attempt to learn disentangled representations that encode distinct aspects of the data into separate variables. However, in all the existing semi-supervised VAE models, the learned representations depend not only on the posterior distribution but also on the latent random variables. It is therefore necessary to learn an exclusive code which is related only to the posterior distribution for the specified data.

Another recently emerging semi-supervised method is the semi-supervised GAN (Odena, 2016; Springenberg, 2016; Radford et al., 2016). SGAN (Odena, 2016) extends GAN to the semi-supervised context by forcing the discriminator network to output class labels. CatGAN (Springenberg, 2016) modifies the objective function to take into account the mutual information between observed examples and their predicted class distributions. In the above methods, the generator draws on simple factored continuous noise, which is independent of the input data distribution, for generation. As a result, the generator may use the noise in a highly entangled way, making it more difficult to control the distribution of the generated data. Conditional GAN (Mirza and Osindero, 2014) and InfoGAN (Chen et al., 2016) address this drawback by utilizing external information (e.g., categorical information) as a restriction, but both focus on generation or supervised classification and offer limited help in semi-supervised classification.

Despite the few works attempting to combine VAE and GAN (Larsen et al., 2015; Makhzani et al., 2015; Bao et al., 2017), most of them focus on generation instead of classification. For example, the VAE/GAN (Larsen et al., 2015) and CVAE-GAN (Bao et al., 2017) employ the standard VAE to share the encoder with the generator of GAN in order to generate new observations. For semi-supervised classification, we care about the latent code instead of the observations. The Adversarial Autoencoder (AAE (Makhzani et al., 2015)) integrates VAE and GAN but only employs GAN to replace KL divergence as a penalty to impose a prior distribution on the latent code, which is a totally different direction from our work.

Summary. Unlike the existing VAE- and GAN-based studies, the proposed model 1) focuses on semi-supervised classification instead of generation; 2) attempts to learn an exclusive latent representation instead of a stochastically sampled representation; 3) works on improving the latent space instead of the data space. Moreover, the semi-supervised GAN in our work partly adopts the improved GAN (Salimans et al., 2016), but there are a number of differences: 1) (Salimans et al., 2016) adopts the semi-supervised strategy for classification, while we adopt this strategy as a constraint to reduce information loss in the transformation from $z_{I}$ to $z_{s}$ in order to force the proposed AVAE to learn a more robust and effective latent code; 2) (Salimans et al., 2016) employs the discriminator of the GAN as the classifier, while we adopt an extra non-parametric classifier since the former performs poorly in our case (on the PAMAP2 dataset, for example, (Salimans et al., 2016) and our model achieve accuracies of around 65% and 85%, respectively); 3) we employ a weighted loss function to balance the significance of the unlabelled and labelled observations.

3. Methodology

Suppose the input dataset has two subsets, one of which contains labelled samples while the other contains unlabelled samples. In the former subset, the observations appear as pairs $(\bm{X}^{L},\bm{Y}^{L})=\{(\bm{x}_{1}^{L},\bm{y}_{1}),(\bm{x}_{2}^{L},\bm{y}_{2}),\cdots,(\bm{x}_{N_{L}}^{L},\bm{y}_{N_{L}})\}$ with the $i$-th observation $\bm{x}_{i}^{L}\in\mathbb{R}^{M}$ and the corresponding one-hot label $\bm{y}_{i}\in\mathbb{R}^{K}$, where $K$ denotes the number of classes. $N_{L}$ denotes the number of labelled observations while $M$ denotes the number of observation dimensions. In the latter subset, only the observations $\bm{X}^{U}=\{\bm{x}_{1}^{U},\bm{x}_{2}^{U},\cdots,\bm{x}_{N_{U}}^{U}\}$ are available, and $N_{U}$ denotes the number of unlabelled observations $\bm{x}_{i}^{U}\in\mathbb{R}^{M}$. The total data size $N$ equals the sum of $N_{L}$ and $N_{U}$. For effective classification, we attempt to learn a latent representation which is rich in distinguishable information. The learned representations can then be fed into a classifier for recognition. In this paper, we mainly focus on learning the latent code.
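The notation above can be illustrated with a small sketch that builds $(\bm{X}^{L},\bm{Y}^{L})$ and $\bm{X}^{U}$ from a fully labelled toy array. The function name `split_by_label_rate`, the `rate` argument, and the toy data are hypothetical and are not taken from the paper; this only shows one plausible way of arranging the two subsets.

```python
import numpy as np

def split_by_label_rate(X, y, rate, K, seed=0):
    """Keep labels for a fraction `rate` of the samples; treat the rest as unlabelled.

    Hypothetical helper: the paper does not prescribe how the subsets are drawn.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_labelled = int(round(rate * len(X)))      # N_L
    lab, unlab = idx[:n_labelled], idx[n_labelled:]
    Y_L = np.eye(K)[y[lab]]                     # one-hot labels in R^K
    return X[lab], Y_L, X[unlab]

# Toy data: N = 10 observations with M = 4 dimensions and K = 3 classes.
X = np.arange(40, dtype=float).reshape(10, 4)
y = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
X_L, Y_L, X_U = split_by_label_rate(X, y, rate=0.2, K=3)
```

Here $N_{L}=2$ and $N_{U}=8$, so $N=N_{L}+N_{U}=10$.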

In semi-supervised learning, due to the lack of labelled observations, it is important to learn the latent variable distribution from the unlabelled observations (for simplicity, we omit the index and directly use the variable $\bm{x}$ to denote observations). Thus, we need to build an encoder that provides an embedding or feature representation which allows accurate classification even with limited labelled observations.

3.1. VAE++

The VAE is demonstrated to provide a latent feature representation for semi-supervised learning (Kingma et al., 2014; Narayanaswamy et al., 2017), compared to a linear embedding method or a regular autoencoder. The VAE maps the input observation $\bm{x}$ to a compressed code $\bm{z}_{s}$ , and decodes it to reconstruct the observation. The latent representation is calculated through the reparameterization trick (Kingma and Welling, 2013):

$$\bm{z}_{s}=\mu_{\bm{x}}+\sigma_{\bm{x}}*\bm{\varepsilon}\tag{1}$$

with $\bm{\varepsilon}\sim\mathcal{N}(0,1)$ to impose the posterior distribution of the latent code on $p(\bm{z}_{s}|x)\sim\mathcal{N}(\mu_{\bm{x}},\sigma^{2}_{\bm{x}})$ . $\mu_{\bm{x}}$ and $\sigma_{\bm{x}}$ denote the expectation and standard deviation of the posterior distribution of $\bm{z}_{s}$ , which are learned from $\bm{x}$ . For the efficient generation and reconstruction, VAE imposes the code $\bm{z}_{s}$ on a prior Gaussian distribution:

$$\bar{p}(\bm{z}_{s})=\mathcal{N}(\bm{z}_{s}\mid 0,\bm{I})$$

Through minimizing the reconstruction error between $\bm{x}$ and $\bm{x}^{\prime}$ and restricting the distribution of $\bm{z}_{s}$ to approximate the prior distribution $\bar{p}(\bm{z}_{s})$ , VAE is supposed to learn the representative latent code $\bm{z}_{s}$ which can be used for classification or generation.

Due to its strong feature representation ability, VAE has been employed for feature extraction and semi-supervised learning (Abbasnejad et al., 2017; Xu et al., 2017; Walker et al., 2016; Narayanaswamy et al., 2017). However, one limitation of the standard VAE is that the learned latent code $\bm{z}_{s}=g(\mu_{\bm{x}},\sigma_{\bm{x}},\bm{\varepsilon})$, as shown in Equation (1), is not exclusive. In other words, for a specific observation $\bm{x}$ and a fixed embedding model $p(\bm{z}_{s}|\bm{x})$, the corresponding latent code $\bm{z}_{s}$ is not exclusive because it contains a stochastic variable $\bm{\varepsilon}$ which is randomly sampled from the prior distribution $\bar{p}(\bm{z}_{s})$. For instance, with a pre-trained, fixed VAE encoder, the same input $\bm{x}$ will lead to a different $\bm{z}_{s}$ in each run. At a high level, the latent code $\bm{z}_{s}$ is determined by two factors: the prior distribution of the observation $\bar{p}(\bm{x})$, which affects $\bm{z}_{s}$ through the learned $\mu_{\bm{x}}$ and $\sigma_{\bm{x}}$, and the stochastically sampled $\bm{\varepsilon}$. The stochastically sampled latent code is unstable and will corrupt the features for classification. Furthermore, the posterior distribution of $\bm{z}_{s}$ is forced to approximate the manually set prior distribution (commonly the standard Gaussian distribution), which inevitably leads to information loss.
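The non-exclusivity described above can be seen in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the values of `mu_x` and `sigma_x` are hypothetical stand-ins for the outputs of a fixed, pre-trained encoder on one observation.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z_s = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Hypothetical encoder outputs for one fixed observation x.
mu_x = np.array([0.5, -1.0, 2.0])
sigma_x = np.array([0.1, 0.3, 0.2])

rng = np.random.default_rng(0)
z_first = sample_latent(mu_x, sigma_x, rng)
z_second = sample_latent(mu_x, sigma_x, rng)

# Same x and same encoder, yet two different codes: z_s is not exclusive.
codes_differ = not np.allclose(z_first, z_second)
```

The two draws share `mu_x` and `sigma_x` but differ through $\bm{\varepsilon}$, which is exactly the instability the text attributes to $\bm{z}_{s}$.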

To sidestep the above-mentioned issues, we propose a novel VAE++ model to learn an exclusive latent code $\bm{z}_{I}$. VAE++ contains three key components: the encoder, the generator, and the decoder (see Figure 2). The encoder transforms the observation into a latent code $\bm{z}_{I}\in\mathbb{R}^{D}$ which is directly determined by the input $\bm{x}$, where $D$ denotes the dimension of $\bm{z}_{I}$. The encoder is described by:

$$p_{\bm{\theta}_{en}}(\bm{z}_{I}\mid\bm{x})=f(\bm{z}_{I};\bm{x},\bm{\theta}_{en})$$

where $f$ denotes a non-linear transformation while $\bm{\theta}_{en}$ denotes the encoder parameters. The non-linear transformation $f$ is generally chosen as a deep neural network for its excellent non-linear approximation ability. Then, in the generator, we compute the expectation $\mu(\bm{z}_{I})$ and the standard deviation $\sigma(\bm{z}_{I})$ from the latent code $\bm{z}_{I}$ and update Equation (1). The generated variable $\bm{z}_{s}$ can be calculated by:

$$\bm{z}_{s}=\mu(\bm{z}_{I})+\sigma(\bm{z}_{I})*\bm{\varepsilon}$$

At last, the decoder is employed to reconstruct the sample:

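$$p_{\bm{\theta}_{de}}(\bm{x}^{\prime}\mid\bm{z}_{s})=f^{\prime}(\bm{x}^{\prime};\bm{z}_{s},\bm{\theta}_{de})$$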

where $f^{\prime}$ denotes another non-linear rendering, called decoder, with parameters $\bm{\theta}_{de}$ and $\bm{x}^{\prime}$ denotes the reconstructed observation.
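The three VAE++ components (encoder, generator, decoder) can be sketched as a forward pass. This is an illustrative toy under stated assumptions: single linear layers stand in for the deep networks $f$ and $f^{\prime}$, the dimensions `M` and `D` and all weight matrices are hypothetical, and the exponential used to keep $\sigma$ positive is a common choice rather than something the text specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
M, D = 8, 4  # hypothetical observation and latent dimensions

# Hypothetical single-layer parameters; the paper uses deep networks.
W_en = rng.standard_normal((M, D)) * 0.1     # encoder f
W_mu = rng.standard_normal((D, D)) * 0.1     # generator: mu(z_I)
W_sigma = rng.standard_normal((D, D)) * 0.1  # generator: log sigma(z_I)
W_de = rng.standard_normal((D, M)) * 0.1     # decoder f'

def encode(x):
    """Deterministic exclusive code z_I = f(x; theta_en)."""
    return np.tanh(x @ W_en)

def generate(z_I, noise_rng):
    """Stochastic code z_s = mu(z_I) + sigma(z_I) * eps."""
    mu = z_I @ W_mu
    sigma = np.exp(z_I @ W_sigma)  # exp keeps sigma positive (assumption)
    eps = noise_rng.standard_normal(z_I.shape)
    return mu + sigma * eps

def decode(z_s):
    """Reconstruction x' = f'(z_s; theta_de)."""
    return z_s @ W_de

x = rng.standard_normal((1, M))
z_I_a, z_I_b = encode(x), encode(x)                 # two "runs" on the same x
z_s_a = generate(z_I_a, np.random.default_rng(1))
z_s_b = generate(z_I_b, np.random.default_rng(2))
x_rec = decode(z_s_a)
```

Encoding the same `x` twice gives identical $\bm{z}_{I}$, while the two generated $\bm{z}_{s}$ differ because each run draws its own $\bm{\varepsilon}$.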

The loss function of VAE++ can be calculated by:

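$$\mathcal{L}_{\scaleto{VAE}{3pt}}=-\mathbb{E}_{\bm{z}_{s}\sim p_{\bm{\theta}_{en}}(\bm{z}_{s}\mid\bm{x})}[\log p_{\bm{\theta}_{de}}(\bm{x}^{\prime}\mid\bm{z}_{s})]+KL(p_{\bm{\theta}_{en}}(\bm{z}_{s}\mid\bm{x})\,||\,\bar{p}(\bm{z}_{s}))$$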

The first component is the reconstruction loss, which equals the expected negative log-likelihood of the observation. This term encourages the decoder to reconstruct the observation $\bm{x}$ from the sampled code $\bm{z}_{s}$, which follows a Gaussian distribution. A lower reconstruction error indicates that the encoder has learned a better latent representation. The second component is the Kullback-Leibler divergence, which measures the distance between the prior distribution of the latent code $\bar{p}(\bm{z}_{s})$ and the posterior distribution $p(\bm{z}_{s}|\bm{x})$. This divergence reflects the information loss when we use $p(\bm{z}_{s}|\bm{x})$ to represent $\bar{p}(\bm{z}_{s})$.
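The two loss components can be sketched numerically as follows. Two assumptions are made here that the text does not state: the decoder likelihood is taken to be a unit-variance Gaussian, so the negative log-likelihood reduces to half the squared error up to a constant, and the posterior is taken to be a diagonal Gaussian, for which the KL divergence to $\mathcal{N}(0,\bm{I})$ has the standard closed form used below. The function name `vae_loss` is hypothetical.

```python
import numpy as np

def vae_loss(x, x_rec, mu, sigma):
    """Reconstruction term plus KL(N(mu, sigma^2) || N(0, I)), averaged over the batch."""
    # Assumption: unit-variance Gaussian likelihood, so -log p(x'|z_s)
    # is half the squared error up to an additive constant.
    recon = 0.5 * np.sum((x - x_rec) ** 2, axis=1)
    # Closed-form KL divergence between a diagonal Gaussian and N(0, I).
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma), axis=1)
    return float(np.mean(recon + kl))

x = np.array([[1.0, 2.0]])
# Perfect reconstruction and a posterior equal to the prior: both terms vanish.
loss_perfect = vae_loss(x, x, mu=np.zeros((1, 3)), sigma=np.ones((1, 3)))
# Shifting the posterior mean away from 0 is penalised by the KL term alone.
loss_shifted = vae_loss(x, x, mu=np.ones((1, 3)), sigma=np.ones((1, 3)))
```

With a three-dimensional code and unit mean shift, the KL term contributes $0.5\times 3=1.5$, which is the information-loss penalty the second component describes.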

In the latent space of the novel VAE++, there are two compressed informative codes, $\bm{z}_{I}$ and $\bm{z}_{s}$. The former is a direct encoding of $\bm{x}$, whilst the latter is stochastically sampled from the posterior distribution, which makes the former more suitable for classification. Therefore, we choose $\bm{z}_{I}$ as the compressed latent code in VAE++ instead of the $\bm{z}_{s}$ used in the standard VAE.

From equation (2), we can observe that the expectation and standard deviation of $\bm{z}_{s}$ and $\bm{z}_{I}$ are invariant. In particular, for a specific sample $\bm{x}_{i}$ , the corresponding $\bm{z}_{si}$ and $\bm{z}_{Ii}$ have the same statistical characteristics. Thus, we have

$$\bm{z}_{s}\leftarrow\mu(\bm{z}_{I}),\sigma(\bm{z}_{I}),\bm{\varepsilon}$$

which indicates that the generated $\bm{z}_{s}$ is affected by both the distribution (or statistic characteristics) of $\bm{z}_{I}$ and the prior distribution $\bar{p}(\bm{z}_{s})$ (or $\bm{\varepsilon}$ ). In summary, the $\bm{z}_{s}$ inherits the statistical characteristics of $\bm{z}_{I}$ .
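As a short worked check of this statement, take the generator relation $\bm{z}_{s}=\mu(\bm{z}_{I})+\sigma(\bm{z}_{I})*\bm{\varepsilon}$ and assume that $\bm{\varepsilon}\sim\mathcal{N}(0,\bm{I})$ is drawn independently of $\bm{z}_{I}$. The conditional mean and variance of the generated code then follow directly:

$$\mathbb{E}[\bm{z}_{s}\mid\bm{z}_{I}]=\mu(\bm{z}_{I})+\sigma(\bm{z}_{I})*\mathbb{E}[\bm{\varepsilon}]=\mu(\bm{z}_{I}),\qquad\mathrm{Var}[\bm{z}_{s}\mid\bm{z}_{I}]=\sigma^{2}(\bm{z}_{I})*\mathrm{Var}[\bm{\varepsilon}]=\sigma^{2}(\bm{z}_{I})$$

Under that assumption, the first two moments of $\bm{z}_{s}$ are fixed by $\bm{z}_{I}$ alone, and $\bm{\varepsilon}$ contributes only the random draw around them.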

3.2. Adversarial Variational Embedding

One significant sufficient condition of a well-trained VAE++ is less information loss in the transformation from $\bm{z}_{I}$ to $\bm{z}_{s}$ to guarantee the learned $\bm{z}_{I}$ is representative. As mentioned before, the information in $\bm{z}_{s}$ is partly inherited from $\bm{z}_{I}$ and the other part is randomly sampled from the prior distribution $\bar{p}(\bm{z}_{s})$ . Since the conditional distribution $p_{\bm{\theta}_{en}}(\bm{z}_{I}|\bm{x})$ has a better description of the input observation $\bm{x}$ , we attempt to increase the proportion of inherited part and decrease the proportion of stochastically sampled part.

As shown in Figure 2, in the proposed AVAE the generator $\bm{G}$ generates $\bm{z}_{s}$ based on the joint probability $p(\mu,\sigma,\bar{p}(\bm{z}_{s}))$ instead of the noise used in the standard GAN. The $\bm{z}_{s}$ is regarded as ‘fake’ while $\bm{z}_{I}$ is marked as ‘real’. Specifically, for the labelled observations $\bm{x}^{L}$, VAE++ encodes the input to the latent code $\bm{z}^{L}_{I}\in\mathbb{R}^{D}$ and generates $\bm{z}^{L}_{s}\in\mathbb{R}^{D}$; similarly, for unlabelled observations $\bm{x}^{U}$, it encodes $\bm{z}^{U}_{I}\in\mathbb{R}^{D}$ and generates $\bm{z}^{U}_{s}\in\mathbb{R}^{D}$. To exploit the information of the labels, we extend $\bm{y}\in\mathbb{R}^{K}$, which has $K$ possible classes, to $\bm{y}_{\scaleto{GAN}{3pt}}\in\mathbb{R}^{K+1}$, which has $K+1$ possible classes, by regarding the generated fake samples $\bm{z}_{s}$ as the $(K+1)$-th class (Salimans et al., 2016; Odena, 2016). In the VAE++, the unspecified $\bm{z}_{s}$ denotes both $\bm{z}^{L}_{s}$ and $\bm{z}^{U}_{s}$ whenever it does not matter whether the observation is labelled or not. This rule also applies to $\bm{z}_{I}$. Similarly, we use $\bm{z}_{\scaleto{GAN}{3pt}}$ to denote the input of the discriminator $\bm{D}$, which contains both $\bm{z}_{I}$ and $\bm{z}_{s}$. The discriminator can be described by

[TABLE]

where $\bm{\varphi}$ denotes the parameters of $\bm{D}$ while $h$ denotes the non-linear transformation, which is implemented by a Convolutional Neural Network (CNN) (Krizhevsky et al., 2012) in this paper. Therefore, we can use $\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}=K+1|\bm{z}_{\scaleto{GAN}{3pt}})$ to supply the probability that $\bm{z}_{\scaleto{GAN}{3pt}}$ is fake (from $\bm{z}_{s}$ ) and use $\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}|\bm{z}_{\scaleto{GAN}{3pt}},\bm{y}_{\scaleto{GAN}{3pt}}<K+1)$ to supply the probability that $\bm{z}_{\scaleto{GAN}{3pt}}$ is real (from $\bm{z}_{I}$ ) and is correctly classified.
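As an illustration only, the two probabilities above can be read off a $(K+1)$-way softmax output. The sketch below assumes the last logit corresponds to the fake class; the function names `softmax` and `real_fake_split` are hypothetical and do not come from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def real_fake_split(logits):
    """Split a (K+1)-way output into fake/real probabilities.

    Assumption: the last entry is the (K+1)-th, i.e. 'fake', class.
    """
    probs = softmax(logits)
    p_fake = probs[-1]                  # q(y = K+1 | z)
    p_real = 1.0 - p_fake               # probability the input is real
    # Class distribution conditioned on the input being real: q(y | z, y < K+1)
    class_given_real = [p / p_real for p in probs[:-1]]
    return p_fake, p_real, class_given_real

# Example with K = 3 real classes plus one fake class
p_fake, p_real, class_given_real = real_fake_split([2.0, 0.5, 0.1, -1.0])
```

The conditional class distribution sums to one by construction, so it can be compared directly with a one-hot label over the $K$ real classes.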

For the labelled input, same as supervised learning, the discriminator is supposed to not only tell whether the input $\bm{z}_{\scaleto{GAN}{3pt}}$ is real or generated, but also classify it into the correct class. Therefore, we have the supervised loss function

[TABLE]

where $p_{j}$ denotes the joint probability.

For the unlabelled input, we only require the discriminator to perform a binary classification: the input is real or fake. The former probability can be calculated by $1-\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}=K+1|\bm{z}_{\scaleto{GAN}{3pt}})$ whilst the latter can be calculated by $\bm{q}_{\bm{\varphi}}(\bm{y}_{\scaleto{GAN}{3pt}}=K+1|\bm{z}_{\scaleto{GAN}{3pt}})$ . Thus, the unsupervised loss function:

[TABLE]

In summary, the final loss function of the discriminator

[TABLE]

where $w_{1},w_{2}$ are weights and $flag$ is a switch function

[TABLE]

If the specific observation is labelled, we calculate the labelled loss function. Otherwise, we calculate the unlabelled loss function. From empirical experiments, we observe that $\mathcal{L}_{\scaleto{unlabel}{5pt}}$ converges much more easily than $\mathcal{L}_{\scaleto{label}{5pt}}$ and that the real/fake classification accuracy is much higher than the $K$ -class classification accuracy. To encourage the optimizer to focus on the labelled loss, which is more difficult to converge, we set $w_{1}=0.9$ and $w_{2}=0.1$ .
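The switching and weighting described above can be sketched as follows. This is a minimal sketch under stated assumptions: the exact loss equations are not reproduced in this text, so the cross-entropy forms below follow the usual $(K+1)$-class formulation of Salimans et al. (2016), and the per-sample combination $w_{1}\cdot flag\cdot\mathcal{L}_{label}+w_{2}\cdot(1-flag)\cdot\mathcal{L}_{unlabel}$ is assumed. The name `discriminator_loss` is hypothetical.

```python
import math

W_LABEL, W_UNLABEL = 0.9, 0.1  # w1 and w2 as stated in the text

def discriminator_loss(probs, is_real, label=None):
    """Per-sample discriminator loss (assumed form, not the paper's exact equation).

    probs   : (K+1)-way softmax output; the last entry is the fake class
    is_real : True if the input is z_I, False if it is z_s
    label   : class index in [0, K) for labelled inputs, None otherwise
    """
    fake_index = len(probs) - 1
    flag = 1 if label is not None else 0       # switch: labelled vs unlabelled
    if flag:
        # Labelled: (K+1)-way cross-entropy; fake samples target the fake class
        target = label if is_real else fake_index
        return W_LABEL * -math.log(probs[target])
    # Unlabelled: binary real/fake cross-entropy
    p_fake = probs[fake_index]
    p_correct = (1.0 - p_fake) if is_real else p_fake
    return W_UNLABEL * -math.log(p_correct)

probs = [0.7, 0.2, 0.1]                        # K = 2 real classes + fake
loss_labelled = discriminator_loss(probs, is_real=True, label=0)
loss_unlabelled = discriminator_loss(probs, is_real=True)
```

With these weights, a labelled sample contributes nine times the weight of an unlabelled one for the same cross-entropy value, which matches the stated intent of emphasising the harder labelled term.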

The discriminator receives $\bm{z}_{\scaleto{GAN}{3pt}}$ as input and extracts the dependencies through CNN filters. Two fully connected layers follow the convolutional layer for dimension reduction. Finally, a softmax layer operates on the low-dimensional features to produce the normalized categorical probability distribution, which is output as $\bm{y}_{\scaleto{GAN}{3pt}}$ .

The overall aim of the proposed AVAE (as described in Algorithm 1) is to train a robust and effective semi-supervised embedding method. The VAE loss $\mathcal{L}_{\scaleto{VAE}{3pt}}$ and the GAN loss $\mathcal{L}_{\scaleto{GAN}{3pt}}$ are minimized simultaneously by the Adam optimizer. After convergence, the compressed representative code $\bm{z}_{I}$ is fed into a non-parametric nearest neighbor classifier for recognition.

4. Experiments

In this section, we demonstrate the effectiveness and validity of the proposed method on four applications.

4.1. Activity Recognition

4.1.1. Experiment Setup

Activity recognition is an important area in data mining. We evaluate our approach on the well-known PAMAP2 dataset (Fida et al., 2015), which was collected from 9 participants (8 males and 1 female) aged $27\pm 3$ . We select the 5 most commonly used activities (cycling, standing, walking, lying, and running, labelled from 0 to 4) as a subset for evaluation. For each subject, there are 12,000 instances. The activity is measured by 3 Inertial Measurement Units (IMUs) attached to the participant’s wrist, chest, and outer ankle. Each IMU provides 13 dimensions: two 3-axis accelerometers, one 3-axis gyroscope, one 3-axis magnetometer, and one thermometer. The experiments follow a Leave-One-Subject-Out strategy to ensure practicality.

The time window is set to 10 with 50% overlap. The dataset is split into a training set (80%) and a testing set (20%). For semi-supervised learning, the training set contains both labelled and unlabelled observations. We introduce a term called ‘supervision rate’ as a handle on the relative weight between the supervised and unsupervised terms. For the given number of labelled observations $N^{L}$ and the number of unlabelled observations $N^{U}$ , the supervision rate $\gamma$ is defined by $N^{L}/(N^{L}+N^{U})$ .
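The windowing and the supervision rate can be sketched as below. This is an illustrative sketch, not the authors' preprocessing code: it assumes that a 50% overlap with a window of 10 means a step of 5 samples, and the helper names `sliding_windows` and `supervision_rate` are hypothetical.

```python
def sliding_windows(samples, size=10, overlap=0.5):
    """Cut a sequence into fixed-size windows with fractional overlap.

    Assumption: 50% overlap with size 10 means consecutive windows
    start 5 samples apart; a trailing partial window is dropped.
    """
    step = max(1, int(size * (1.0 - overlap)))
    last_start = len(samples) - size
    return [samples[i:i + size] for i in range(0, last_start + 1, step)]

def supervision_rate(n_labelled, n_unlabelled):
    """gamma = N^L / (N^L + N^U), as defined in the text."""
    return n_labelled / (n_labelled + n_unlabelled)

# 12,000 instances per subject, as stated for PAMAP2
windows = sliding_windows(list(range(12000)))
gamma = supervision_rate(n_labelled=60, n_unlabelled=40)
```

Under these assumptions, one subject's 12,000 instances yield 2,399 windows, and 60 labelled against 40 unlabelled observations give $\gamma=0.6$.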

4.1.2. Parameter Setting

We introduce the default parameter settings here; the settings in the other applications are the same unless otherwise mentioned. The input observations are first normalized by Z-score normalization and fed to the input layer of the unsupervised VAE++. The number of neurons in the first hidden layer, which is denoted by $\bm{z}_{I}$ , is a quarter of $M$ . The second hidden layer contains 2 components, which represent the expectation and the standard deviation, respectively. The third hidden layer $\bm{z}_{s}$ has the same shape as $\bm{z}_{I}$ . An Adam optimizer with a learning rate of $0.00001$ is employed to minimize the loss function of VAE++.
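The normalization step and the layer widths implied by this description can be sketched as follows. The sketch makes several assumptions that the text does not state: the population standard deviation is used for the Z-score, $M$ is taken to be the input dimension, the expectation and standard deviation components each have the same width as $\bm{z}_{I}$ , and `zscore` and `layer_widths` are hypothetical helper names.

```python
import statistics

def zscore(values):
    """Z-score normalization of one feature column.

    Assumption: population standard deviation; a constant column maps to zeros.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def layer_widths(m):
    """Layer widths for an input of dimension m (assumed meaning of M)."""
    d = m // 4                          # z_I has a quarter of M neurons
    return {"input": m, "z_I": d, "mu": d, "sigma": d, "z_s": d}

normalized = zscore([1.0, 2.0, 3.0])
widths = layer_widths(40)               # 40 is a hypothetical input dimension
```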

After each epoch of VAE++, the first hidden layer $\bm{z}_{I}$ and the third hidden layer $\bm{z}_{s}$ are labelled as ‘real’ and ‘fake’, respectively, and fed to the discriminator $\bm{D}$ . The discriminator contains one convolutional layer followed by two fully-connected layers. A softmax layer obtains the categorical probability before the output layer, which has $K+1$ neurons. The convolutional layer has 10 filters with shape $[2,2]$ and stride size $[1,1]$ . The padding method of the convolutional operation is set to ‘same’ and the activation function is ReLU. The following hidden layer has $M/4$ neurons and the sigmoid activation function. The loss function is optimized by the Adam update rule with a learning rate of $0.0001$ . The objective functions of the VAE++ and the discriminator are trained simultaneously. After the proposed method converges, the semi-supervised learned latent representation $\bm{z}_{I}$ is fed into a supervised non-parametric nearest neighbor classifier with $k=3$ .
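The final classification step can be illustrated with a small nearest neighbor sketch. The text only fixes $k=3$ ; the Euclidean distance, the majority vote, and the function name `knn_predict` are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_codes, train_labels, query, k=3):
    """Classify one latent code by majority vote among its k nearest neighbors.

    Assumption: Euclidean distance between latent codes.
    """
    order = sorted(range(len(train_codes)),
                   key=lambda i: math.dist(train_codes[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy latent codes z_I with two well-separated classes
codes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(codes, labels, [0.2, 0.1])
```

Because the classifier is non-parametric, it adds no trainable parameters on top of the learned embedding, so its accuracy reflects how well the latent codes separate the classes.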

4.1.3. Baselines

To measure the effectiveness of the proposed method, we compare it with a set of competitive state-of-the-art models. The state-of-the-art methods are composed of two categories: algorithm-related and application-related. The former denotes other VAE/GAN based semi-supervised classification algorithms, which are the same for all the applications. The comparison is used to demonstrate our framework has the highest semi-supervised representation learning ability. The latter denotes the state-of-the-art models in each application, which are varied for the different applications. The comparison is used to demonstrate our work is effective in the real-world scenarios.

The algorithm-related semi-supervised learning solutions in our comparison are listed as follows:

• M2. (Kingma et al., 2014) proposes a probabilistic model that describes the data as being generated by a latent class variable in addition to a continuous latent representation.

• Adversarial Autoencoders (AAE). (Makhzani et al., 2015) employs the GAN to perform variational inference by matching the aggregated posterior of the hidden representation of the autoencoder.

• Ladder Variational Autoencoders (LVAE). (Sønderby et al., 2016) proposes an inference model which recursively corrects the generative distribution by a data dependent likelihood.

• Auxiliary Deep Generative Models (ADGM). (Maaløe et al., 2016) extends deep generative models with auxiliary variables, which improves the variational approximation.

We design an ablation study to demonstrate the necessity of each key component of the proposed approach. In the ablation study, we set up four control experiments, each varying a single component of AVAE. We adopt the following four methods to discover the latent representations: 1) VAE ( $\bm{\mu}$ ) with $\bm{\mu}$ as the latent representation; 2) standard VAE ( $\bm{z}_{s}$ as the latent representation); 3) VAE++ ( $\bm{z}_{I}$ as the latent representation); 4) AVAE. The extracted representations are fed into the same classifier for final classification.

The application-related state-of-the-art models on activity recognition are listed here:

• Chen et al. (Chen et al., 2018) adopt an attention mechanism to select the most distinguishable features from the activity signals and send them to a CNN structure for classification.

• Lara et al. (Lara et al., 2012) apply both statistical and structural detector features to discriminate among activities.

• Guo et al. (Guo et al., 2016) exploit the diversity of base classifiers to construct a good ensemble for multimodal activity recognition, and the diversity measure is obtained from both labelled and unlabelled data.

• Zhang et al. (Zhang et al., 2018) combine deep learning and the reinforcement learning scheme to focus on the crucial dimensions of the signals.

4.1.4. Results and Discussion

First, we report the overall performance of all the compared algorithms. From Table 1, we can observe that the proposed approach (AVAE) outperforms all the algorithm-related and application-related state-of-the-art models, illustrating the effectiveness of the latent space in providing robust representations for easier semi-supervised classification. The advantage is demonstrated under all the supervision rates.

Table 1 also shows, through the ablation study, that each component (especially the GAN) contributes to the performance enhancement. Additionally, the proposed AVAE improves accuracy by around $5\%$ and $3\%$ over the standard VAE and the VAE++ (under the 60% supervision rate), respectively. This observation demonstrates that the proposed latent layer $\bm{z}_{I}$ and the adversarial training (between the discriminator and VAE++) encourage the proposed model to learn and refine the informative latent code. Taking the 60% supervision rate as an example, more details of the classification are shown in the confusion matrix (Figure 3(a)) and the ROC curves with AUC scores (Figure 4(a)).

4.2. Neurological Diagnosis

4.2.1. Experiment Setup

EEG signals collected in the unhealthy state differ significantly from those collected in the normal state (Adeli et al., 2007). Epileptic seizure is a common brain disorder that affects about 1% of the population, and its ictal state can be detected by analysing the patient’s EEG. In this application, we evaluate our framework on raw EEG data to diagnose epileptic seizures.

We choose the benchmark dataset TUH (Obeid and Picone, 2016) for epileptic seizure diagnosis. TUH is a neurological seizure dataset of clinical EEG recordings with 22 channels from a 10/20 configuration. The sampling rate is 250 Hz. We select 12,000 samples from each of 18 subjects. Half of the samples are labelled as the epileptic seizure state (labelled as 1) and the remaining samples are labelled as the normal state (labelled as 0). The experiment and parameter settings are the same as in the activity recognition application.

4.2.2. Baselines

The application-related state-of-the-art approaches in neurological diagnosis are listed here:

• Ziyabari et al. (Ziyabari et al., 2017) adopt a hybrid deep learning architecture, including LSTM and stacked denoising Autoencoder, which integrates temporal and spatial context to detect the epileptic seizure.

• Harati et al. (Harati et al., 2015) demonstrate that a variant of the filter bank-based approach, coupled with first and second derivatives, provides a reduction in the overall error rate.

• Schirrmeister et al. (Schirrmeister et al., 2017) attempt to improve the performance of seizure detection by combining deep ConvNets with training strategies such as exponential linear units.

• Goodwin et al. (Goodwin and Harabagiu, 2017) combine an RNN with access to textual data in EEG reports in order to automatically extract word- and report-level features and infer underspecified information from EHRs (electronic health records).

4.2.3. Results and Discussion

From Table 2, we can observe that our approach outperforms all the competitive baselines on the TUH dataset. For instance, under the 60% supervision level, the proposed approach achieves the highest accuracy of 95.21%, an improvement of around 4% over the other methods. The corresponding confusion matrix (Figure 3(b)) and ROC curves (Figure 4(b)) indicate that the normal state is classified more accurately than the seizure state. One possible reason is that the start and end stages of a seizure have symptoms similar to the normal state, which may lead to misclassification.

4.3. Image Classification

4.3.1. Experiment Setup

To evaluate the representation learning ability in images, we test our approach on the benchmark dataset MNIST (http://yann.lecun.com/exdb/mnist/). MNIST contains 60,000 handwritten digit images (50,000 for training and 10,000 for testing) with $28\times 28$ pixels. The labels of this dataset are from 0 to 9, corresponding to the 10 digits.

4.3.2. Parameter Settings

Images are more informative than the data in the other application scenarios. The encoder of AVAE is therefore built by stacking two convolutional layers. The first convolutional layer has 32 filters with shape $[3,3]$ , stride size $[1,1]$ , ‘SAME’ padding, and the ReLU activation function. The following pooling layer has a $[2,2]$ window size, $[2,2]$ stride, and ‘SAME’ padding. The second convolutional layer has 64 filters with shape $[5,5]$ . The remaining parameters of the second convolutional layer and the second pooling layer are the same as those of the first. Similarly, the decoder contains two de-convolutional layers with the same parameter settings.
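To make the encoder geometry concrete, the sketch below traces the feature-map sizes for a $28\times 28$ input. It assumes the TensorFlow-style meaning of ‘SAME’ padding, where the output size is the input size divided by the stride and rounded up; the helper names are hypothetical.

```python
import math

def same_out(size, stride):
    """Output length under 'SAME' padding (assumed: ceil(size / stride))."""
    return math.ceil(size / stride)

def encoder_shapes(height=28, width=28):
    """Trace (name, height, width, channels) through the described encoder."""
    layers = [
        ("conv1", 1, 32),   # 32 filters, [3,3], stride [1,1]
        ("pool1", 2, 32),   # [2,2] window, stride [2,2]
        ("conv2", 1, 64),   # 64 filters, [5,5], stride [1,1]
        ("pool2", 2, 64),   # same pooling settings as pool1
    ]
    shapes = []
    h, w = height, width
    for name, stride, channels in layers:
        h, w = same_out(h, stride), same_out(w, stride)
        shapes.append((name, h, w, channels))
    return shapes

shapes = encoder_shapes()
flat_features = shapes[-1][1] * shapes[-1][2] * shapes[-1][3]
```

Under this assumption, the filter shape does not affect the spatial size; only the pooling strides do, giving a $7\times 7\times 64$ feature map (3,136 values) after the second pooling layer.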

4.3.3. Baselines

We reproduce the following methods under different supervision rates for comparison:

• Odena (Odena, 2016) proposes a semi-supervised GAN (SGAN) by forcing the discriminator network to output class labels.

• Springenberg (Springenberg, 2016) proposes CatGAN to modify the objective function taking into account the mutual information between observation and the prediction distribution.

• Weston et al. (Weston et al., 2012) apply kernel methods for a nonlinear semi-supervised embedding algorithm.

• Miyato et al. (Miyato et al., 2018) propose a regularization method based on virtual adversarial loss: a new measure of local smoothness of the conditional label distribution given the inputs.

4.3.4. Results and Discussion

As shown in Table 3, AVAE outperforms its counterparts by a slight margin at the same supervision level. The confusion matrix and ROC curves are reported in Figure 3(c) and Figure 4(c). The results show that our approach is able to automatically learn the discriminative features by jointly training the VAE++ and the semi-supervised GAN.

4.4. Recommender System

4.4.1. Experiment Setup

We apply our framework on recommender system scenarios, in particular, a restaurant rating prediction task based on the widely used Yelp dataset.

The Yelp Dataset (https://www.yelp.com/dataset) includes 192,609 businesses, 1,637,138 users, and 6,685,900 ratings. Each business has 13 attributes (such as ‘near garage?’ and ‘have valet?’) that describe the quality and convenience of the business. Meanwhile, each business is rated by a series of customers. The ratings range from 1 to 5 and reflect the customers’ degree of satisfaction. Our recommender task takes an unseen business’s attributes as input and predicts the likely ratings from potential customers. If the predicted rating is high enough, the new business will be recommended to the public.

4.4.2. Baselines

We compare our approach with the state-of-the-art recommender system models which exploit the content information of items. Since these methods are used to make rating predictions for each user-item pair, we select those users who have 200 or more ratings in the Yelp dataset, generating a set of 1,111 users. After collecting the predicted ratings for all user-item pairs, we take the average item ratings over the users, which are further rounded to serve as the predicted labels.

• Pazzani et al. (Pazzani and Billsus, 2007) summarize basic content-based recommendation approaches, from which we select the cosine similarity-based nearest neighbour method as our fundamental baseline.

• Rendle (Rendle, 2012) proposes the original implementation of the factorization machine (FM), which is capable of incorporating item features with explicit feedback. We concatenate only the item indication vector and its features after each user indication vector, following the format in (Rendle, 2012).

• He et al. (He and Chua, 2017) enhance the original FM using deep neural networks to learn high-order interactions between different item features.

• Chen et al. (Chen et al., 2017) apply feature- and item-level attention on item features, which is capable of emphasizing the most important features.
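The label construction used for these baselines (average the per-user predicted ratings of an item, then round) can be sketched as follows. The rounding rule is not specified in the text; this sketch assumes round-half-up and clamps the result to the 1 to 5 rating range, and `predicted_label` is a hypothetical name.

```python
import math

def predicted_label(user_ratings, low=1, high=5):
    """Average one item's predicted ratings over users and round to a label.

    Assumptions: round-half-up (Python's built-in round() rounds halves to
    the nearest even integer instead) and clamping to [low, high].
    """
    mean = sum(user_ratings) / len(user_ratings)
    rounded = math.floor(mean + 0.5)
    return min(high, max(low, rounded))

# Predicted ratings for one item from three hypothetical users
label = predicted_label([4.2, 4.9, 5.0])
```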

4.4.3. Results and Discussion

From Table 4, we can observe that our approach outperforms both the competitive semi-supervised algorithms and the state-of-the-art content-based recommender system methods. The rating prediction details can be found in Figure 3(d) and Figure 4(d). The classification performance in the recommender system is not as good as in the other applications. One possible reason is that the attribute data are very sparse. The experimental results illustrate that our approach is effective in recommender system scenarios.

4.5. Further Analysis

4.5.1. Supervision Rate

We conduct extensive experiments to investigate the impact of the supervision rate $\gamma$ . The supervision rate ranges from 20% to 100% in 20% intervals, and each setting is run at least three times with the average accuracy recorded. The overall performance varies with the supervision rate $\gamma$ . From Table 2 to Table 4, it can be seen that the proposed model obtains competitive performance at each supervision level.

4.5.2. Visualization

Figure 5 visualizes the raw data and the learned features on different datasets. The visualization comparison demonstrates the capability of our approach for distinguishable feature learning.

4.5.3. Convergence

Taking PAMAP2 as an example, Figure 6 presents the relationship between the loss function values and the number of epochs. The VAE++ loss includes the reconstruction loss and the KL-divergence, whilst the loss of the discriminator in the GAN includes the labelled loss and the unlabelled loss (with weights 0.9 and 0.1, respectively). We can observe that the proposed method converges well, as it stabilizes in around 200 epochs.

5. Conclusion

In this paper, we present an effective and robust semi-supervised latent representation framework, AVAE, by proposing a modified VAE model and integrating it with generative adversarial networks. The VAE++ and GAN share the same generator. In order to automatically learn the exclusive latent code, in the VAE++, we explore the latent code’s posterior distribution and then stochastically generate a latent representation based on the posterior distribution. The discrepancy between the learned exclusive latent code and the generated latent representation is constrained by the semi-supervised GAN. The latent code of AVAE finally serves as the learned feature for classification. The proposed approach is evaluated on four real-world applications and the results demonstrate the effectiveness and robustness of our model.

Hyper-parameter tuning (not presented in this paper due to space limitations) is required for different datasets in various applications. One future direction is to propose a more generalized framework which is not sensitive to datasets. Moreover, our model still requires adequate labelled training samples for good performance. Handling lower supervision rates, or fully unsupervised learning, is another major goal for the future.

6. Acknowledgement

This research was partially supported by grant ONRG NICOP N62909-19-1-2009.

Bibliography

Abbasnejad et al. (2017) M Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. 2017. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 781–790.
Adeli et al. (2007) Hojjat Adeli, Samanwoy Ghosh-Dastidar, and Nahid Dadmehr. 2007. A wavelet-chaos methodology for analysis of EEGs and EEG subbands to detect seizure and epilepsy. IEEE Transactions on Biomedical Engineering 54, 2 (2007), 205–211.
Bao et al. (2017) Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. 2017. CVAE-GAN: fine-grained image generation through asymmetric training. CoRR, abs/1703.10155 5 (2017).
Cao et al. (2018) Jiezhang Cao, Yong Guo, Qingyao Wu, Chunhua Shen, and Mingkui Tan. 2018. Adversarial Learning with Local Coordinate Coding. The International Conference of Machine Learning (ICML) (2018).
Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In SIGIR. ACM, 335–344.
Chen et al. (2018) Kaixuan Chen, Lina Yao, Xianzhi Wang, Dalin Zhang, Tao Gu, Zhiwen Yu, and Zheng Yang. 2018. Interpretable Parallel Recurrent Neural Networks with Convolutional Attentions for Multi-Modality Activity Modeling. International Joint Conference on Neural Networks (IJCNN) (2018).
Chen et al. (2016) Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems. 2172–2180.