Graph-Embedded Multi-layer Kernel Extreme Learning Machine for One-class Classification or (Graph-Embedded Multi-layer Kernel Ridge Regression for One-class Classification)

Chandan Gautam; Aruna Tiwari; M. Tanveer

arXiv:1904.06491·cs.LG·April 16, 2019


TL;DR

This paper introduces a novel multi-layer graph-embedded kernel ridge regression auto-encoder architecture for one-class classification, effectively detecting outliers using only normal samples, and demonstrates its superiority over existing methods on multiple datasets.

Contribution

It proposes a multi-layer graph-embedded kernel ridge regression auto-encoder framework for OCC, integrating local and global variance-based graph embeddings, and provides four variants outperforming state-of-the-art methods.

Findings

01. Four variants outperform existing OCC classifiers

02. Statistical significance confirmed by Friedman test

03. Effective on 21 benchmark datasets

Abstract

A brain can detect outliers using only normal samples. Similarly, one-class classification (OCC) uses only normal samples to train the model, and the trained model can be used for outlier detection. In this paper, a multi-layer architecture for OCC is proposed by stacking various Graph-Embedded Kernel Ridge Regression (KRR) based Auto-Encoders in a hierarchical fashion. These Auto-Encoders are formulated under two types of Graph-Embedding, namely, local and global variance-based embedding. This Graph-Embedding explores the relationship between samples, and the multiple Auto-Encoder layers project the input features into a new feature space. The last layer of the proposed architecture is a Graph-Embedded regression-based one-class classifier. The Auto-Encoders use an unsupervised approach to learning and the final layer uses semi-supervised (trained by only positive samples and obtained…


Tables

Table 1: Dataset Description

S. No.	Name	#Targets	#Outliers	#Features	#Samples
Financial Credit Approval Datasets
1	Australia(1)	307	383	14	690
2	Australia(2)	383	307	14	690
3	German(1)	700	300	24	1000
4	German(2)	300	700	24	1000
5	Japan(1)	294	357	15	651
6	Japan(2)	357	294	15	651
Medical Disease Datasets
7	Bupa(1)	145	200	6	345
8	Bupa(2)	200	145	6	345
9	Ecoli(1)	143	193	7	336
10	Ecoli(2)	193	143	7	336
11	Heart(1)	160	137	13	297
12	Heart(2)	137	160	13	297
13	Pima(1)	500	268	8	768
14	Pima(2)	268	500	8	768
Miscellaneous Datasets
15	Glass(1)	76	138	9	214
16	Glass(2)	138	76	9	214
17	Iono(1)	225	126	34	351
18	Iono(2)	126	225	34	351
19	Iris(1)	50	100	4	150
20	Iris(2)	50	100	4	150
21	Iris(3)	50	100	4	150

Table 5: $\eta_{f}$ and $\eta_{m}$ of all one-class classifiers in increasing order of $\eta_{f}$ (a lower value of $\eta_{f}$ indicates better performance).

One-class Classifier	$\eta_{f}$	$\eta_{m}$ (%)
GMKOC-CDA_$\theta 1$	4.52	75.10
GMKOC-CDA_$\theta 2$	4.81	75.01
LMKOC-LLE_$\theta 2$	5.19	75.43
LMKOC-LLE_$\theta 1$	6.33	74.75
GKOC-SV	7.00	73.04
OCSVM	7.36	72.22
SVDD	7.64	72.10
GKOC-CV	8.24	72.41
KOC	8.31	71.85
AEKOC	8.95	72.23
LKOC-LE	9.33	71.16
GKOC-CDA	9.48	71.44
KPCA	10.14	69.31
GKOC-LDA	10.67	70.77
LKOC-LLE	12.02	70.14

Table 6: $\eta_{p}$ value over $21$ datasets

One-class Classifier	$\eta_{p}$ (%)
GMKOC-CDA_$\theta 1$	96.33
LMKOC-LLE_$\theta 2$	96.20
GMKOC-CDA_$\theta 2$	96.02
LMKOC-LLE_$\theta 1$	95.44
GKOC-SV	93.41
GKOC-CV	92.82
OCSVM	92.38
SVDD	92.27
AEKOC	92.00
KOC	91.83
GKOC-CDA	91.16
LKOC-LE	90.86
GKOC-LDA	90.33
LKOC-LLE	89.63
KPCA	89.20

Equations

$$D_{ii}^{h}=\sum_{j=1}^{N}V_{ij}^{h}$$

$$v_{ij}^{h}=\exp\left(-\frac{\left\|\bm{\phi_{i}^{h}}-\bm{\phi_{j}^{h}}\right\|_{2}^{2}}{2\sigma^{2}}\right)$$

$$V_{ij}^{h}=\begin{cases}v_{ij}^{h}, & \text{if }\bm{\phi_{j}^{h}}\in\mathcal{N}_{i}^{h}\\ 0, & \text{otherwise}\end{cases}$$

$$\bm{S^{h}}=\bm{\Phi^{h}\mathcal{L}^{h}(\Phi^{h})^{T}}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{KAE}=\frac{1}{2}\left\|\bm{\beta_{a}^{h}}\right\|^{2}+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{a}^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(\beta_{a}^{h})^{T}}(\bm{S^{h}}+\lambda\bm{I})\bm{\beta_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{a}^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned}$$

$$\bm{\beta_{a}^{h}}=\bm{\Phi^{h}W_{a}^{h}}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}(\Phi^{h})^{T}}(\bm{\Phi^{h}\mathcal{L}^{h}(\Phi^{h})^{T}}+\lambda\bm{I})\bm{\Phi^{h}W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{a}^{h})^{T}(\Phi^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}}(\bm{K^{h}\mathcal{L}^{h}K^{h}}+\lambda\bm{K^{h}})\bm{W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{a}^{h})^{T}k_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned}$$

$$\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}}(\bm{K^{h}\mathcal{L}^{h}K^{h}}+\lambda\bm{K^{h}})\bm{W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}-\sum_{i=1}^{N}\bm{\alpha_{i}^{h}}\Big(\bm{(W_{a}^{h})^{T}k_{i}^{h}}-\bm{x_{i}^{h-1}}+\bm{e_{i}^{h}}\Big)$$

$$\frac{\partial\pounds_{LKAE}}{\partial\bm{W_{a}^{h}}}=0\Rightarrow\bm{W_{a}^{h}}=(\bm{\mathcal{L}^{h}K^{h}}+\lambda\bm{I})^{-1}\bm{\alpha^{h}}$$

$$\frac{\partial\pounds_{LKAE}}{\partial\bm{e_{i}^{h}}}=0\Rightarrow\bm{E^{h}}=\frac{1}{C}\bm{\alpha^{h}}$$

$$\frac{\partial\pounds_{LKAE}}{\partial\bm{\alpha_{i}^{h}}}=0\Rightarrow\bm{(W_{a}^{h})^{T}K^{h}}=\bm{X^{h-1}}-\bm{E^{h}}$$

$$\bm{W_{a}^{h}}$$

$$\bm{\beta_{a}^{h}}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(\beta_{o}^{d})^{T}}(\bm{S^{d}}+\lambda\bm{I})\bm{\beta_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{o}^{d})^{T}\phi_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned}$$

$$\bm{\beta_{o}^{d}}=\bm{\Phi^{d}W_{o}^{d}}$$

$$\bm{S^{d}}=\bm{\Phi^{d}\mathcal{L}^{d}(\Phi^{d})^{T}}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}(\Phi^{d})^{T}}(\bm{\Phi^{d}\mathcal{L}^{d}(\Phi^{d})^{T}}+\lambda\bm{I})\bm{\Phi^{d}W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{o}^{d})^{T}(\Phi^{d})^{T}\phi_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned}$$

$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}}(\bm{K^{d}\mathcal{L}^{d}K^{d}}+\lambda\bm{K^{d}})\bm{W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{o}^{d})^{T}k_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned}$$

$$\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}}(\bm{K^{d}\mathcal{L}^{d}K^{d}}+\lambda\bm{K^{d}})\bm{W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}-\sum_{i=1}^{N}\alpha_{i}^{d}\Big(\bm{(W_{o}^{d})^{T}k_{i}^{d}}-r+e_{i}^{d}\Big)$$

$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial\bm{W_{o}^{d}}}=0\Rightarrow\bm{W_{o}^{d}}=(\bm{\mathcal{L}^{d}K^{d}}+\lambda\bm{I})^{-1}\bm{\alpha^{d}}$$

$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial e_{i}^{d}}=0\Rightarrow\bm{E^{d}}=\frac{1}{C}\bm{\alpha^{d}}$$

$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial\alpha_{i}^{d}}=0\Rightarrow\bm{(W_{o}^{d})^{T}K^{d}}=r-\bm{E^{d}}$$


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and ELM · Anomaly Detection Techniques and Applications

Full text


C. Gautam, Indian Institute of Technology Indore, India, email: [email protected]

A. Tiwari, Indian Institute of Technology Indore, India, email: [email protected]

M. Tanveer, Indian Institute of Technology Indore, India, email: [email protected]

Graph-Embedded Multi-layer Kernel Extreme Learning Machine for One-class Classification

or

Graph-Embedded Multi-layer Kernel Ridge Regression for One-class Classification

or

Graph-Embedded Multi-layer Least Square SVM with zero bias for One-class Classification

Chandan Gautam

Aruna Tiwari

M. Tanveer

Abstract

Introduction: A brain can detect outliers using only normal samples. Similarly, one-class classification ($OCC$) uses only normal samples to train the model, and the trained model can be used for outlier detection.

Proposed Method: In this paper, a multi-layer architecture for $OCC$ is proposed by stacking various Graph-Embedded Kernel Ridge Regression ($KRR$) based Auto-Encoders in a hierarchical fashion. These Auto-Encoders are formulated under two types of Graph-Embedding, namely, local and global variance-based embedding. This Graph-Embedding explores the relationship between samples, and the multiple Auto-Encoder layers project the input features into a new feature space. The last layer of the proposed architecture is a Graph-Embedded regression-based one-class classifier. The Auto-Encoders use an unsupervised approach to learning, and the final layer uses a semi-supervised approach (trained on only positive samples, with a closed-form solution).

Experimental Results: The proposed method is experimentally evaluated on $21$ publicly available benchmark datasets. Experimental results verify the effectiveness of the proposed one-class classifiers over $11$ existing state-of-the-art kernel-based one-class classifiers. A Friedman test is also performed to verify the statistical significance of the claimed superiority of the proposed one-class classifiers over the existing state-of-the-art methods.

Conclusion: By using two types of Graph-Embedding, $4$ variants of the Graph-Embedded multi-layer $KRR$-based one-class classifier have been presented in this paper. All $4$ variants performed better than the existing one-class classifiers in terms of the various criteria discussed in this paper. Hence, they can be a viable alternative for the $OCC$ task. In the future, various other types of Auto-Encoders can be explored within the proposed architecture.

Keywords:

One-Class Classification Outlier Detection Kernel Ridge Regression Graph-Embedding Multi-layer

Why three titles? Because three methods, viz., kernel ridge regression (KRR), least squares support vector machine with zero bias (LSSVM(bias=0)), and kernel extreme learning machine (KELM), are identical in outcome and were developed by three different researchers under three different frameworks. Since KRR is a more generic name than the others, we use the name KRR instead of LSSVM or KELM in this paper. The proposed methods of this paper can be considered variants of KRR, LSSVM (with bias=0), or KELM:

KELM = KRR = LSSVM(with bias=0)
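The shared closed form behind this equivalence can be illustrated with a minimal NumPy sketch. It is not taken from the paper: the function names `krr_dual_fit` and `ridge_primal_fit` are hypothetical, a linear kernel is assumed so that the dual solution can be checked against ordinary ridge regression, and `lam` plays the role that $1/C$ plays in the KELM and LSSVM parameterizations.

```python
import numpy as np

def krr_dual_fit(K, y, lam):
    """Dual coefficients shared by KRR, KELM and zero-bias LSSVM:
    alpha = (K + lam * I)^{-1} y, obtained from a single linear solve."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def ridge_primal_fit(X, y, lam):
    """Primal ridge weights w = (X^T X + lam * I)^{-1} X^T y, for comparison."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 samples (rows), 3 features
y = rng.normal(size=8)

alpha = krr_dual_fit(X @ X.T, y, lam=0.5)   # linear kernel K = X X^T
w_from_dual = X.T @ alpha                   # map dual coefficients back to weights
w_primal = ridge_primal_fit(X, y, lam=0.5)
```

With a linear kernel the two weight vectors coincide, which is the non-iterative, linear-system solution that the rest of the paper builds on.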

1 Introduction

One-class Classification (OCC) has been widely used for outlier, novelty, fault, and intrusion detection moya1993one ; khan2009survey ; pimentel2014review ; xu2013rough ; hamidzadeh2018improved ; xiao2009multi by researchers from different disciplines. In multi-class problems, both positive and negative samples are available for training gepperth2016generative ; luria2014detection ; justodetection ; anbar2018machine . However, in OCC problems, samples of the class of interest (i.e., positive samples) are available while negative samples are very rare or costly to collect david2001tax ; park2007svdd ; liu2013svdd ; kassab2009incremental ; munoz2006estimation ; chen2017one ; hu2015privacy , thus making the application of multi-class models problematic. Various one-class classifiers pimentel2014review ; o2014anomaly have been proposed based on the regression model, the clustering model, etc. One-class classification methods available in the literature can be divided into two broad categories, viz., non-kernel-based and kernel-based methods. Non-kernel-based one-class classifiers include the principal component analysis based data descriptor (one-class classifiers are also known as data descriptors due to their capability to describe the distribution of data and the boundaries of the class of interest) david2001tax , angle-based outlier factor data description kriegel2008angle , K-means data description david2001tax , self-organizing map data description david2001tax , Auto-Encoder data descriptor japkowicz1999concept , etc. Kernel-based one-class classifiers include support vector data description tax2004support , one-class support vector machine scholkopf1999support , kernel principal component analysis based data description hoffmann2007kernel , etc. Kernel-based methods have been shown to outperform non-kernel-based methods in the literature pimentel2014review ; david2001tax . Despite this fact, these kernel-based methods involve the solution of a quadratic optimization problem, which is computationally expensive. In contrast to these kernel-based methods, $KRR$-based models saunders1998ridge optimize the problem rapidly in a non-iterative way by solving a linear system. Therefore, $KRR$-based models saunders1998ridge ; wornyo2018co ; zhang2017benchmarking ; he2014kernel ; wu2017cost have received considerable attention from researchers for solving various types of problems, viz., regression, binary classification, multi-class classification, etc.

In recent years, various $KRR$-based one-class classifiers have been developed and have exhibited better performance compared to various state-of-the-art one-class classifiers. (The methods discussed in this paragraph use the name KELM in their papers. Since KELM and KRR are identical, as discussed above, we use the more generic name KRR instead of KELM.) Overall, the $KRR$-based one-class classifiers can be divided into two types, namely, (i) without Graph-Embedding and (ii) with Graph-Embedding. For ‘without Graph-Embedding’, two types of architectures have been explored for OCC. One is the $KRR$-based single output node architecture leng2014one , and the other is the $KRR$-based Auto-Encoder architecture gautam2017construction . For ‘with Graph-Embedding’, Iosifidis et al. iosifidis2016one presented local and global variance-based Graph-Embedded one-class classifiers. Different types of Laplacian Graphs are employed by Iosifidis et al. iosifidis2016one for local (e.g., Local Linear Embedding, Laplacian Eigenmaps) and global (e.g., linear discriminant analysis and clustering-based discriminant analysis) variance embedding. Later, global variance-based Graph-Embedding was extended by Mygdalis et al. mygdalis2016one in order to exploit class variance and sub-class variance information for the face verification task. All the above-mentioned $KRR$-based one-class classifiers employ only a single-layered architecture.

Over the last decade, stacked Auto-Encoder based multi-layer architectures have received considerable attention from researchers for multi-class and binary classification tasks bengio2009learning ; schmidhuber2015deep . Such architectures can lead to better representation learning vincent2008extracting ; shin2013stacked and are also used in dimensionality reduction hinton2006reducing ; van2009dimensionality ; wang2014generalized . High-level feature representations obtained by using stacked Auto-Encoders also help in improving the performance of traditional classifiers vincent2010stacked . This paper explores the possibility of $KRR$-based representation learning using stacked Auto-Encoders for the one-class classification task.

In this paper, we propose a multi-layer architecture by stacking various Graph-Embedded $KRR$-based Auto-Encoders (trained using unsupervised learning) in a hierarchical manner for the one-class classification task. These Auto-Encoders are designed to exploit two types of data relationships encoded in graphs yan2007graph , i.e., local and global variance information-based Graph-Embedding. This information is incorporated in the Auto-Encoder training process in order to simultaneously enhance the data reconstruction ability, the data representation ability, and the class compactness in the derived feature space. The multiple layers exploit the idea of successive nonlinear data mappings and hence capture the relationship effectively. After stacking several Auto-Encoder layers in a hierarchical manner, the data are represented in a new feature space, in which a Graph-Embedded regression-based one-class classifier is employed as the final layer. At the final layer, the output of the stacked Auto-Encoders is approximated to a real number, and a threshold is set for deciding whether a sample is an outlier or not. Two threshold criteria (i.e., $\theta 1$ and $\theta 2$) are discussed in this paper. By employing different realizations of the proposed Auto-Encoder, two different architectures are formed based on the local and global variance criteria, referred to as $LMKOC$ and $GMKOC$, respectively. Both architectures are evaluated with the two threshold criteria, yielding 4 variants of the Graph-Embedded multi-layer one-class classifier. Further, the performance of $GMKOC$ and $LMKOC$ is evaluated using $21$ benchmark datasets and compared with $11$ state-of-the-art kernel-based methods available in the literature. Finally, a Friedman test demvsar2006statistical is conducted to verify the statistical significance of the experimental outcomes of the proposed classifiers, and it rejects the null hypothesis at the $95\%$ confidence level.

The rest of the paper is organized as follows. Section 2 describes the $LMKOC$ and $GMKOC$ in detail. Performance evaluation is provided in Section 3. Finally, Section 4 concludes our work.

2 Proposed Method

In this section, a Graph-Embedded multi-layer $KRR$-based architecture for one-class classification is described. The proposed multi-layer architecture is constructed by stacking various Graph-Embedded $KRR$-based Auto-Encoders, followed by a Graph-Embedded $KRR$-based one-class classifier, as shown in Fig. 1. Graph-Embedding is performed using two types of variance information, viz., local and global variance. One is referred to as Local variance based Graph-Embedded Multi-layer $KRR$ for One-class Classification ($LMKOC$), and the other is referred to as Global variance based Graph-Embedded Multi-layer $KRR$ for One-class Classification ($GMKOC$). Local and global variance-based kernelized Auto-Encoders are referred to as $LKAE$ and $GKAE$, respectively.

During construction of the multi-layer architecture, either local or global variance is used for every layer of the architecture. As shown in Fig. 1, $GMKOC$/$LMKOC$ is constructed by stacking various $GKAE$s/$LKAE$s (here, ‘/’ denotes "or": $GMKOC$ uses $GKAE$ and $LMKOC$ uses $LKAE$). These stacked Auto-Encoders are employed for defining the successive data representations. In the $1^{st}$ $GKAE$/$LKAE$ of this figure, the input training matrix is denoted by $\bm{X=X^{0}=\left\{x_{i}^{0}\right\}}$, where $\bm{x_{i}^{0}}=[x_{i1}^{0},x_{i2}^{0},...,x_{in}^{0}]$, $i=1,2,...,N$, is the $n$-dimensional input vector of the $i^{th}$ training sample. Let us assume that there are $d$ layers in the proposed architecture, i.e., $h=1,2,...,d$. The output of the $h^{th}$ layer is passed as input to the $(h+1)^{th}$ layer. Let us denote the output at the $h^{th}$ Auto-Encoder layer by $\bm{X^{h}=\left\{x_{i}^{h}\right\}}$, where $\bm{x_{i}^{h}}=[x_{i1}^{h},x_{i2}^{h},...,x_{in}^{h}]$, $i=1,2,...,N$. $\bm{X^{h}}$ corresponds to the output of the $h^{th}$ Auto-Encoder and the input of the $(h+1)^{th}$ Auto-Encoder. Each of the Auto-Encoders involves a data mapping using a function $\phi(.)$, mapping $\bm{X^{h-1}}$ to $\bm{\Phi^{h}=\phi(X^{h-1})}$. $\phi(.)$ corresponds to a mapping of $\bm{X^{h-1}}$ to the corresponding kernel space $\bm{K^{h}}=\bm{(\Phi^{h})^{T}\Phi^{h}}$. Here, $\bm{\Phi^{h}}=[\bm{\phi_{1}^{h}},\bm{\phi_{2}^{h}},...,\bm{\phi_{N}^{h}}]$. The data representation obtained by calculating the output of the $(d-1)^{th}$ Auto-Encoder in the architecture is passed to the $d^{th}$ layer for OCC using $GMKOC^{d}$/$LMKOC^{d}$. Here, $GMKOC^{d}$/$LMKOC^{d}$ denotes the $d^{th}$ layer of $GMKOC$/$LMKOC$. In the given figure, Graph-Embedding is performed by using a scatter matrix $\bm{S^{h}}$, which encodes the local or global variance information with the kernel matrix. Here, $\bm{S^{h}}$ denotes the scatter matrix of the $h^{th}$ layer. Two types of training errors and weight matrices are generated by $GMKOC$/$LMKOC$. The first type of training error matrix and weight matrix are generated by the $h^{th}$ Auto-Encoder, for each of the first $(d-1)$ layers, and are denoted as $\bm{E^{h}=\left\{e_{i}^{h}\right\}}$ and $\bm{\beta_{a}^{h}}$, respectively, where $i=1,2,...,N$ and $h=1,2,...,(d-1)$. The other type of training error vector and weight vector are generated by the one-class classifier at the $d^{th}$ layer and are denoted as $\bm{E^{d}}=\left\{e_{i}^{d}\right\}$ and $\bm{\beta_{o}^{d}}$, respectively, where $i=1,2,...,N$. Based on the above notation, the proposed methods $GMKOC$ and $LMKOC$ are discussed in the next subsections.

2.1 Local Variance Information based Graph-Embedded Multi-layer $KRR$ for One-class Classification: $LMKOC$

In this subsection, $LMKOC$ is proposed. This multi-layer architecture exploits Local variance information with $\bm{KAE}$ s ( $LKAE$ s). The overall architecture of $LMKOC$ is formed by two processing steps.

In the first step, $(d-1)$ $LKAE$s are trained, each defining a triplet ($\bm{X^{h},\beta_{a}^{h},S^{h}}$), and stacked in a hierarchical manner. An $LKAE$ involves a non-linear mapping $\bm{X^{h-1}\rightarrow\Phi^{h}}$ and, subsequently, defines a graph $\mathcal{G}^{h}=\bm{\left\{\Phi^{h},V^{h}\right\}}$, where $\bm{V^{h}}\in\mathbb{R}^{N\times N}$ is the weight matrix expressing similarities between the graph nodes $\bm{\phi_{i}^{h}}\in\bm{\Phi^{h}}$. The Graph Laplacian matrix of the $h^{th}$ $LKAE$ is calculated by $\bm{\mathcal{L}^{h}=D^{h}-V^{h}}$, where $\bm{D^{h}}$ is a diagonal degree matrix in the $h^{th}$ layer defined as yan2007graph :

$$D_{ii}^{h}=\sum_{j=1}^{N}V_{ij}^{h} \tag{1}$$

Any type of local variance based Laplacian Graph (e.g. Laplacian Eigenmaps ( $LE$ ) belkin2003laplacian , Locally Linear Embedding ( $LLE$ ) saul2003think etc.) can be exploited in the $LKAE$ . In our experiments, we have used the fully connected and k-nearest neighbor graph using the heat kernel function:

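$$v_{ij}^{h}=\exp\left(-\frac{\left\|\bm{\phi_{i}^{h}}-\bm{\phi_{j}^{h}}\right\|_{2}^{2}}{2\sigma^{2}}\right) \tag{2}$$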

where, $\sigma$ is a hyper-parameter scaling the square Euclidean distance between $\bm{\phi_{i}^{h}}$ and $\bm{\phi_{j}^{h}}$ . In the case of k-nearest neighbor Graph, the weight matrix $\bm{V^{h}}$ is defined as follows:

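$$V_{ij}^{h}=\begin{cases}v_{ij}^{h}, & \text{if }\bm{\phi_{j}^{h}}\in\mathcal{N}_{i}^{h}\\ 0, & \text{otherwise}\end{cases} \tag{3}$$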

where, $\mathcal{N}_{i}^{h}$ denotes the neighborhood of $\bm{\phi_{i}^{h}}$ . Using the above notation, the scatter matrix $\bm{S^{h}}$ encoding the local variance information is given by:

$$\bm{S^{h}}=\bm{\Phi^{h}\mathcal{L}^{h}(\Phi^{h})^{T}} \tag{4}$$
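As an illustration of the graph construction described above, the following NumPy sketch builds a k-nearest neighbor heat-kernel graph and its Laplacian directly from a Gram matrix, using the identity $\|\bm{\phi_{i}}-\bm{\phi_{j}}\|_{2}^{2}=K_{ii}+K_{jj}-2K_{ij}$ so that the feature-space vectors never need to be formed. The function names, the RBF kernel, and the symmetrisation of $\bm{V}$ are assumptions made for this sketch; the text does not specify them.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2); rows of X are samples."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

def knn_heat_laplacian(K, k, sigma):
    """Graph Laplacian L = D - V of a k-nearest neighbor heat-kernel graph in kernel space."""
    diag = np.diag(K)
    # Squared feature-space distances computed from the Gram matrix only.
    d2 = np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0)
    v = np.exp(-d2 / (2.0 * sigma ** 2))          # heat-kernel weights v_ij
    N = K.shape[0]
    V = np.zeros((N, N))
    for i in range(N):
        order = np.argsort(d2[i])
        nbrs = [j for j in order if j != i][:k]   # k nearest neighbors of phi_i
        V[i, nbrs] = v[i, nbrs]
    V = np.maximum(V, V.T)       # symmetrise the neighborhood graph (assumption)
    D = np.diag(V.sum(axis=1))   # degree matrix, D_ii = sum_j V_ij
    return D - V

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
K = rbf_kernel(X, gamma=0.5)
Lap = knn_heat_laplacian(K, k=3, sigma=1.0)
```

Setting `k = N - 1` keeps every off-diagonal weight, which corresponds to the fully connected graph mentioned in the text.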

Minimization criterion of $LKAE$ is derived by using vanilla $KRR$ -based Auto-Encoder ( $KAE$ ). A $KAE$ can be formulated as follows:

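$$\begin{aligned}\text{Minimize}:\ &\pounds_{KAE}=\frac{1}{2}\left\|\bm{\beta_{a}^{h}}\right\|^{2}+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{a}^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned} \tag{5}$$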

where $C$ is a regularization parameter, and $\bm{e_{i}^{h}}$ is a training error vector corresponding to the $i^{th}$ training sample at $h^{th}$ layer. Based on the minimization criterion in (5), $LKAE$ can be formulated as follows:

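$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(\beta_{a}^{h})^{T}}(\bm{S^{h}}+\lambda\bm{I})\bm{\beta_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{a}^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned} \tag{6}$$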

Based on the Representer Theorem argyriou2009there , we express $\bm{\beta^{h}_{a}}$ as a linear combination of the training data representation $\bm{\Phi^{h}}$ and a reconstruction weight matrix $\bm{W_{a}^{h}}$ :

$$\bm{\beta_{a}^{h}}=\bm{\Phi^{h}W_{a}^{h}} \tag{7}$$

Hence, by using Representer Theorem argyriou2009there , minimization criterion in (6) is reformulated as follows:

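$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}(\Phi^{h})^{T}}(\bm{\Phi^{h}\mathcal{L}^{h}(\Phi^{h})^{T}}+\lambda\bm{I})\bm{\Phi^{h}W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{a}^{h})^{T}(\Phi^{h})^{T}\phi_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned} \tag{8}$$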

By further substitution of $\bm{K^{h}}=\bm{(\Phi^{h})^{T}\Phi^{h}}$ , where $\bm{k_{i}^{h}}\subseteq\bm{K^{h}}$ is formed by the elements $\bm{k_{ij}^{h}}=\bm{(\phi_{i}^{h})^{T}\phi_{j}^{h}}$ , the criterion in (8) can be written as:

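$$\begin{aligned}\text{Minimize}:\ &\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}}(\bm{K^{h}\mathcal{L}^{h}K^{h}}+\lambda\bm{K^{h}})\bm{W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{a}^{h})^{T}k_{i}^{h}}=\bm{x_{i}^{h-1}}-\bm{e_{i}^{h}},\quad i=1,2,...,N\end{aligned} \tag{9}$$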

The Lagrangian relaxation of (9) is shown below in (10):

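$$\pounds_{LKAE}=\frac{1}{2}Tr\Big(\bm{(W_{a}^{h})^{T}}(\bm{K^{h}\mathcal{L}^{h}K^{h}}+\lambda\bm{K^{h}})\bm{W_{a}^{h}}\Big)+\frac{C}{2}\sum_{i=1}^{N}\left\|\bm{e_{i}^{h}}\right\|_{2}^{2}-\sum_{i=1}^{N}\bm{\alpha_{i}^{h}}\Big(\bm{(W_{a}^{h})^{T}k_{i}^{h}}-\bm{x_{i}^{h-1}}+\bm{e_{i}^{h}}\Big) \tag{10}$$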

where $\bm{\alpha^{h}=\{\alpha_{i}^{h}\}},i=1,2\ldots N$ , is a Lagrangian multiplier. In order to optimize (10), we compute its derivatives as follows:

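$$\frac{\partial\pounds_{LKAE}}{\partial\bm{W_{a}^{h}}}=0\Rightarrow\bm{W_{a}^{h}}=(\bm{\mathcal{L}^{h}K^{h}}+\lambda\bm{I})^{-1}\bm{\alpha^{h}} \tag{11}$$

$$\frac{\partial\pounds_{LKAE}}{\partial\bm{e_{i}^{h}}}=0\Rightarrow\bm{E^{h}}=\frac{1}{C}\bm{\alpha^{h}} \tag{12}$$

$$\frac{\partial\pounds_{LKAE}}{\partial\bm{\alpha_{i}^{h}}}=0\Rightarrow\bm{(W_{a}^{h})^{T}K^{h}}=\bm{X^{h-1}}-\bm{E^{h}} \tag{13}$$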

The matrix $\bm{W^{h}_{a}}$ is obtained by substituting (12) and (13) into (11), and is given by:

[TABLE]
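For readers who want to retrace this substitution, one possible route is sketched below. It is a reconstruction under stated assumptions, not the authors' printed formula: samples are arranged as rows so that $\bm{X^{h-1}}$, $\bm{E^{h}}$ and $\bm{\alpha^{h}}$ share the same shape, $\bm{K^{h}}$ is symmetric, and the stationarity conditions are those stated in (11), (12) and (13).

```latex
\begin{aligned}
(\mathcal{L}^{h}K^{h}+\lambda I)\,W_{a}^{h} &= \alpha^{h}
  && \text{from (11)}\\
\alpha^{h} &= C\,E^{h}
  && \text{from (12)}\\
E^{h} &= X^{h-1}-K^{h}W_{a}^{h}
  && \text{from (13), transposed, using } (K^{h})^{T}=K^{h}\\
\Rightarrow\ (\mathcal{L}^{h}K^{h}+\lambda I)\,W_{a}^{h} &= C\,\bigl(X^{h-1}-K^{h}W_{a}^{h}\bigr)\\
\Rightarrow\ \Bigl(K^{h}+\tfrac{1}{C}\mathcal{L}^{h}K^{h}+\tfrac{\lambda}{C}I\Bigr)W_{a}^{h} &= X^{h-1}
\end{aligned}
```

Under these assumptions the weights follow from a single linear solve, and setting $\mathcal{L}^{h}=0$ recovers an ordinary kernel ridge regression system with ridge term $\lambda/C$.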

Now, $\bm{\beta^{h}_{a}}$ can be derived by substituting (14) into (7):

[TABLE]
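A numerical sketch of one such Auto-Encoder layer is given below. It assumes the linear system $(\bm{K}+\frac{1}{C}\bm{\mathcal{L}K}+\frac{\lambda}{C}\bm{I})\bm{W}=\bm{X}$ that follows from (11) to (13) when samples are stored as rows, and it assumes that the layer output passed to the next layer is the reconstruction $\bm{KW}$; neither the function name `train_lkae_layer` nor these conventions are confirmed by the text.

```python
import numpy as np

def train_lkae_layer(K, Lap, X_prev, C, lam):
    """Fit one graph-embedded kernel Auto-Encoder layer (hedged sketch).

    K      : (N, N) kernel matrix of the layer input
    Lap    : (N, N) graph Laplacian encoding local or global variance
    X_prev : (N, n) layer input, one sample per row (reconstruction target)
    """
    N = K.shape[0]
    A = K + (Lap @ K) / C + (lam / C) * np.eye(N)
    W = np.linalg.solve(A, X_prev)   # non-iterative closed-form solve
    X_next = K @ W                   # reconstruction, used as layer output (assumption)
    return W, X_next

rng = np.random.default_rng(1)
X0 = rng.normal(size=(6, 3))
K = X0 @ X0.T + np.eye(6)            # any symmetric positive definite kernel matrix
Lap = np.zeros((6, 6))               # no graph term: reduces to a plain KRR Auto-Encoder
W, X1 = train_lkae_layer(K, Lap, X0, C=2.0, lam=2.0)
```

Stacking then amounts to recomputing the kernel matrix and Laplacian on `X1` and calling the function again for the next layer.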

After mapping the training data through the $(d-1)$ successive $LKAE$ s in the first step, the training data representations defined by the outputs of the $(d-1)^{th}$ $LKAE$ are used in order to train a Local variance based Graph-Embedded Multi-layer KRR for OCC at $d^{th}$ layer ( $LMKOC^{d}$ ) in the second step. The $LMKOC^{d}$ involves a nonlinear mapping $\bm{X^{d-1}\rightarrow\Phi^{d}}$ and is trained by solving the following optimization problem:

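$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(\beta_{o}^{d})^{T}}(\bm{S^{d}}+\lambda\bm{I})\bm{\beta_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(\beta_{o}^{d})^{T}\phi_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned} \tag{16}$$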

By using Representer Theorem argyriou2009there , $\bm{\beta^{d}_{o}}$ is expressed as a linear combination of the training data representation $\bm{\Phi^{d}}$ and reconstruction weight vector $\bm{W_{o}^{d}}$ :

$$\bm{\beta_{o}^{d}}=\bm{\Phi^{d}W_{o}^{d}} \tag{17}$$

The scatter matrix $\bm{S^{d}}$ encodes the local variance information at $d^{th}$ layer, and is given by:

$$\bm{S^{d}}=\bm{\Phi^{d}\mathcal{L}^{d}(\Phi^{d})^{T}} \tag{18}$$

Now, by using (17) and (18), the minimization criterion in (16) is reformulated to the following:

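$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}(\Phi^{d})^{T}}(\bm{\Phi^{d}\mathcal{L}^{d}(\Phi^{d})^{T}}+\lambda\bm{I})\bm{\Phi^{d}W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{o}^{d})^{T}(\Phi^{d})^{T}\phi_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned} \tag{19}$$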

In addition, by substituting $\bm{K^{d}}=\bm{(\Phi^{d})^{T}\Phi^{d}}$ , where $\bm{k_{i}^{d}}\subseteq\bm{K^{d}}$ , the optimization problem in (19) can be reformulated as follows:

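$$\begin{aligned}\text{Minimize}:\ &\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}}(\bm{K^{d}\mathcal{L}^{d}K^{d}}+\lambda\bm{K^{d}})\bm{W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}\\ \text{Subject to}:\ &\bm{(W_{o}^{d})^{T}k_{i}^{d}}=r-e_{i}^{d},\quad i=1,2,...,N\end{aligned} \tag{20}$$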

The Lagrangian relaxation of (20) is shown below in (21):

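$$\pounds_{LMKOC^{d}}=\frac{1}{2}\bm{(W_{o}^{d})^{T}}(\bm{K^{d}\mathcal{L}^{d}K^{d}}+\lambda\bm{K^{d}})\bm{W_{o}^{d}}+\frac{C}{2}\sum_{i=1}^{N}\left\|e_{i}^{d}\right\|_{2}^{2}-\sum_{i=1}^{N}\alpha_{i}^{d}\Big(\bm{(W_{o}^{d})^{T}k_{i}^{d}}-r+e_{i}^{d}\Big) \tag{21}$$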

where $\bm{\alpha^{d}=\{\alpha_{i}^{d}\}},i=1,2\ldots N$ , is a Lagrangian multiplier. In order to optimize (21), we compute its derivatives as follows:

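$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial\bm{W_{o}^{d}}}=0\Rightarrow\bm{W_{o}^{d}}=(\bm{\mathcal{L}^{d}K^{d}}+\lambda\bm{I})^{-1}\bm{\alpha^{d}} \tag{22}$$

$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial e_{i}^{d}}=0\Rightarrow\bm{E^{d}}=\frac{1}{C}\bm{\alpha^{d}} \tag{23}$$

$$\frac{\partial\pounds_{LMKOC^{d}}}{\partial\alpha_{i}^{d}}=0\Rightarrow\bm{(W_{o}^{d})^{T}K^{d}}=r-\bm{E^{d}} \tag{24}$$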

The matrix $\bm{W_{o}^{d}}$ is obtained by substituting (23) and (24) into (22), and is given by:

[TABLE]

$\bm{\beta^{d}_{o}}$ can be derived by substituting (25) into (17):

[TABLE]

The predicted output of the final layer (i.e., $d^{th}$ layer) of the multi-layer architecture for training samples can be calculated as follows:

[TABLE]

where $\bm{\widehat{O}}$ is the predicted output for training data.
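The final one-class layer can be sketched in the same spirit. The code below assumes that the weight vector solves $(\bm{K}+\frac{1}{C}\bm{\mathcal{L}K}+\frac{\lambda}{C}\bm{I})\bm{W_{o}}=r\bm{1}$, obtained from (22) to (24) by the same elimination as for the Auto-Encoder, and that the predicted output is $\bm{\widehat{O}}=\bm{KW_{o}}$. The names `train_oc_layer` and `predict_oc` are hypothetical.

```python
import numpy as np

def train_oc_layer(K, Lap, r, C, lam):
    """Fit the final graph-embedded one-class layer with constant target r (sketch)."""
    N = K.shape[0]
    A = K + (Lap @ K) / C + (lam / C) * np.eye(N)
    W_o = np.linalg.solve(A, np.full(N, float(r)))
    O_hat = K @ W_o                  # predicted output for the training samples
    return W_o, O_hat

def predict_oc(K_test, W_o):
    """Output for new samples; K_test[p, i] = k(x_p, x_i) against training samples."""
    return K_test @ W_o

rng = np.random.default_rng(2)
G = rng.normal(size=(5, 4))
K = G @ G.T + np.eye(5)              # symmetric positive definite stand-in kernel
W_o, O_hat = train_oc_layer(K, np.zeros((5, 5)), r=1.0, C=1e8, lam=1.0)
```

With a very large `C` and no graph term the layer interpolates the target, so every training output is close to `r`; smaller `C` or a non-zero Laplacian pulls the outputs away from `r`, and these deviations are what the threshold criteria operate on.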

After completing the training process, a threshold is required to decide whether any sample is an outlier or not. Two types of threshold criteria ( $\theta 1$ and $\theta 2$ ) are discussed in Subsection 2.3.

The overall processing steps of $LMKOC$ are described in Algorithm 1.

2.2 Global Variance Information based Graph-Embedded Multi-layer $KRR$ for One-class Classification: $GMKOC$

In this subsection, $GMKOC$ is proposed. In order to exploit global variance information for Auto-Encoder training, we define the variance $(\bm{Z^{h}})$ of the training data representations for the $h^{th}$ Auto-Encoder as follows:

[TABLE]

where $\overline{\bm{\Phi^{h}}}$ is the mean training vector in the kernel space of the $h^{th}$ Auto-Encoder, i.e. $\overline{\bm{\Phi^{h}}}=\frac{1}{N}\sum_{i=1}^{N}\bm{\phi_{i}^{h}}$ . $\bm{Z^{h}}$ can be expressed in the form:

[TABLE]

where, $\bm{1}\in\mathbb{R}^{N}$ is a vector of ones, $\bm{I}\in\mathbb{R}^{N\times N}$ is the identity matrix, and $\bm{\mathcal{Z}^{h}}$ represents Graph Laplacian matrix for $h^{th}$ layer. Any type of global variance based Laplacian Graph (e.g. Linear Discriminant Analysis ( $LDA$ ) duda1973pattern and Clustering-based Discriminant Analysis ( $CDA$ ) etc.) can be exploited in the $GMKOC$ .
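The role of the vector of ones and the identity matrix can be checked numerically. The sketch below assumes the standard centering form $\bm{I}-\frac{1}{N}\bm{1}\bm{1}^{T}$ for the global-variance Laplacian and an unnormalised scatter (no $1/N$ factor); the exact normalisation used in the paper is not recoverable from the text, and the function names are hypothetical.

```python
import numpy as np

def centering_laplacian(N):
    """Centering matrix I - (1/N) 1 1^T, used here as the global-variance Laplacian."""
    return np.eye(N) - np.ones((N, N)) / N

def scatter_direct(Phi):
    """Sum over samples of (phi_i - mean)(phi_i - mean)^T; columns of Phi are samples."""
    centered = Phi - Phi.mean(axis=1, keepdims=True)
    return centered @ centered.T

rng = np.random.default_rng(3)
Phi = rng.normal(size=(3, 7))        # 7 samples in an explicit 3-dimensional feature space
Z_direct = scatter_direct(Phi)
Z_graph = Phi @ centering_laplacian(7) @ Phi.T
```

Both computations give the same matrix, which is why the global variance can be handled with the same graph-embedding machinery as the local case.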

Minimization problems and their solutions for global variance case can be simply obtained from the equations of local variance (Section 2.1) by using $\bm{Z^{h}}$ instead of $\bm{S^{h}}$ . Hence, optimization problem for Global variance information based $KAE$ ( $GKAE$ ) is written as follows by using $\bm{Z^{h}}$ instead of $\bm{S^{h}}$ in (6):

[TABLE]

Using (30) to optimize the proposed $GMKOC$ minimizes the training error and enforces class compactness simultaneously. This can be seen by expressing (30) using (28) as follows:

[TABLE]

where $\bm{o_{i}^{h}}=\bm{(\beta_{a}^{h})^{T}\phi_{i}^{h}}$ and $\bm{o^{h}}=\bm{(\beta_{a}^{h})^{T}}\overline{\bm{\Phi^{h}}}$. Here, the regularization parameter $C$ provides the trade-off between the two objectives, viz., minimizing the training error and enforcing class compactness.

The above minimization problem can be solved in the same manner as (6) in the previous subsection. Hence, for the global variance case, we provide only the final solutions, due to space constraints, obtained by using $\bm{\mathcal{Z}^{h}}$ instead of $\bm{\mathcal{L}^{h}}$ in (14) and (15). The weights $\bm{W^{h}_{a}}$ and $\bm{\beta^{h}_{a}}$ for $GKAE$ are given by:

[TABLE]

After mapping the training data through the $(d-1)$ successive Auto-Encoder layers in the first step, the training data representations defined by the outputs of the $(d-1)^{th}$ $GKAE$ are used in order to train a Global variance based Graph-Embedded Multi-layer KRR for OCC at $d^{th}$ layer ( $GMKOC^{d}$ ) in the second step. Optimization problem of $GMKOC^{d}$ is written as follows by using $\bm{Z^{d}}$ instead of $\bm{S^{d}}$ in (16):

[TABLE]

The above minimization problem can be solved in the same way as (16). Further, by using $\bm{\mathcal{Z}^{d}}$ instead of $\bm{\mathcal{L}^{d}}$ in (25) and (26), its weight vectors $\bm{W_{o}^{d}}$ and $\bm{\beta^{d}_{o}}$ are obtained as follows:

[TABLE]

The predicted output of the final layer (i.e., the $d^{th}$ layer) of the multi-layer architecture for training samples can be calculated as in (27) of the previous subsection. The decision process for a test vector, i.e., whether it is an outlier or not, is discussed in Subsection 2.3.

The overall processing steps followed by $GMKOC$ are described in Algorithm 1.

2.3 Decision Function

Two types of thresholds namely, $\theta 1$ and $\theta 2$ , are employed with the proposed methods, which are determined as follows:

1. For $\theta 1$:

(i) Calculate the distance between the predicted value of the $i^{th}$ training sample and $r$, and store it in a vector $\bm{d}$ as follows:

[TABLE]

(ii) After storing all distances in $\bm{d}$ as per (37), sort these distances in decreasing order, denoted by a vector $\bm{d_{dec}}$. Further, reject a small percentage of training samples based on the deviation. The most deviated samples are rejected first because they are most probably far from the distribution of the target data. The threshold is decided based on these deviations as follows:

[TABLE]

where $0<\eta\leq 1$ is the fraction of rejection of training samples for deciding the threshold value, $N$ is the number of training samples, and $\lfloor\text{ }\rfloor$ denotes the floor operation.

2. For $\theta 2$: Select the threshold $(\theta 2)$ as a small fraction of the mean of the predicted output:

[TABLE]

where $0<\eta\leq 1$ is the fraction of rejection for deciding threshold value.
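The two steps for $\theta 1$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the distance is taken to be the absolute difference between a scalar prediction and $r$, and the position $\lfloor\eta N\rfloor$ is clamped to $1$ when it would otherwise be $0$. The function name `theta1_threshold` is hypothetical. The formula for $\theta 2$ is not visible in this excerpt, so it is not sketched here.

```python
import math


def theta1_threshold(predicted, r, eta=0.05):
    """Pick the theta-1 threshold from the training predictions.

    predicted: model outputs for the N training samples (scalars)
    r:         target value the one-class regressor is trained towards
    eta:       fraction of training samples to reject, 0 < eta <= 1
    """
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    # Step (i): deviation of every training output from r
    # (absolute difference is an assumption of this sketch).
    d = [abs(o - r) for o in predicted]
    # Step (ii): sort the deviations in decreasing order.
    d_dec = sorted(d, reverse=True)
    # 1-based position floor(eta * N); clamped to 1 so that a small N
    # still gives a valid index (the clamp is an assumption).
    k = max(1, math.floor(eta * len(d_dec)))
    return d_dec[k - 1]


if __name__ == "__main__":
    outputs = [1.0, 0.5, 1.25, 0.75]
    print(theta1_threshold(outputs, r=1.0, eta=0.5))
```

With a larger $\eta$ more of the most deviated training samples fall above the threshold, so the accepted region around $r$ becomes tighter.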

A threshold value can thus be determined by either of the above procedures. Afterwards, during testing, a test vector $\bm{{x}_{p}}$ is fed to the trained multi-layer architecture and its output $\widehat{O}_{p}$ is obtained. Then $\widehat{d}$ is computed according to the chosen threshold type as follows:

For $\theta 1$ , calculate the distance ( $\widehat{d}$ ) between the predicted value $\widehat{O}_{p}$ of the $p^{th}$ testing sample and $r$ as follows:

[TABLE]

For $\theta 2$ , calculate the distance ( $\widehat{d}$ ) between the predicted value $\widehat{O}_{p}$ of the $p^{th}$ testing sample and mean of the predicted values obtained after training as follows:

[TABLE]

Finally, $\bm{{x}_{p}}$ is classified based on the following rule:

[TABLE]
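As a rough illustration of the test-time procedure, the sketch below computes $\widehat{d}$ against a reference value and compares it with a threshold. Several points are assumptions, because the corresponding equations are not visible in this excerpt: the distance is taken as an absolute difference, a sample is accepted as target when $\widehat{d}$ does not exceed the threshold, and the names `deviation` and `classify` are hypothetical. For $\theta 1$ the reference is $r$; for $\theta 2$ it is the mean of the predicted values obtained after training.

```python
from statistics import mean


def deviation(o_p, reference):
    """Distance between a test output and the reference value."""
    return abs(o_p - reference)


def classify(o_p, reference, threshold):
    """Label a test output as 'target' or 'outlier'.

    Assumed rule: accept as target when the deviation does not
    exceed the threshold, otherwise flag as outlier.
    """
    return "target" if deviation(o_p, reference) <= threshold else "outlier"


if __name__ == "__main__":
    train_outputs = [1.0, 0.5, 1.25, 0.75]
    # theta-1 style: reference is the regression target r.
    print(classify(0.9, reference=1.0, threshold=0.25))
    # theta-2 style: reference is the mean training output.
    print(classify(1.6, reference=mean(train_outputs), threshold=0.25))
```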

3 Experimental Results

In this section, experiments are conducted to evaluate the performance of the proposed MKOC over $21$ datasets. These datasets are obtained from the University of California Irvine (UCI) repository Lichman:2013 and were originally generated for binary or multi-class classification tasks. For our experiments, we have made them compatible with the OCC task in the following way. If a dataset has two or more classes, we alternately use each class in the dataset as the target class and the remaining classes as the outlier class. In this way, we construct $21$ one-class datasets from $10$ multi-class datasets. A description of these datasets can be found in Table 1. These $21$ datasets can be divided into $3$ categories, viz., $6$ financial, $8$ medical and $7$ miscellaneous datasets. Many of the datasets are slightly imbalanced. The imbalance ratio between the two classes is approximately $1:2$ for $11$ datasets, viz., German(1), German(2), Pima(1), Pima(2), Glass(1), Glass(2), Iono(1), Iono(2), Iris(1), Iris(2), and Iris(3). Here, all $7$ miscellaneous datasets are imbalanced in nature. All experiments on these datasets are carried out with MATLAB 2016a in a Windows $7$ (Intel Xeon $3$ GHz processor, $64$ GB RAM) environment.
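The conversion of a multi-class dataset into one-class problems can be sketched as follows. This is a simplified illustration, not the authors' MATLAB code: the function name `one_class_splits` is hypothetical, and the sketch emits one problem per class, whereas the paper reports $21$ one-class datasets from $10$ source datasets, so not every class of every dataset is necessarily used.

```python
def one_class_splits(samples, labels):
    """Yield (target_label, target_samples, outlier_samples) per class.

    Each class in turn is treated as the target class; samples of all
    remaining classes are pooled into a single outlier class.
    """
    for target in sorted(set(labels)):
        targets = [x for x, y in zip(samples, labels) if y == target]
        outliers = [x for x, y in zip(samples, labels) if y != target]
        yield target, targets, outliers


if __name__ == "__main__":
    X = [[0.1], [0.9], [0.2], [0.5]]
    y = [1, 2, 1, 3]
    for label, tgt, out in one_class_splits(X, y):
        print(label, len(tgt), len(out))
```

Only the target samples would be used for training; the outlier samples are needed solely to evaluate the trained model.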

3.1 Nomenclature of the Proposed and Existing Methods

Based on the multi-layer OCC described in the previous section, four variants have been proposed using two types of threshold criteria (viz., $\theta 1$ and $\theta 2$ ). Those variants are $LMKOC\mathchar 45\relax LLE\_\theta 1$ , $LMKOC\mathchar 45\relax LLE\_\theta 2$ , $GMKOC\mathchar 45\relax CDA\_\theta 1$ , and $GMKOC\mathchar 45\relax CDA\_\theta 2$ . Here, the name of the Laplacian graph used and the type of threshold criterion are appended to the name of the proposed method.

In total, $11$ existing kernel-based one-class classifiers are employed for comparison, which can be categorized as follows:

(i) Support Vector Machine ( $SVM$ ) based: One-class SVM ( $OCSVM$ ) scholkopf1999support and Support Vector Data Description ( $SVDD$ ) tax1999support .

(ii) $KRR$ -based:

(a) Without Graph-Embedding: $KRR$ -based OCC ( $KOC$ ) leng2014one and $KRR$ -based Auto-Encoder model for OCC ( $AEKOC$ ) gautam2017construction .

(b) With Graph-Embedding: Two types of Graph-Embedding, i.e., Local and Global, have been explored in the literature. Local and Global Graph-Embedding with $KOC$ are named $LKOC$ -X iosifidis2016one and $GKOC$ -X iosifidis2016one ; mygdalis2016one , respectively. Here, X can be any Laplacian graph with local or global Graph-Embedding. For local embedding, two types of graphs are explored, viz., Local Linear Embedding ( $LLE$ ) and Laplacian Eigenmaps ( $LE$ ). For global embedding, four types of graphs are explored, viz., Linear Discriminant Analysis ( $LDA$ ), Clustering-based LDA ( $CDA$ ), class variance ( $CV$ ), and sub-class variance ( $SV$ ). Hence, six existing variants are obtained, namely, $LKOC\mathchar 45\relax LE$ iosifidis2016one , $LKOC\mathchar 45\relax LLE$ iosifidis2016one , $GKOC\mathchar 45\relax LDA$ iosifidis2016one , $GKOC\mathchar 45\relax CDA$ iosifidis2016one , $GKOC\mathchar 45\relax CV$ mygdalis2016one and $GKOC\mathchar 45\relax SV$ mygdalis2016one . Here, we have considered the same Laplacian graphs as mentioned in iosifidis2016one .

(iii) Principal Component Analysis ( $PCA$ ) based: Kernel PCA ( $KPCA$ ) hoffmann2007kernel .

All existing and proposed one-class classifiers are implemented and tested in the same environment. $OCSVM$ is implemented using the LIBSVM library CC01a . $SVDD$ is implemented using the DD Toolbox Ddtools2015 . Code for all $KRR$ -based one-class classifiers was provided by the authors of the corresponding papers. The implementations of $KPCA$ hoffmann2007kernel and $AEKOC$ gautam2017construction are obtained from the links given in the respective papers (the links are available in the corresponding reference entries).

3.2 Range of the Parameters of the Proposed and Existing Methods

For all of the kernel-based methods, Radial Basis Function (RBF) kernel is employed as shown below,

[TABLE]

where $\sigma$ is calculated as the mean Euclidean distance between training vectors in the corresponding feature space. For the proposed multi-layer methods ( $LMKOC$ and $GMKOC$ ), we have used a maximum of $d=5$ layers, and the value of $\sigma^{h}$ is calculated at each $h^{th}$ layer independently using the training data representations $X^{h-1}$ . At each layer, the regularization parameter is selected from the range $\{2^{-3},\ldots,2^{3}\}$ . The classifiers that exploit graphs have two regularization parameters, which are selected by cross-validation using values $2^{l}$ , where $l\in\{-3,\ldots,3\}$ . For the graph encoding subclass information in $GKOC\mathchar 45\relax SV$ , the number of subclasses is selected from the range $\{2,3,\ldots,20\}$ . For $CDA$ graph-based classifiers ( $GKOC\mathchar 45\relax CDA$ , $GMKOC\mathchar 45\relax CDA\_\theta 1$ , and $GMKOC\mathchar 45\relax CDA\_\theta 2$ ), the number of clusters is selected from the range $\{2,3,\ldots,20\}$ . For the $KOC$ and $AEKOC$ methods, the regularization parameter is selected from the range $\{2^{-3},\ldots,2^{3}\}$ . For $KPCA$ based OCC, the percentage of preserved variance is selected from $\{85,90,95\}$ . The fraction of rejection $(\eta)$ of outliers during threshold selection is set to $0.05$ for all methods.
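The kernel width heuristic and the regularization grid described above can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: the RBF kernel is taken in the common form $\exp(-\lVert x_i-x_j\rVert^{2}/(2\sigma^{2}))$, since the kernel equation is not visible in this excerpt, and the mean distance is taken over distinct pairs of training vectors. The helper names are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_distance(X):
    """Mean Euclidean distance over all distinct pairs of rows in X."""
    pairs = list(combinations(X, 2))
    if not pairs:
        raise ValueError("need at least two training vectors")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


def rbf_kernel_matrix(X, sigma):
    """Kernel matrix with entries exp(-||a - b||^2 / (2 sigma^2)).

    The 2 * sigma^2 scaling is an assumed convention.
    """
    return [
        [math.exp(-math.dist(a, b) ** 2 / (2.0 * sigma ** 2)) for b in X]
        for a in X
    ]


# Candidate regularization values 2^-3, ..., 2^3 as listed in the text.
REG_GRID = [2.0 ** l for l in range(-3, 4)]

if __name__ == "__main__":
    X = [(0.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
    sigma = mean_pairwise_distance(X)
    K = rbf_kernel_matrix(X, sigma)
    print(sigma, K[0][1], REG_GRID)
```

In a multi-layer setting the same two helpers would be called once per layer on that layer's input representation, so that each layer gets its own $\sigma^{h}$.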

3.3 Performance Evaluation Criteria

Geometric mean ( $\eta_{g}$ ) is computed in the experiment for evaluating the performance of each of the classifiers and is calculated as

[TABLE]
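A geometric mean of two rates can be computed as in the sketch below. The equation itself is not visible in this excerpt, so which two rates it combines is an assumption here: precision and recall are used purely for illustration, and the helper names are hypothetical.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from confusion counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def gmean(rate_a, rate_b):
    """Geometric mean of two rates in [0, 1]."""
    return math.sqrt(rate_a * rate_b)


if __name__ == "__main__":
    p, r = precision_recall(tp=9, fp=3, fn=3)
    print(gmean(p, r))
```

Because the two rates are multiplied, a classifier that scores well on one rate but poorly on the other is penalized, which is why this kind of measure is often preferred on imbalanced data.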

In all our experiments, a $5$ -fold cross-validation (CV) procedure is used, and the average Gmean value (along with the corresponding standard deviation ( $\Delta$ )) over the $5$ folds is reported in the results. The $\eta_{g}$ values of all classifiers are further analyzed using the mean of all Gmeans ( $\eta_{m}$ ) and the percentage of the maximum Gmean ( $\eta_{p}$ ). $\eta_{m}$ is computed by averaging all Gmeans obtained by a classifier over all datasets. $\eta_{p}$ is computed as follows fernandez2014we :

[TABLE]
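The two summary measures can be illustrated with a short sketch. The $\eta_{p}$ equation is not visible in this excerpt, so the form used here is an assumption based on the verbal description: for each dataset, a classifier's Gmean is expressed as a percentage of the best Gmean achieved on that dataset, and these percentages are averaged over datasets. The function names and the toy numbers are hypothetical, and every dataset is assumed to have a positive best Gmean.

```python
def eta_m(gmeans):
    """Mean Gmean per classifier.

    gmeans: dict mapping classifier name -> list of Gmean values,
            one entry per dataset, same dataset order for all.
    """
    return {name: sum(vals) / len(vals) for name, vals in gmeans.items()}


def eta_p(gmeans):
    """Average percentage of the per-dataset maximum Gmean (assumed form)."""
    names = list(gmeans)
    n_datasets = len(gmeans[names[0]])
    # Best Gmean reached by any classifier on each dataset.
    best = [max(gmeans[c][j] for c in names) for j in range(n_datasets)]
    return {
        c: sum(100.0 * gmeans[c][j] / best[j] for j in range(n_datasets))
        / n_datasets
        for c in names
    }


if __name__ == "__main__":
    toy = {"A": [80.0, 50.0], "B": [40.0, 100.0]}  # made-up values
    print(eta_m(toy))
    print(eta_p(toy))
```

In the toy example, B has the higher mean Gmean while both classifiers obtain the same $\eta_{p}$, which shows that the two measures can rank classifiers differently.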

Moreover, the Friedman test is performed to verify the statistical significance of the obtained results. To this end, similar to fernandez2014we , we also compute the Friedman Rank ( $\eta_{f}$ ) demvsar2006statistical for ranking the classifiers.

3.4 Performance Comparison

The Gmean ( $\eta_{g}$ ) values of the $15$ kernel-based methods are provided in Tables 2-4 for financial, medical, and miscellaneous datasets, respectively. The best $\eta_{g}$ per dataset is displayed in boldface in these tables.

As per Table 2, for the $6$ financial credit approval datasets, one of the proposed variants performs better than all $11$ existing methods on every dataset except German(2). For the German(2) dataset, $LMKOC\mathchar 45\relax LLE\_\theta 2$ exhibits performance comparable to $GKOC\mathchar 45\relax SV$ . For the Australian(1) dataset, all $4$ variants yield significantly (> $4\%$ ) better results than all of the existing methods presented in Table 2. Specifically, $LMKOC\mathchar 45\relax LLE\_\theta 2$ and $GMKOC\mathchar 45\relax CDA\_\theta 2$ show improvements of $8.82\%$ and $6.8\%$ , respectively, over the best $\eta_{g}$ value of the existing methods for the Australian(1) dataset. For the Australian(2), Japan(1) and Japan(2) datasets, the best results obtained among the proposed methods exceed the best $\eta_{g}$ obtained among all existing methods by $2.53\%$ , $7.49\%$ , and $2.45\%$ , respectively. Moreover, out of the $6$ financial datasets, $LMKOC\mathchar 45\relax LLE\_\theta 2$ and $GMKOC\mathchar 45\relax CDA\_\theta 2$ yield the best $\eta_{g}$ for $3$ and $2$ datasets, respectively.

As per Table 3, out of $8$ medical datasets, one of the proposed variants performs better than all $11$ existing methods in case of every dataset. Moreover, $GMKOC\mathchar 45\relax CDA\_\theta 1$ and $GMKOC\mathchar 45\relax CDA\_\theta 2$ , each yields best $\eta_{g}$ for $4$ datasets. For Ecoli(1) and Heart(2) datasets, $GMKOC\mathchar 45\relax CDA\_\theta 1$ exhibits significant improvement of $2.79\%$ and $2.58\%$ , respectively, from the best $\eta_{g}$ value of the existing methods.

As discussed earlier, all $7$ miscellaneous datasets are imbalanced. Among the $7$ miscellaneous datasets in Table 4, one of the proposed variants performs better than all $11$ existing methods on every dataset except Glass(2) and Iono(1). In particular, for $3$ datasets, viz., Iono(2), Iris(1) and Iris(2), we obtain significant improvements of $10.36\%$ , $3.15\%$ , and $2.53\%$ , respectively. For the Glass(2) dataset, all $4$ proposed variants yield better results than all of the methods presented in Table 4 except $GKOC\mathchar 45\relax CV$ and $KPCA$ .

Overall, it can be observed from the above discussion and Tables 2-4 that $GMKOC\mathchar 45\relax CDA\_\theta 1$ , $GMKOC\mathchar 45\relax CDA\_\theta 2$ , $LMKOC\mathchar 45\relax LLE\_\theta 2$ , $LMKOC\mathchar 45\relax LLE\_\theta 1$ , $GKOC\mathchar 45\relax CV$ , $GKOC\mathchar 45\relax SV$ , $OCSVM$ , and $SVDD$ yield the best $\eta_{g}$ value for $6$ , $6$ , $4$ , $2$ , $1$ , $1$ , $1$ and $1$ datasets, respectively ( $OCSVM$ and $SVDD$ yield the best results for the same dataset, i.e., Iono(1)). Hence, it can be stated that global variance-based embedding performs better than local variance-based embedding in most of the cases. Further, we compute $\eta_{m}$ and $\eta_{p}$ for all of the classifiers to analyze the $\eta_{g}$ values more closely.

The performance of each method over the $21$ datasets using the $\eta_{m}$ metric is presented in Table 5 and is plotted in decreasing order in Fig. 2. The $\eta_{m}$ metric provides the average $\eta_{g}$ over the $21$ datasets for a classifier. Based on the results in Table 5, all $4$ proposed variants, i.e., $LMKOC\mathchar 45\relax LLE\_\theta 1$ , $LMKOC\mathchar 45\relax LLE\_\theta 2$ , $GMKOC\mathchar 45\relax CDA\_\theta 1$ , and $GMKOC\mathchar 45\relax CDA\_\theta 2$ , achieve the top $4$ positions among the $15$ one-class classifiers as per the $\eta_{m}$ criterion. Among the existing kernel-based one-class classifiers, $GKOC\mathchar 45\relax SV$ yields the best $\eta_{m}$ . It is to be noted that $GMKOC\mathchar 45\relax CDA\_\theta 1$ and $GMKOC\mathchar 45\relax CDA\_\theta 2$ yield the best $\eta_{g}$ for the maximum number (i.e., $6$ ) of datasets; however, $LMKOC\mathchar 45\relax LLE\_\theta 2$ emerges as the best classifier as per the $\eta_{m}$ criterion. This is due to a substantial improvement of $\eta_{g}$ for some of the datasets, viz., Australian(1), Japan(1), and Iris(1). Hence, in order to further analyze the performance of the competing one-class classifiers, $\eta_{p}$ is calculated as per (45), similar to fernandez2014we .

The $\eta_{p}$ metric indicates how close each classifier comes to the maximum $\eta_{g}$ value. As can be seen in Table 6, $LMKOC\mathchar 45\relax LLE\_\theta 1$ , $LMKOC\mathchar 45\relax LLE\_\theta 2$ , $GMKOC\mathchar 45\relax CDA\_\theta 1$ , and $GMKOC\mathchar 45\relax CDA\_\theta 2$ hold the top $4$ positions, similar to the ranking based on the $\eta_{m}$ values in Fig. 2. It is to be noted that $GMKOC\mathchar 45\relax CDA\_\theta 2$ yields the best $\eta_{g}$ for $6$ datasets and $LMKOC\mathchar 45\relax LLE\_\theta 2$ for $4$ datasets; however, $LMKOC\mathchar 45\relax LLE\_\theta 2$ yields a better $\eta_{p}$ value than $GMKOC\mathchar 45\relax CDA\_\theta 2$ . This shows that although $LMKOC\mathchar 45\relax LLE\_\theta 2$ did not yield the best $\eta_{g}$ for the maximum number of datasets, its $\eta_{g}$ values are closer (compared to $GMKOC\mathchar 45\relax CDA\_\theta 2$ ) to the best $\eta_{g}$ value on most of the datasets. In Fig. 3, the $\eta_{p}$ values of $10$ out of the $15$ one-class classifiers are plotted in increasing order for all of the datasets. Not all $15$ classifiers are plotted, for the sake of clear visibility of the plotted lines. The $10$ classifiers are selected as follows. Two of the four proposed variants, one global ( $GMKOC\mathchar 45\relax CDA\_\theta 1$ ) and one local variance-based ( $LMKOC\mathchar 45\relax LLE\_\theta 2$ ) multi-layer one-class classifier, are selected for plotting. Further, their corresponding single-layer one-class classifiers, viz., $GKOC\mathchar 45\relax CDA$ and $LKOC\mathchar 45\relax LLE$ , are also plotted. Out of the two minimum class variance-based classifiers ( $GKOC\mathchar 45\relax SV$ and $GKOC\mathchar 45\relax CV$ ), $GKOC\mathchar 45\relax SV$ is plotted as it yields the better $\eta_{p}$ . The remaining 5 one-class classifiers are also plotted along with the classifiers selected above.

The plotted lines of the two single-layer ( $GKOC\mathchar 45\relax CDA$ and $LKOC\mathchar 45\relax LLE$ ) and their corresponding multi-layer ( $GMKOC\mathchar 45\relax CDA\_\theta 1$ and $LMKOC\mathchar 45\relax LLE\_\theta 2$ ) one-class classifiers in Fig. 3 clearly indicate the substantial performance improvement of the multi-layer versions over the single-layer ones. Overall, Fig. 3 illustrates the clear superiority of the proposed multi-layer one-class classifiers over all $11$ existing methods. Moreover, $GMKOC\mathchar 45\relax CDA\_\theta 1$ obtains an $\eta_{p}$ value of more than $93\%$ for all datasets except German(1), Iris(2), and Iris(3). Detailed $\eta_{p}$ values for all $15$ classifiers over the $21$ datasets are available at https://goo.gl/QqUj4c.

The above discussion suggests that all $4$ proposed variants emerge as the best performing classifiers in terms of all employed performance evaluation criteria, viz., $\eta_{g}$ , $\eta_{m}$ , and $\eta_{p}$ . Nevertheless, statistical testing needs to be performed to verify this observation. In the next subsection, Friedman Rank ( $\eta_{f}$ ) testing is performed for this purpose.

3.5 Statistical Comparison

For comparing the performance of the $4$ proposed variants, viz., $LMKOC\mathchar 45\relax LLE\_\theta 1$ , $LMKOC\mathchar 45\relax LLE\_\theta 2$ , $GMKOC\mathchar 45\relax CDA\_\theta 1$ , and $GMKOC\mathchar 45\relax CDA\_\theta 2$ , with the $11$ existing kernel-based methods on $21$ benchmark datasets, a non-parametric Friedman test is employed. In the Friedman test, the null hypothesis states that the mean of each individual experimental treatment is not significantly different from the aggregate mean across all treatments, and the alternative hypothesis states the opposite. The Friedman test mainly computes three components, viz., F-score, p-value and Friedman Rank ( $\eta_{f}$ ). If the computed F-score is greater than the critical value at the tolerance level $\alpha=0.05$ , then one rejects the equality of means hypothesis (i.e., the null hypothesis). We employ the modified Friedman test demvsar2006statistical , which was proposed by Iman and Davenport iman1980approximations . The F-score obtained with this test is $6.33$ , which is greater than the critical value at the tolerance level $\alpha=0.05$ , i.e., $6.33>1.72$ . Hence, the null hypothesis can be rejected at the $95\%$ confidence level. The computed p-value of the Friedman test is $4.9414\times 10^{-11}$ , which is much lower than the tolerance value $\alpha=0.05$ . This small value indicates that the differences in the performance of the various methods are statistically significant.
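For reference, the Iman and Davenport statistic can be computed from the average ranks as in the sketch below, which follows the formulas given in demvsar2006statistical : $\chi^{2}_{F}=\frac{12N}{k(k+1)}\left[\sum_{j}R_{j}^{2}-\frac{k(k+1)^{2}}{4}\right]$ and $F_{F}=\frac{(N-1)\chi^{2}_{F}}{N(k-1)-\chi^{2}_{F}}$ , where $k$ is the number of classifiers, $N$ the number of datasets and $R_{j}$ the average rank of the $j^{th}$ classifier. The function names are hypothetical and the example ranks are made up; the sketch does not reproduce the paper's numbers.

```python
def friedman_chi2(avg_ranks, n_datasets):
    """Friedman chi-square statistic from average ranks R_j."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))) * (
        sum_sq - k * (k + 1) ** 2 / 4.0
    )


def iman_davenport_f(avg_ranks, n_datasets):
    """Iman-Davenport F statistic derived from the Friedman chi-square."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, n_datasets)
    denom = n_datasets * (k - 1) - chi2
    if denom == 0:
        # All datasets rank the classifiers identically.
        return float("inf")
    return (n_datasets - 1) * chi2 / denom


def degrees_of_freedom(k, n_datasets):
    """Degrees of freedom of the F distribution used for the critical value."""
    return k - 1, (k - 1) * (n_datasets - 1)


if __name__ == "__main__":
    ranks = [1.5, 1.5, 3.0]  # made-up average ranks for 3 classifiers
    print(friedman_chi2(ranks, 4), iman_davenport_f(ranks, 4))
    print(degrees_of_freedom(15, 21))
```

For $k=15$ classifiers and $N=21$ datasets the statistic is compared against the F distribution with $14$ and $280$ degrees of freedom.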

Afterwards, $\eta_{f}$ of each classifier is calculated to assign a rank to all $15$ one-class classifiers. The Friedman test ranks all methods on each dataset: it assigns rank $1$ to the best performing algorithm, rank $2$ to the second best, and so on. In case of ties, average ranks are assigned demvsar2006statistical . The $\eta_{f}$ values of all classifiers are provided in increasing order (a lower value of $\eta_{f}$ indicates better performance) in Table 5. These values are visualized in Fig. 2 in decreasing order of $\eta_{m}$ . All $4$ proposed variants still achieve the top four positions, as with the $\eta_{m}$ and $\eta_{p}$ metrics. From Table 5 and Fig. 2, it can be observed that $\eta_{f}$ of most of the classifiers follows a similar pattern to $\eta_{m}$ , i.e., $\eta_{f}$ increases as $\eta_{m}$ decreases. However, some one-class classifiers do not follow this pattern, e.g., $GKOC\mathchar 45\relax CV$ , which has a better $\eta_{m}$ but an inferior $\eta_{f}$ compared to $OCSVM$ and $SVDD$ . Among the $4$ proposed variants, the global variance-based methods ( $GMKOC\mathchar 45\relax CDA\_\theta 1$ and $GMKOC\mathchar 45\relax CDA\_\theta 2$ ) outperform the local variance-based methods ( $LMKOC\mathchar 45\relax LLE\_\theta 1$ and $LMKOC\mathchar 45\relax LLE\_\theta 2$ ). There is even a significant difference ( $1.52$ ) between the $\eta_{f}$ values of $GMKOC\mathchar 45\relax CDA\_\theta 1$ and $LMKOC\mathchar 45\relax LLE\_\theta 1$ . The above analysis indicates that a one-class classifier with a better $\eta_{f}$ value has better generalization capability compared to the other existing methods.
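The ranking procedure, including average ranks for ties, can be sketched as follows. The function names and the toy scores are hypothetical; the sketch only illustrates the ranking scheme described above, with higher scores receiving better (lower) ranks.

```python
def rank_row(scores):
    """Rank one dataset's scores: rank 1 is the highest score.

    Tied scores share the average of the ranks they span.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = avg_rank
        pos = end + 1
    return ranks


def friedman_ranks(table):
    """Average rank per classifier.

    table[d][c] is the score of classifier c on dataset d.
    """
    rows = [rank_row(row) for row in table]
    n_classifiers = len(table[0])
    return [sum(r[c] for r in rows) / len(rows) for c in range(n_classifiers)]


if __name__ == "__main__":
    print(rank_row([0.9, 0.8, 0.9, 0.5]))
    print(friedman_ranks([[0.9, 0.8], [0.6, 0.7]]))
```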

Overall, after the performance analysis of all $15$ one-class classifiers, it is observed that none of the existing one-class classifiers performs better than the proposed multi-layer one-class classifiers in terms of any of the discussed performance criteria.

4 Conclusion

This paper has presented $4$ variants of a Graph-Embedded multi-layer $KRR$ -based one-class classifier. It is constructed by stacking several Graph-Embedded Auto-Encoders followed by a Graph-Embedded $KRR$ -based one-class classifier. Stacking Graph-Embedded Auto-Encoders through multiple layers helps the proposed classifiers achieve better generalization and data representation capability. Overall, two types of training processes are involved, i.e., one for the Auto-Encoders and the other for the one-class classifier. We have explored two types of Graph-Embedding, local and global variance-based embedding, in the kernel space of each layer using the Laplacian graph. The $LLE$ and $CDA$ Laplacian graphs are employed for local and global embedding, respectively. Extensive experimental comparisons have been provided with $11$ state-of-the-art kernel feature mapping based one-class classifiers over $21$ publicly available datasets in terms of $\eta_{g}$ , $\eta_{m}$ , $\eta_{p}$ , and $\eta_{f}$ . These experiments have shown that the proposed multi-layer one-class classifier provides state-of-the-art performance and outperforms all $11$ existing one-class classifiers. Moreover, the statistical significance of the results has been verified by the Friedman test. As per the Friedman Rank, the global variance-based proposed variants outperform the local variance-based variants. In future work, other types of Auto-Encoders can be explored to enhance the performance of the proposed multi-layer architecture.

Funding Information: This research was supported by Department of Electronics and Information Technology (DeITY, Govt. of India) under Visvesvaraya PhD scheme for electronics & IT.

5 Compliance with Ethical Standards

Conflict of Interest: The authors declare that they have no conflict of interest. Ethical approval: This article does not contain any studies with human participants or animals performed by any of the authors.

Bibliography

[1] M. M. Moya, M. W. Koch, and L. D. Hostetler. One-class classifier networks for target recognition applications. Technical report, Sandia National Labs., Albuquerque, NM (United States), 1993.
[2] S. S. Khan and M. G. Madden. A survey of recent trends in one class classification. In Irish conference on Artificial Intelligence and Cognitive Science, pages 188–197. Springer, 2009.
[3] M. A. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[4] Y. Xu and C. Liu. A rough margin-based one class support vector machine. Neural Computing and Applications, 22(6):1077–1084, 2013.
[5] J. Hamidzadeh and M. Moradi. Improved one-class classification using filled function. Applied Intelligence, pages 1–17, 2018.
[6] Y. Xiao, B. Liu, L. Cao, X. Wu, C. Zhang, Z. Hao, F. Yang, and J. Cao. Multi-sphere support vector data description for outliers detection on multi-distribution data. In Data Mining Workshops, 2009. ICDMW’09. IEEE International Conference on, pages 82–87. IEEE, 2009.
[7] A. RT Gepperth, T. Hecht, and M. Gogate. A generative learning approach to sensor fusion and change detection. Cognitive Computation, 8(5):806–817, 2016.
[8] G. Luria, A. Kahana, and S. Rosenblum. Detection of deception via handwriting behaviors using a computerized tool: Toward an evaluation of malingering. Cognitive Computation, 6(4):849–855, 2014.
