Sparse Least Squares Low Rank Kernel Machines

Di Xu; Manjing Fang; Xia Hong; Junbin Gao

arXiv:1901.10098·cs.LG·October 22, 2019

TL;DR

This paper introduces LR-LSSVM, a sparse, low-rank kernel machine framework that enhances computational efficiency and model sparsity, validated through experiments with robust RBF kernels showing competitive or superior performance.

Contribution

The paper proposes a novel low rank kernel support vector machine framework with a two-step optimization algorithm, improving sparsity and efficiency over existing models.

Findings

1. Performance is comparable or superior to existing kernel machines.

2. The model achieves sparsity and computational efficiency.

3. Validated with experiments using robust RBF kernels.

Abstract

A general framework of least squares support vector machine with low rank kernels, referred to as LR-LSSVM, is introduced in this paper. The special structure of low rank kernels with a controlled model size brings sparsity as well as computational efficiency to the proposed model. Meanwhile, a two-step optimization algorithm with three different criteria is proposed and various experiments are carried out using the example of the so-called robust RBF kernel to validate the model. The experimental results show that the performance of the proposed algorithm is comparable or superior to several existing kernel machines.

Tables

TABLE I: The misclassification rate on synthetic data

Model	Testing Misclassification Rate (%)	Model Size
LSSVM-Gaussian ($\sigma = 0.5$)	11.40	250
LSSVM-Gaussian ($\sigma = 1.0$)	9.20	250
LSSVM-Gaussian ($\sigma = 1.5$)	10.40	250
LSSVM-Gaussian ($\sigma = 2.0$)	10.10	250
LSSVM-Gaussian ($\sigma = 2.5$)	10.10	250
LSSVM-Gaussian ($\sigma = 3.0$)	9.80	250
LSSVM-SBF	8.30	4
Proposed Model (abs obj.)	8.00	3
Proposed Model (square obj.)	8.30	3
Proposed Model (target obj.)	8.00	3

TABLE II: The misclassification rate on different datasets

Model	Titanic		Diabetes		German Credit	
	Misclassification Rate (%)	Model Size	Misclassification Rate (%)	Model Size	Misclassification Rate (%)	Model Size
RBF	23.3 $\pm$ 1.3	4	24.3 $\pm$ 2.3	15	24.7 $\pm$ 2.4	8
Adaboost with RBF	22.6 $\pm$ 1.2	4	26.5 $\pm$ 1.9	15	27.5 $\pm$ 2.5	8
AdaBoostReg	22.6 $\pm$ 1.2	4	23.8 $\pm$ 1.8	15	24.3 $\pm$ 2.1	8
LPReg-AdaBoost	24.0 $\pm$ 4.4	4	24.1 $\pm$ 1.9	15	24.8 $\pm$ 2.2	8
QPReg-AdaBoost	22.7 $\pm$ 1.1	4	25.4 $\pm$ 2.2	15	25.3 $\pm$ 2.1	8
SVM with RBF kernel	22.4 $\pm$ 1.0	not available	23.5 $\pm$ 1.7	not available	23.6 $\pm$ 2.1	not available
LSSVM-SBF	22.5 $\pm$ 0.8	2	23.5 $\pm$ 1.7	5	24.9 $\pm$ 1.9	3
Proposed Model (abs obj.)	22.3 $\pm$ 0.8	2	23.8 $\pm$ 1.7	5	25.6 $\pm$ 2.3	2
Proposed Model (square obj.)	22.6 $\pm$ 1.5	3	23.5 $\pm$ 2.0	4	24.7 $\pm$ 1.9	2
Proposed Model (target obj.)	22.4 $\pm$ 0.8	2	24.7 $\pm$ 2.0	5	25.6 $\pm$ 2.4	2

Equations

$$\boldsymbol{x}\in\mathbb{R}^{D}\to\boldsymbol{\phi}(\boldsymbol{x})\in\mathcal{F},$$

$$k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=\langle\boldsymbol{\phi}(\boldsymbol{x}_{i}),\boldsymbol{\phi}(\boldsymbol{x}_{j})\rangle,$$

$$y(\boldsymbol{x})=\langle\boldsymbol{\phi}(\boldsymbol{x}),\boldsymbol{w}\rangle+b,$$

$$\boldsymbol{w}=\sum_{n=1}^{N}a_{n}t_{n}\boldsymbol{\phi}(\boldsymbol{x}_{n}).$$

$$\boldsymbol{k}(\boldsymbol{x},\boldsymbol{X})=[k(\boldsymbol{x},\boldsymbol{x}_{1}),k(\boldsymbol{x},\boldsymbol{x}_{2}),\dots,k(\boldsymbol{x},\boldsymbol{x}_{N})]^{T}\in\mathbb{R}^{N}.$$

$$y(\boldsymbol{x})=\boldsymbol{k}(\boldsymbol{x},\boldsymbol{X})^{T}(\boldsymbol{a}\circ\boldsymbol{t})+b.$$

$$\boldsymbol{K}=\begin{bmatrix}k(\boldsymbol{x}_{1},\boldsymbol{x}_{1})&\dots&k(\boldsymbol{x}_{1},\boldsymbol{x}_{N})\\ \vdots&\ddots&\vdots\\ k(\boldsymbol{x}_{N},\boldsymbol{x}_{1})&\dots&k(\boldsymbol{x}_{N},\boldsymbol{x}_{N})\end{bmatrix}\in\mathbb{R}^{N\times N}$$

$$\boldsymbol{\Omega}=\begin{bmatrix}t_{1}t_{1}k(\boldsymbol{x}_{1},\boldsymbol{x}_{1})&\dots&t_{1}t_{N}k(\boldsymbol{x}_{1},\boldsymbol{x}_{N})\\ \vdots&\ddots&\vdots\\ t_{N}t_{1}k(\boldsymbol{x}_{N},\boldsymbol{x}_{1})&\dots&t_{N}t_{N}k(\boldsymbol{x}_{N},\boldsymbol{x}_{N})\end{bmatrix}\in\mathbb{R}^{N\times N}.$$

$$\min_{\boldsymbol{a}}\ (\boldsymbol{a}\circ\boldsymbol{t})^{T}\boldsymbol{K}(\boldsymbol{a}\circ\boldsymbol{t})-\boldsymbol{1}^{T}\boldsymbol{a},\quad\text{s.t.}\ \boldsymbol{1}^{T}(\boldsymbol{a}\circ\boldsymbol{t})=0\ \text{and}\ \boldsymbol{0}\leq\boldsymbol{a}\leq C,$$

$$\min_{\boldsymbol{w},b,\boldsymbol{\eta}}\ \frac{1}{2}\|\boldsymbol{w}\|_{F}^{2}+\frac{\gamma}{2}\sum_{n=1}^{N}\eta_{n}^{2},$$

$$t_{n}(\langle\boldsymbol{w},\boldsymbol{\phi}(\boldsymbol{x}_{n})\rangle+b)=1-\eta_{n},\quad n=1,\dots,N,$$

$$\begin{bmatrix}b\\ \boldsymbol{a}\end{bmatrix}=\begin{bmatrix}0&\boldsymbol{t}^{T}\\ \boldsymbol{t}&\boldsymbol{\Omega}+\boldsymbol{I}/\gamma\end{bmatrix}^{-1}\begin{bmatrix}0\\ \boldsymbol{1}\end{bmatrix}$$

$$\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})=\max\left\{0,\ 1-\sum_{i=1}^{D}\mu_{i,j}|x_{i}-c_{i,j}|\right\},$$

$$k(\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime})=\sum_{j=1}^{M}\phi_{j}(\boldsymbol{x}^{\prime};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})^{T}\phi_{j}(\boldsymbol{x}^{\prime\prime};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})$$

$$y(\boldsymbol{x})=[\boldsymbol{\alpha}(\boldsymbol{x})]^{T}\boldsymbol{x}+\beta(\boldsymbol{x}).$$

$$\beta(\boldsymbol{x})=\sum_{j\in S(\boldsymbol{x})}\theta_{j}\left(1-\sum_{i=1}^{D}\mu_{i,j}c_{i,j}\,\text{sign}(c_{i,j}-x_{i})\right)+b$$

$$\boldsymbol{\alpha}(\boldsymbol{x})=[\alpha_{1}(\boldsymbol{x}),\dots,\alpha_{D}(\boldsymbol{x})]^{T},\quad\text{in which}\quad\alpha_{i}(\boldsymbol{x})=\sum_{j\in S(\boldsymbol{x})}\theta_{j}\mu_{i,j}\,\text{sign}(c_{i,j}-x_{i}),\quad i=1,\dots,D$$

$$\theta_{j}=\sum_{n=1}^{N}a_{n}t_{n}\phi_{j}(\boldsymbol{x}_{n};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}).$$

$$\begin{bmatrix}b\\ \boldsymbol{a}\end{bmatrix}=\boldsymbol{q}-\boldsymbol{P}\widetilde{\boldsymbol{\Phi}}(\boldsymbol{I}+\widetilde{\boldsymbol{\Phi}}^{T}\boldsymbol{P}\widetilde{\boldsymbol{\Phi}})^{-1}\widetilde{\boldsymbol{\Phi}}^{T}\boldsymbol{q},$$

$$\boldsymbol{P}=\frac{1}{N}\begin{bmatrix}-1/\gamma&\boldsymbol{t}^{T}\\ \boldsymbol{t}&\gamma(N\boldsymbol{I}-\boldsymbol{t}\boldsymbol{t}^{T})\end{bmatrix},\quad\boldsymbol{q}=\frac{1}{N}\begin{bmatrix}\boldsymbol{t}^{T}\boldsymbol{1}\\ \gamma(N\boldsymbol{I}-\boldsymbol{t}\boldsymbol{t}^{T})\boldsymbol{1}\end{bmatrix}$$

$$\widetilde{\boldsymbol{\Phi}}=\begin{bmatrix}0&\dots&0\\ \boldsymbol{t}\circ\boldsymbol{\phi}_{1}&\dots&\boldsymbol{t}\circ\boldsymbol{\phi}_{M}\end{bmatrix},$$

$$\phi_{j}(\boldsymbol{x};\boldsymbol{\lambda}_{j}):\quad j=1,2,...,M.$$

$$\boldsymbol{\lambda}_{j}=\{\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}\}.$$

$$\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})=\exp\left\{-\sum_{i=1}^{D}\mu_{i,j}|x_{i}-c_{i,j}|\right\}.$$

$$\boldsymbol{\phi}_{r}:\boldsymbol{x}\in\mathbb{R}^{D}\to\boldsymbol{\phi}_{r}(\boldsymbol{x})=\begin{bmatrix}\phi_{1}(\boldsymbol{x};\boldsymbol{\lambda}_{1})\\ \vdots\\ \phi_{M}(\boldsymbol{x};\boldsymbol{\lambda}_{M})\end{bmatrix}\in\mathcal{F}.$$

$$k(\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime})=\sum_{j=1}^{M}\phi_{j}(\boldsymbol{x}^{\prime};\boldsymbol{\lambda}_{j})^{T}\phi_{j}(\boldsymbol{x}^{\prime\prime};\boldsymbol{\lambda}_{j}).$$

$$\min_{\boldsymbol{w},b,\boldsymbol{\eta}}\ \frac{1}{2}\|\boldsymbol{w}\|_{F}^{2}+\frac{\gamma}{2}\sum_{n=1}^{N}\eta_{n}^{2},$$

$$t_{n}(\langle\boldsymbol{w},\boldsymbol{\phi}_{r}(\boldsymbol{x}_{n})\rangle+b)=1-\eta_{n},\quad n=1,\dots,N.$$

L (w, b, η; a) =

- \sum_{n=1}^{N} a_{n} \{t_{n} (

\frac{\partial L}{\partial w} = 0 \to

\frac{\partial L}{\partial b} = 0 \to

Taxonomy

Topics: Machine Learning and ELM · Face and Expression Recognition · Advanced Algorithms and Applications

Full text

Sparse Least Squares Low Rank Kernel Machines

(Both Di Xu and Manjing Fang are students enrolled in the Master of Commerce at the University of Sydney and contributed equally to the project.)

Di Xu and Manjing Fang

Discipline of Business Analytics

The University of Sydney Business School

The University of Sydney

Sydney, NSW 2006, Australia

{dixu3140,mfan9400}@uni.sydney.edu.au

Xia Hong

Department of Computer Science

University of Reading

Reading RG6 6AH, UK

[email protected]

Junbin Gao

Discipline of Business Analytics

The University of Sydney Business School

The University of Sydney

Sydney, NSW 2006, Australia

[email protected]

Index Terms:

Least Squares Support Vector Machine; Low Rank Kernels; Robust RBF Function; End-to-end modeling.

I Introduction

With the proliferation of big data in scientific and business research, practical nonlinear modeling calls for sparse models that can be built with efficient algorithms. Kernel machines (KMs) have attracted great attention since the support vector machine (SVM), a well-known linear binary classification model based on the principle of risk minimization, was introduced in the early 1990s [1]. KMs extend SVM by implementing the linear model in a so-called high dimensional feature space under a feature mapping implicitly determined by a Mercer kernel function. Both SVM and KMs have also been applied to regression problems [2]. Commonly used kernels include the radial basis function (RBF) kernel, the polynomial kernel, and the Fisher kernel [3]. As one of the best-known members of the KM family, SVM has the advantages of good generalization and insensitivity to overfitting [4].

To date, the Gaussian RBF kernel is the most common choice for SVM in practice. SVM with the RBF kernel has been widely used and has shown strong prediction performance in many areas such as text categorization [5], image recognition [6], bioinformatics [7], credit scoring [8], time series forecasting [9], and weather forecasting [10]. Text categorization, or text classification, assigns documents to predefined categories. SVM and KMs work well for this task because the high dimensional text or dense concept representation can be easily mapped into a latent feature space where a linear prediction model is learned with an appropriately chosen kernel function [11]. The experimental results indicate that SVM with the RBF kernel outperforms other classification methods [5]. The superior performance of SVM with the RBF kernel on high dimensional small datasets has also been demonstrated in remote sensing [12], by carefully choosing feature mappings.

The performance of SVM largely depends on the kernel type, and the RBF kernel SVM has been reported to outperform other classifiers in various classification scenarios [6, 5, 7]. Nonetheless, in practical nonlinear modeling, SVM with the standard Gaussian RBF kernel has a non-negligible limitation in separating some nonlinear decision boundaries. Thus, the optimization of RBF kernels has attracted growing attention. The result given in [13] demonstrates that, after introducing an information-geometric data-dependent method to modify a kernel (e.g., the RBF kernel), the performance of SVM is considerably improved. Yu et al. [14] enhance the kernel metrics by adding regularization into kernel machines (e.g., the RBF kernel SVM).

One of the advantages of the standard SVM model is its sparsity, determined by the so-called support vectors. However, the sparsity cannot be pre-determined, and the support vectors have to be learned from the training data by solving a computationally demanding quadratic programming problem [15]. Considerable progress has been made in developing computationally efficient algorithms for SVM models. One example is the least squares version of the support vector machine (LSSVM) [16]. Instead of the margin constraints in the standard SVM, LSSVM introduces equality constraints in the model formulation. The resulting quadratic programming problem can be solved through a set of linear equations [16]. However, LSSVM loses the sparseness offered by the original SVM method: the kernel model has to evaluate all possible pairs of data in the kernel function, and is therefore inferior to the standard SVM model in inference for large scale data learning. To maintain both the sparsity offered by the standard SVM and the equality constraints of LSSVM, researchers have extended LSSVM to the Ramp loss function, producing sparse models at extra computational cost; see [17]. This strategy has been extended to a more general insensitive loss function in [18]. Recently, Zhu et al. [19] proposed a way to select effective patterns from training datasets for fast support vector regression learning. However, there is no extension to classification problems yet.

The need to deal with large scale datasets motivates exploring new approaches to sparse models under the broad framework of both SVM and KMs. Chen [20] proposed a method for building a sparse kernel model by extending the so-called orthogonal least squares (OLS) algorithm [21] with kernel techniques. The OLS-assisted sparse kernel model appears to offer an efficient learning procedure, with particularly good performance in nonlinear system identification. The OLS algorithm relies on a greedy sequential selection of the kernel regressors under an orthogonality requirement, which imposes extra computational cost. Based on the so-called significant vectors, Gao et al. [22] proposed a more straightforward way to learn the significant regressors from training data for kernel regression modelling. This type of approach has its roots in the relevance vector machine (RVM) [23]. RVM is implemented under the Bayesian learning framework of kernel machines and has inference performance comparable to the standard SVM with dramatically fewer kernel terms, offering great sparsity.

Almost all the aforementioned modeling methods build models by learning or extracting key data points or patterns from the entire training dataset. Recently, the authors proposed a new type of low rank kernel model based on the so-called simplex basis functions (SBF) [15], successfully building a sparse and fast modeling algorithm and thus lowering the computational cost of LSSVM. The model size is no longer determined by the given training data, while the key patterns are learned directly. We further explore the idea and extend it to the so-called robust radial basis functions. The main contributions of this paper are summarized as follows:

1. Given that the aforementioned models learn data patterns under the regression setting, this paper focuses on the classification setting with a controlled or pre-defined model size;

2. The kernel function proposed in this paper takes the form of a composition of basic basis components which are adaptive to the training data. This composition form opens the door to a fast closed form solution, avoiding the issue of kernel matrix inversion in the case of large scale datasets;

3. A new criterion is proposed for the final model selection in terms of pattern parameters of location and scale; and

4. A two-step optimization algorithm is proposed to simplify the learning procedure.

The rest of this paper is organized as follows. In Section II, we present a brief background on several related models. Section III proposes our robust RBF kernel function and its classification model. Section IV describes the artificial and real-world datasets and reports several experiments that demonstrate the performance of the model and algorithm. Section V concludes the paper.

II Background and Notation

In this section, we start by introducing the notation needed to present our model and algorithm. We mainly consider binary classification problems. For multi-class classification problems, the commonly used heuristic approach of “one-vs-all” or “one-vs-one” can be adopted as usual.

We are given a training dataset $\mathcal{D}=(\boldsymbol{X},\boldsymbol{t})=\{(\boldsymbol{x}_{n},t_{n})\}^{N}_{n=1}$ where $N$ is the number of data points, $\boldsymbol{x}_{n}\in\mathbb{R}^{D}$ is the feature vector and $t_{n}\in\{-1,1\}$ is the label of the $n$ -th data point.

KM methods have been used as a universal approximator in data modeling. The core idea of the KMs is to implement a linear model in a high dimensional feature space by using a feature mapping $\boldsymbol{\phi}$ defined as [1]

$$\boldsymbol{x}\in\mathbb{R}^{D}\to\boldsymbol{\phi}(\boldsymbol{x})\in\mathcal{F},$$

which induces a Mercer kernel function in the input space

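$$k(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=\langle\boldsymbol{\phi}(\boldsymbol{x}_{i}),\boldsymbol{\phi}(\boldsymbol{x}_{j})\rangle,$$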

where $\langle\cdot,\cdot\rangle$ is the inner product on the feature space $\mathcal{F}$ .

In general, an affine linear model of KMs is defined as

$$y(\boldsymbol{x})=\langle\boldsymbol{\phi}(\boldsymbol{x}),\boldsymbol{w}\rangle+b,\tag{1}$$

where $b\in\mathbb{R}$ is the bias parameter and $\boldsymbol{w}\in\mathcal{F}$ is the parameter vector of high dimensionality, most likely in infinite dimension. It is infeasible to solve for the parameter vector $\boldsymbol{w}$ directly. Instead, the so-called kernel trick transforms the infinite dimension problem to a finite dimension problem by relating the parameters $\boldsymbol{w}$ to the data as

$$\boldsymbol{w}=\sum_{n=1}^{N}a_{n}t_{n}\boldsymbol{\phi}(\boldsymbol{x}_{n}).\tag{2}$$

A learning algorithm will focus on solving for $N$ parameters $\boldsymbol{a}=(a_{1},a_{2},...,a_{N})^{T}\in\mathbb{R}^{N}$ under an appropriate learning criterion.

For the sake of convenience, define

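$$\boldsymbol{k}(\boldsymbol{x},\boldsymbol{X})=[k(\boldsymbol{x},\boldsymbol{x}_{1}),k(\boldsymbol{x},\boldsymbol{x}_{2}),\dots,k(\boldsymbol{x},\boldsymbol{x}_{N})]^{T}\in\mathbb{R}^{N}.$$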

Then, under (2), model (1) can be expressed in terms of the new parameters $\boldsymbol{a}$ as

$$y(\boldsymbol{x})=\boldsymbol{k}(\boldsymbol{x},\boldsymbol{X})^{T}(\boldsymbol{a}\circ\boldsymbol{t})+b.\tag{3}$$

where $\circ$ means the component-wise product of two vectors. (If we are considering a regression problem, there is no need to include $t_{n}$ in the model (3).)

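All KM algorithms involve the so-called kernel matrix, defined as

$$\boldsymbol{K}=\begin{bmatrix}k(\boldsymbol{x}_{1},\boldsymbol{x}_{1})&\dots&k(\boldsymbol{x}_{1},\boldsymbol{x}_{N})\\ \vdots&\ddots&\vdots\\ k(\boldsymbol{x}_{N},\boldsymbol{x}_{1})&\dots&k(\boldsymbol{x}_{N},\boldsymbol{x}_{N})\end{bmatrix}\in\mathbb{R}^{N\times N}$$

and

$$\boldsymbol{\Omega}=\begin{bmatrix}t_{1}t_{1}k(\boldsymbol{x}_{1},\boldsymbol{x}_{1})&\dots&t_{1}t_{N}k(\boldsymbol{x}_{1},\boldsymbol{x}_{N})\\ \vdots&\ddots&\vdots\\ t_{N}t_{1}k(\boldsymbol{x}_{N},\boldsymbol{x}_{1})&\dots&t_{N}t_{N}k(\boldsymbol{x}_{N},\boldsymbol{x}_{N})\end{bmatrix}\in\mathbb{R}^{N\times N}.$$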

Both $\boldsymbol{K}$ and $\boldsymbol{\Omega}$ are symmetric matrices of size $N\times N$ .
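As a minimal sketch of the notation above, the following NumPy snippet builds $\boldsymbol{K}$ and $\boldsymbol{\Omega}$ and evaluates model (3). The Gaussian kernel, its width `sigma`, and all function names are assumptions chosen for illustration; any Mercer kernel could be substituted.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel between the rows of A (n x D) and B (m x D)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_matrices(X, t, sigma=1.0):
    """Return K and Omega, where Omega[i, j] = t_i * t_j * k(x_i, x_j)."""
    K = gaussian_kernel(X, X, sigma)
    Omega = np.outer(t, t) * K
    return K, Omega

def predict(X_new, X, t, a, b, sigma=1.0):
    """Model (3): y(x) = k(x, X)^T (a o t) + b, evaluated for each row of X_new."""
    k = gaussian_kernel(X_new, X, sigma)   # each row holds k(x, X)^T
    return k @ (a * t) + b
```

Both matrices cost $O(N^{2})$ memory, which is the overhead the low rank construction later in the paper is designed to avoid.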

In the following subsections, the standard SVM, LSSVM and the sparse least squares support vector machine using simplex basis functions (LSSVM-SBF) [15] are outlined.

II-A C-SVM

The standard support vector machine (C-SVM) imposes the so-called maximal margin criterion inducing a kernel model where the parameter $\boldsymbol{a}$ (and $b$ ) can be obtained by solving the following dual Lagrangian problem

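$$\min_{\boldsymbol{a}}\ (\boldsymbol{a}\circ\boldsymbol{t})^{T}\boldsymbol{K}(\boldsymbol{a}\circ\boldsymbol{t})-\boldsymbol{1}^{T}\boldsymbol{a},\quad\text{s.t.}\ \boldsymbol{1}^{T}(\boldsymbol{a}\circ\boldsymbol{t})=0\ \text{and}\ \boldsymbol{0}\leq\boldsymbol{a}\leq C,\tag{4}$$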

where $\boldsymbol{1}$ is the vector with all ones in appropriate dimension. The parameter $b$ can be easily calculated from the support vectors [1].

The margin criterion guarantees that the resulting kernel model (3) is sparse, as only those parameters $a_{n}$ corresponding to the support vectors $\boldsymbol{x}_{n}$ are non-zero. However, when $N$ is large, solving the convex quadratic programming problem (4) to identify such support vectors is very time consuming.

II-B LSSVM

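To reduce the computational complexity of the standard SVM, the least squares support vector machine replaces the margin constraints with equality constraints.

The standard LSSVM is formulated as the following programming problem

$$\min_{\boldsymbol{w},b,\boldsymbol{\eta}}\ \frac{1}{2}\|\boldsymbol{w}\|_{F}^{2}+\frac{\gamma}{2}\sum_{n=1}^{N}\eta_{n}^{2},\quad\text{s.t.}\ t_{n}(\langle\boldsymbol{w},\boldsymbol{\phi}(\boldsymbol{x}_{n})\rangle+b)=1-\eta_{n},\ n=1,\dots,N,\tag{5}$$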

where $\gamma>0$ is a penalty parameter.

With the given equality constraints, the Lagrangian multiplier method produces a kernel model (3) such that the parameters $\boldsymbol{a}$ and $b$ are given by the following set of closed form linear equations

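$$\begin{bmatrix}b\\ \boldsymbol{a}\end{bmatrix}=\begin{bmatrix}0&\boldsymbol{t}^{T}\\ \boldsymbol{t}&\boldsymbol{\Omega}+\boldsymbol{I}/\gamma\end{bmatrix}^{-1}\begin{bmatrix}0\\ \boldsymbol{1}\end{bmatrix}\tag{6}$$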

where $\boldsymbol{I}$ is the identity matrix of size $N\times N$ . However, the computational hurdle lies in the massive matrix inverse in (6) which has complexity of order $O(N^{3})$ .
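The closed form (6) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `lssvm_solve` is hypothetical, and a linear solve is used in place of an explicit matrix inverse, which is a common numerical choice rather than something the text prescribes.

```python
import numpy as np

def lssvm_solve(Omega, t, gamma):
    """Solve the (N+1) x (N+1) linear system behind (6) for the bias b and multipliers a."""
    N = t.shape[0]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = t                           # first row:    [0, t^T]
    A[1:, 0] = t                           # first column: [0; t]
    A[1:, 1:] = Omega + np.eye(N) / gamma  # lower-right block
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)          # dense solve, O(N^3)
    return sol[0], sol[1:]
```

The first row of the system enforces $\boldsymbol{t}^{T}\boldsymbol{a}=0$, and the remaining rows enforce $\boldsymbol{t}b+(\boldsymbol{\Omega}+\boldsymbol{I}/\gamma)\boldsymbol{a}=\boldsymbol{1}$; both can be checked on the returned values.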

II-C LSSVM-SBF

Despite the closed-form solution obtained by LSSVM, the model has two main limitations. First, calculating the matrix inverse is computationally demanding. Second, the model is non-sparse, which means that it has to evaluate the kernel for all possible pairs of inputs, making the model infeasible for large datasets. As an alternative, we have proposed a novel kernel method referred to as LSSVM-SBF [15], which overcomes these two issues by introducing a symmetric structure in a specially designed kernel function based on the so-called low rank Simplex Basis Function (SBF) kernel.

The SBF $\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})$ is defined as

$$\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})=\max\left\{0,\ 1-\sum_{i=1}^{D}\mu_{i,j}|x_{i}-c_{i,j}|\right\},\tag{7}$$

where $\boldsymbol{c}_{j}=[c_{1,j},\cdots,c_{D,j}]^{T}\in\mathbb{R}^{D}$ is the center vector of the $j$ th SBF, which adjusts its location, and $\boldsymbol{\mu}_{j}=[\mu_{1,j},\cdots,\mu_{D,j}]^{T}\in\mathbb{R}^{D}_{+}$ is the shape vector of the $j$ th SBF, which adjusts its shape. The new kernel proposed in [15] is defined as

$$k(\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime})=\sum_{j=1}^{M}\phi_{j}(\boldsymbol{x}^{\prime};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})^{T}\phi_{j}(\boldsymbol{x}^{\prime\prime};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})\tag{8}$$

in which the SBF kernels use only $M\ll N$ basis functions. $M$ is the pre-defined model size.
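The SBF (7) and the kernel (8) can be sketched as follows. The storage layout is an assumption made for this illustration: `mu` and `c` are arrays of shape `(M, D)`, so that `mu[j, i]` plays the role of $\mu_{i,j}$, and the function names are hypothetical.

```python
import numpy as np

def sbf_features(X, mu, c):
    """SBF (7): Phi[n, j] = max(0, 1 - sum_i mu[j, i] * |X[n, i] - c[j, i]|)."""
    dist = (mu[None, :, :] * np.abs(X[:, None, :] - c[None, :, :])).sum(axis=-1)
    return np.maximum(0.0, 1.0 - dist)

def low_rank_kernel(XA, XB, mu, c):
    """Kernel (8): k(x', x'') = sum_j phi_j(x') * phi_j(x'')."""
    return sbf_features(XA, mu, c) @ sbf_features(XB, mu, c).T
```

Because the $N\times N$ kernel matrix is the product of an $N\times M$ feature matrix with its transpose, its rank is at most $M$, which is what makes the later $M\times M$ solve possible.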

It has been proved in [15] that, under the kernel (8) with the SBF (7), the resulting model is piecewise locally linear with respect to the input $\boldsymbol{x}$ as

$$y(\boldsymbol{x})=[\boldsymbol{\alpha}(\boldsymbol{x})]^{T}\boldsymbol{x}+\beta(\boldsymbol{x}).$$

Here we have defined

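$$\beta(\boldsymbol{x})=\sum_{j\in S(\boldsymbol{x})}\theta_{j}\left(1-\sum_{i=1}^{D}\mu_{i,j}c_{i,j}\,\text{sign}(c_{i,j}-x_{i})\right)+b$$

$$\boldsymbol{\alpha}(\boldsymbol{x})=[\alpha_{1}(\boldsymbol{x}),\dots,\alpha_{D}(\boldsymbol{x})]^{T},\quad\text{in which}\quad\alpha_{i}(\boldsymbol{x})=\sum_{j\in S(\boldsymbol{x})}\theta_{j}\mu_{i,j}\,\text{sign}(c_{i,j}-x_{i}),\quad i=1,\dots,D,$$

where $S(\boldsymbol{x})\subseteq\{1,2,...,M\}$ is the set of indices $j$ satisfying the condition $\sum^{D}_{i=1}\mu_{i,j}|x_{i}-c_{i,j}|<1$ , and

$$\theta_{j}=\sum_{n=1}^{N}a_{n}t_{n}\phi_{j}(\boldsymbol{x}_{n};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}).$$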

With the low rank kernel structure defined as (8), the closed form solution (6) for $\boldsymbol{a}$ and $b$ can be rewritten as, see [15],

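$$\begin{bmatrix}b\\ \boldsymbol{a}\end{bmatrix}=\boldsymbol{q}-\boldsymbol{P}\widetilde{\boldsymbol{\Phi}}(\boldsymbol{I}+\widetilde{\boldsymbol{\Phi}}^{T}\boldsymbol{P}\widetilde{\boldsymbol{\Phi}})^{-1}\widetilde{\boldsymbol{\Phi}}^{T}\boldsymbol{q},\tag{10}$$

where

$$\boldsymbol{P}=\frac{1}{N}\begin{bmatrix}-1/\gamma&\boldsymbol{t}^{T}\\ \boldsymbol{t}&\gamma(N\boldsymbol{I}-\boldsymbol{t}\boldsymbol{t}^{T})\end{bmatrix},\quad\boldsymbol{q}=\frac{1}{N}\begin{bmatrix}\boldsymbol{t}^{T}\boldsymbol{1}\\ \gamma(N\boldsymbol{I}-\boldsymbol{t}\boldsymbol{t}^{T})\boldsymbol{1}\end{bmatrix}$$

and

$$\widetilde{\boldsymbol{\Phi}}=\begin{bmatrix}0&\dots&0\\ \boldsymbol{t}\circ\boldsymbol{\phi}_{1}&\dots&\boldsymbol{t}\circ\boldsymbol{\phi}_{M}\end{bmatrix},$$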

with $\boldsymbol{\phi}_{j}=[\phi_{j}(\boldsymbol{x}_{1};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}),\phi_{j}(\boldsymbol{x}_{2};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}),...,\phi_{j}(\boldsymbol{x}_{N};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})]^{T}$ , i.e., the vector of basis function values at the training inputs.

The new solution (10) only involves the matrix inverse of size $M\times M$ , which is superior to (6) where the inverse is of size $N\times N$ .
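A minimal sketch of how (10) might be evaluated with only an $M\times M$ solve is given below. It assumes the block forms $\boldsymbol{P}=\frac{1}{N}\begin{bmatrix}-1/\gamma&\boldsymbol{t}^{T}\\ \boldsymbol{t}&\gamma(N\boldsymbol{I}-\boldsymbol{t}\boldsymbol{t}^{T})\end{bmatrix}$ and $\boldsymbol{q}=\boldsymbol{P}[0,\boldsymbol{1}^{T}]^{T}$, and it applies $\boldsymbol{P}$ implicitly instead of storing it. The helper names `apply_P` and `solve_low_rank` are hypothetical.

```python
import numpy as np

def apply_P(V, t, gamma):
    """Multiply P by V (shape (N+1, k)) without forming the (N+1) x (N+1) matrix P."""
    N = t.shape[0]
    V = V.reshape(N + 1, -1)
    v0, V1 = V[:1, :], V[1:, :]           # first row and remaining N rows
    tV1 = t @ V1                          # t^T V1, shape (k,)
    top = -v0 / gamma + tV1               # shape (1, k)
    bottom = np.outer(t, v0[0]) + gamma * (N * V1 - np.outer(t, tV1))
    return np.vstack([top, bottom]) / N

def solve_low_rank(Phi, t, gamma):
    """Evaluate (10) from the N x M feature matrix Phi; returns (b, a)."""
    N, M = Phi.shape
    Phi_tilde = np.vstack([np.zeros((1, M)), t[:, None] * Phi])   # (N+1, M)
    rhs = np.concatenate(([0.0], np.ones(N)))[:, None]
    q = apply_P(rhs, t, gamma)                                    # (N+1, 1)
    P_Phi = apply_P(Phi_tilde, t, gamma)                          # (N+1, M)
    small = np.eye(M) + Phi_tilde.T @ P_Phi                       # M x M system
    sol = q - P_Phi @ np.linalg.solve(small, Phi_tilde.T @ q)
    return sol[0, 0], sol[1:, 0]
```

In this form the dominant cost is the product `Phi_tilde.T @ P_Phi`, which is $O(M^{2}N)$, followed by an $O(M^{3})$ solve.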

III The Proposed Model and Its Algorithm

In subsection II-C, we saw that the special choice of the low rank SBF kernel defined in (7) and (8) brings model efficiency. To extend the idea of using low rank kernels, in this section we propose a general framework for fast algorithms and validate it with several examples.

We would like to emphasize that our idea of using low rank kernel is inspired by the original low rank kernel approximation such as Nyström approximation [24]. However the standard low rank kernel methods aim to approximate a given kernel function, while our approach involves learning (basis) functions and constructs the kernel with composite structure in order to assist fast algorithms.

III-A The Low Rank Kernels and Models

Consider $M$ learnable “basis” functions

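$$\phi_{j}(\boldsymbol{x};\boldsymbol{\lambda}_{j}):\quad j=1,2,...,M.\tag{11}$$

with adaptable parameters $\boldsymbol{\lambda}_{j}$ ( $j=1,2,...,M$ ). In the case of SBF in (7), each basis function has in total $2D$ parameters

$$\boldsymbol{\lambda}_{j}=\{\boldsymbol{\mu}_{j},\boldsymbol{c}_{j}\}.$$

As another example, we will consider the so-called robust RBF

$$\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})=\exp\left\{-\sum_{i=1}^{D}\mu_{i,j}|x_{i}-c_{i,j}|\right\}.\tag{12}$$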

Similar to the SBF, while $c_{i,j}$ determines the location of $\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})$ in the $i$ th dimensional direction, $\mu_{i,j}$ restricts the sharpness of $\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})$ in the $i$ th dimension. In fact, the SBF (7) can be regarded as the first order approximation of the robust RBF in terms of $\exp\{-t\}=1-t+\frac{1}{2!}t^{2}+\cdots$ . We expect the robust RBF will have better modeling capability.
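The relation between the two basis functions can be checked numerically with a short sketch. The `(M, D)` layout of `mu` and `c` and the function names are assumptions made for illustration.

```python
import numpy as np

def weighted_l1(X, mu, c):
    """s[n, j] = sum_i mu[j, i] * |X[n, i] - c[j, i]|, with mu and c stored as (M, D)."""
    return (mu[None, :, :] * np.abs(X[:, None, :] - c[None, :, :])).sum(axis=-1)

def robust_rbf_features(X, mu, c):
    """Robust RBF (12): exp(-s)."""
    return np.exp(-weighted_l1(X, mu, c))

def sbf_features(X, mu, c):
    """SBF (7): max(0, 1 - s), the first-order expansion of exp(-s) clipped at zero."""
    return np.maximum(0.0, 1.0 - weighted_l1(X, mu, c))
```

Since $\exp\{-s\}\geq 1-s$ and $\exp\{-s\}>0$, the robust RBF never falls below the SBF, and for $s\geq 0$ the gap is bounded by $s^{2}/2$. Unlike the SBF, the robust RBF is strictly positive everywhere, so every basis function contributes to every input.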

More generally, each learnable basis function $\phi_{j}(\boldsymbol{x};\boldsymbol{\lambda}_{j})$ can be a deep neural network. We will leave this for further study.

Given a set of learnable basis functions (11), define a finite dimensional feature mapping

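$$\boldsymbol{\phi}_{r}:\boldsymbol{x}\in\mathbb{R}^{D}\to\boldsymbol{\phi}_{r}(\boldsymbol{x})=\begin{bmatrix}\phi_{1}(\boldsymbol{x};\boldsymbol{\lambda}_{1})\\ \vdots\\ \phi_{M}(\boldsymbol{x};\boldsymbol{\lambda}_{M})\end{bmatrix}\in\mathcal{F}.$$

This feature mapping naturally induces the following learnable low rank kernel

$$k(\boldsymbol{x}^{\prime},\boldsymbol{x}^{\prime\prime})=\sum_{j=1}^{M}\phi_{j}(\boldsymbol{x}^{\prime};\boldsymbol{\lambda}_{j})^{T}\phi_{j}(\boldsymbol{x}^{\prime\prime};\boldsymbol{\lambda}_{j}).$$

Consider the “linear” model $y(\boldsymbol{x})=\langle\boldsymbol{w},\boldsymbol{\phi}_{r}(\boldsymbol{x})\rangle+b$ and define the following low rank LSSVM (LR-LSSVM)

$$\min_{\boldsymbol{w},b,\boldsymbol{\eta}}\ \frac{1}{2}\|\boldsymbol{w}\|_{F}^{2}+\frac{\gamma}{2}\sum_{n=1}^{N}\eta_{n}^{2},\quad\text{s.t.}\ t_{n}(\langle\boldsymbol{w},\boldsymbol{\phi}_{r}(\boldsymbol{x}_{n})\rangle+b)=1-\eta_{n},\ n=1,\dots,N.\tag{14}$$

The LR-LSSVM problem takes the same form as the standard LSSVM (5); however, our low rank kernel carries a composition structure and is learnable through its adaptable parameters. In the following subsections, we propose a two-step alternating procedure to solve the LR-LSSVM.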

III-B Solving LR-LSSVM with Fixed Feature Mappings

When all the feature mappings $\phi_{j}(j=1,2,...,M)$ are fixed, problem (14) reduces to the standard LSSVM. Denote $\boldsymbol{\eta}=[\eta_{1},\eta_{2},...,\eta_{N}]^{T}$ and consider the Lagrangian function

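$$L(\boldsymbol{w},b,\boldsymbol{\eta};\boldsymbol{a})=\frac{1}{2}\|\boldsymbol{w}\|_{F}^{2}+\frac{\gamma}{2}\sum_{n=1}^{N}\eta_{n}^{2}-\sum_{n=1}^{N}a_{n}\left\{t_{n}(\langle\boldsymbol{w},\boldsymbol{\phi}_{r}(\boldsymbol{x}_{n})\rangle+b)-1+\eta_{n}\right\},$$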

where $\boldsymbol{a}=[a_{1},a_{2},...,a_{N}]^{T}$ are Lagrange multipliers for all the equality constraints. We now optimize out $\boldsymbol{w}$ , $b$ and $\boldsymbol{\eta}$ to give

[TABLE]

where

[TABLE]

Furthermore, setting the partial derivative with respect to each Lagrange multiplier gives

[TABLE]

Taking (15) into (19) we have

[TABLE]

After a long algebraic manipulation, the solution for the dual problem is given by

[TABLE]

Denote $\widetilde{\boldsymbol{\Phi}}$ the $(N+1)\times M$ matrix with one row of all zeros on the top of matrix $\text{diag}(\boldsymbol{t})\boldsymbol{\Phi}$ , then the solution can be expressed as

[TABLE]

Applying the matrix inversion formula to (20) results in exactly the same solution as (10). Once $\boldsymbol{a}$ and $b$ are worked out, the final model can be written as

[TABLE]

Define

[TABLE]

which can be calculated after $\boldsymbol{a}$ is known, then (21) can be expressed in a sparse form of size $M$

[TABLE]
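A short sketch of why the size-$M$ form is equivalent to the kernel form. It assumes that (21) is the kernel expansion $y(\boldsymbol{x})=\sum_{n}a_{n}t_{n}k(\boldsymbol{x},\boldsymbol{x}_{n})+b$ and that the sparse form is $y(\boldsymbol{x})=\sum_{j}\theta_{j}\phi_{j}(\boldsymbol{x};\boldsymbol{\lambda}_{j})+b$ with $\theta_{j}=\sum_{n}a_{n}t_{n}\phi_{j}(\boldsymbol{x}_{n};\boldsymbol{\lambda}_{j})$; the robust RBF is used as the basis function and all function names are hypothetical.

```python
import numpy as np

def robust_rbf_features(X, mu, c):
    """Phi[n, j] = exp(-sum_i mu[j, i] * |X[n, i] - c[j, i]|), mu and c stored as (M, D)."""
    return np.exp(-(mu[None, :, :] * np.abs(X[:, None, :] - c[None, :, :])).sum(axis=-1))

def compress(Phi_train, a, t):
    """theta_j = sum_n a_n t_n phi_j(x_n): M numbers that summarise the N multipliers."""
    return Phi_train.T @ (a * t)

def predict_sparse(X_new, mu, c, theta, b):
    """Sparse form: needs only the M basis functions, not the training set."""
    return robust_rbf_features(X_new, mu, c) @ theta + b

def predict_kernel(X_new, X_train, mu, c, a, t, b):
    """Kernel form: evaluates the low rank kernel against all N training points."""
    K_new = robust_rbf_features(X_new, mu, c) @ robust_rbf_features(X_train, mu, c).T
    return K_new @ (a * t) + b
```

After training, only `theta`, `b`, `mu` and `c` need to be stored, so inference costs $O(MD)$ per input regardless of $N$.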

III-C Training Learnable Low Rank Kernels

Given $\boldsymbol{a},b$ obtained from the closed-form solution in the first step, we estimate the kernel parameters $\boldsymbol{\lambda}_{j}$ ( $j=1,\dots,M$ ) using a gradient descent algorithm. The algorithm seeks to maximize the magnitude of the model outputs, which pushes the model outputs further away from the current decision boundary. Taking the robust RBF functions (12) as an example, this objective function can be expressed as

[TABLE]

Another objective function is

[TABLE]

which gives similar results as (23).

Denote $\text{sign}(\boldsymbol{y})=[\text{sign}(y(\boldsymbol{x}_{1})),\dots,\text{sign}(y(\boldsymbol{x}_{N}))]^{T}$ . Given the objective function above, we have

[TABLE]

in which

[TABLE]

where

[TABLE]

which are calculated by, for $i=1,...,D$ ,

[TABLE]

where $\phi_{j}(\boldsymbol{x};\boldsymbol{\mu}_{j},\boldsymbol{c}_{j})$ is defined in (12).

Meanwhile, we should also consider the positivity constraints for the shape parameters vector $\boldsymbol{\mu}_{j}$ and thus, we have the following constrained normalized gradient descent algorithm, which is, for $i=1,\dots,D$ ,

[TABLE]

where $\eta>0$ is a preset learning rate. By applying (24) to (29) to all $M$ Robust RBF units while keeping $b,\boldsymbol{a}$ to their current values and other RBF units constant, we manage to update all RBF kernels.
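The second step can be sketched as follows, under several loudly stated assumptions: the objective is taken to be $J=\sum_{n}|y(\boldsymbol{x}_{n})|$ with $y=\boldsymbol{\Phi}\boldsymbol{\Phi}^{T}(\boldsymbol{a}\circ\boldsymbol{t})+b$, the update is a normalised ascent step with $\boldsymbol{\mu}$ clipped at zero, and all units are updated at once rather than one at a time. The exact update rules of the text are not reproduced here; `mu` and `c` are stored as `(M, D)` arrays and all function names are hypothetical.

```python
import numpy as np

def features_and_diff(X, mu, c):
    """Robust RBF features Phi (N, M) and the differences X[n] - c[j] (N, M, D)."""
    diff = X[:, None, :] - c[None, :, :]
    Phi = np.exp(-(mu[None, :, :] * np.abs(diff)).sum(axis=-1))
    return Phi, diff

def abs_objective(X, t, a, b, mu, c):
    """Assumed objective: J = sum_n |y(x_n)| with a and b held fixed."""
    Phi, _ = features_and_diff(X, mu, c)
    return np.abs(Phi @ (Phi.T @ (a * t)) + b).sum()

def abs_objective_grads(X, t, a, b, mu, c):
    """Gradients of J with respect to mu and c, both of shape (M, D)."""
    Phi, diff = features_and_diff(X, mu, c)
    s = a * t
    theta = Phi.T @ s
    sgn = np.sign(Phi @ theta + b)
    # Phi appears twice in y = Phi Phi^T s + b, hence two terms in dJ/dPhi
    G = np.outer(sgn, theta) + np.outer(s, Phi.T @ sgn)
    GP = G * Phi
    # d phi_j / d mu_ij = -|x_i - c_ij| phi_j ; d phi_j / d c_ij = mu_ij sign(x_i - c_ij) phi_j
    grad_mu = -np.einsum('nj,nji->ji', GP, np.abs(diff))
    grad_c = mu * np.einsum('nj,nji->ji', GP, np.sign(diff))
    return grad_mu, grad_c

def ascent_step(mu, c, grad_mu, grad_c, lr, eps=1e-12):
    """One normalised ascent step per unit; mu is clipped so it stays non-negative."""
    g_mu = grad_mu / (np.linalg.norm(grad_mu, axis=1, keepdims=True) + eps)
    g_c = grad_c / (np.linalg.norm(grad_c, axis=1, keepdims=True) + eps)
    return np.maximum(mu + lr * g_mu, 0.0), c + lr * g_c
```

Normalising the gradient makes the step length depend only on the learning rate, which may explain why small rates are workable across datasets with very different scales; this reading is an interpretation, not a claim made in the text.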

III-D Initialization of Robust Radial Basis Functions

As is shown in (22), the model requires a preset kernel model size $M$ and a set of initial kernel parameters $\boldsymbol{\lambda}_{j}$ , $j=1,\dots,M$ . In the case of robust RBFs, both $\boldsymbol{c}_{j}$ and $\boldsymbol{\mu}_{j}$ need to be initialized. The center vectors $\boldsymbol{c}_{j}$ can be initialized using a clustering algorithm. We propose using the $k$ -medoids algorithm to obtain the Robust RBF centers, since it is more robust to unbalanced distributions of data. It divides the data points into $M$ subsets and iteratively adjusts the center $\boldsymbol{c}_{j}$ of each subset $S_{j}$ until convergence, while minimizing the clustering objective function given by

[TABLE]

where the centers $\boldsymbol{c}_{j}$ of each subset are the members of that subset. As for the initial values of the shaping parameters $\boldsymbol{\mu}_{j}$ , we preset $\mu_{i,j}$ as a predetermined constant for all basis functions, e.g., 1s.
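The initialization can be sketched with a simple alternating $k$ -medoids routine. The $L_{1}$ distance, the random choice of starting medoids, the iteration cap and the function name `k_medoids` are all assumptions made for this illustration; the clustering objective in the text may use a different distance.

```python
import numpy as np

def k_medoids(X, M, n_iter=50, seed=0):
    """Alternating k-medoids; returns M centers that are rows of X, plus cluster labels."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    dist = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)   # pairwise L1, (N, N)
    medoids = rng.choice(N, size=M, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)                # assign to nearest medoid
        new_medoids = medoids.copy()
        for j in range(M):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue                                        # keep the old medoid
            within = dist[np.ix_(members, members)].sum(axis=0)
            new_medoids[j] = members[within.argmin()]           # most central member
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return X[medoids], labels

def init_robust_rbf(X, M, mu0=1.0, seed=0):
    """Initial centers from k-medoids and a constant shape value for every mu[j, i]."""
    centers, _ = k_medoids(X, M, seed=seed)
    return np.full((M, X.shape[1]), mu0), centers
```

Because each center is an actual training point, the initial basis functions are guaranteed to respond to at least one sample.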

III-E The Overall Algorithm and Its Complexity

Algorithm 1 summarizes the overall procedure of LR-LSSVM using the example of the robust RBF kernel; the algorithm can easily be adapted to any learnable kernel. It starts with the $k$ -medoids clustering algorithm to initialize the robust RBF centres (Section III-D). Then the fast LSSVM solution (Section III-B) and the gradient descent algorithm (Section III-C or III-F) are applied alternately for a predefined number of iterations. A simple complexity analysis indicates that the overall computational complexity is $O(M^{2}N)$ , dominated by the gradient descent algorithm for training the learnable basis functions and scaled by the number of iterations. The examples in Section IV show that a small model size $M$ gives competitive prediction performance. In this sense, treating $M$ as a small constant, the newly proposed algorithm has a complexity of $O(N)$ . The lower complexity benefits from the special structure of low rank kernel functions. It should be pointed out again that the proposed framework contains the SBF model in [15] as a special case, and that the framework can be extended further, for example by using deep neural networks to learn the kernel functions.

III-F The Differentiable Objective Functions

The objective defined in (23) is non-differentiable. For the purpose of maximizing the magnitude of model outputs, we propose the following squared objective which is differentiable, for $j=1,2,...,M$ ,

[TABLE]

Then according to (21), we can write (31) as

[TABLE]

It is not hard to prove that

[TABLE]

and the chain rule gives

[TABLE]

where $\text{tr}()$ means the trace of matrix, $\nu$ means either $\mu_{i,j}$ or $c_{i,j}$ , and $\circ$ means the matrix element-wise product. Combining (33) and (34) gives

[TABLE]

where $\frac{\partial\mathbf{K}}{\partial\nu}$ is the matrix given by (24) and (25).
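As a hedged illustration of how the trace form of the chain rule can be applied, suppose (this is an assumption, not a restatement of (31)) that the squared objective is the sum of squared training outputs with $\boldsymbol{a}$ and $b$ held fixed. Writing $\boldsymbol{s}=\boldsymbol{a}\circ\boldsymbol{t}$ and $\boldsymbol{y}=\boldsymbol{K}\boldsymbol{s}+b\boldsymbol{1}$, where $\boldsymbol{s}$ and $\boldsymbol{y}$ are shorthand introduced only for this example, one would obtain

$$J=\sum_{n=1}^{N}y(\boldsymbol{x}_{n})^{2}=\|\boldsymbol{K}\boldsymbol{s}+b\boldsymbol{1}\|^{2},\qquad\frac{\partial J}{\partial\boldsymbol{K}}=2\,\boldsymbol{y}\boldsymbol{s}^{T},$$

$$\frac{\partial J}{\partial\nu}=\text{tr}\left(\left(\frac{\partial J}{\partial\boldsymbol{K}}\right)^{T}\frac{\partial\boldsymbol{K}}{\partial\nu}\right)=\boldsymbol{1}^{T}\left(\frac{\partial J}{\partial\boldsymbol{K}}\circ\frac{\partial\boldsymbol{K}}{\partial\nu}\right)\boldsymbol{1}=2\,\boldsymbol{y}^{T}\frac{\partial\boldsymbol{K}}{\partial\nu}\boldsymbol{s}.$$

Under this assumption the objective is smooth in $\boldsymbol{K}$, so no sign function appears in the gradient, in contrast to the absolute value objective.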

IV Experimental Studies

IV-A Example 1: Synthetic Dataset

For the synthetic dataset in [25], the dimension of the input space is $D=2$ , and the training and test sets contain 250 and 1000 samples respectively. In this example, three types of models are constructed and compared in terms of misclassification rate. For LSSVM with Gaussian RBF kernels, the steepness $\sigma$ ranges from 0.5 to 3 in steps of 0.5, while the shrinkage $\gamma$ is set to 5000 throughout. For the LR-LSSVM-SBF model, the parameters are preset to $M=4,\mu=0.2,T=100,\eta=0.02,\gamma=200$ . For our proposed LR-LSSVM-Robust RBF models with absolute value, squared and targeted objective functions, the parameters are set to $M=3,\mu=0.2,\gamma=150,\eta=0.0008,T=100$ ; $M=3,\mu=0.2,\gamma=20,\eta=0.0005,T=100$ and $M=3,\mu=0.2,\gamma=150,\eta=0.0008,T=100$ respectively.

From the classification results shown in TABLE I, the proposed LR-LSSVM-Robust RBF and LR-LSSVM-SBF models consistently achieve the lowest misclassification rates, around 8 $\%$ , with a model size of 3 or 4, while the Gaussian RBF kernel models perform worse (9.2 $\%$ to 11.4 $\%$ ) with a model size of 250. In Fig 1, we can see that the decision boundary of LSSVM with the Gaussian RBF kernel is relatively curvy and nonlinear, whereas those of SBF and Robust RBF take piecewise linear forms.

IV-B Example 2: Titanic Dataset

The Titanic dataset in [26] has 100 realizations, each with 150 training samples and 2051 test samples. The original data has an input dimension of $D=3$ . We compare the prediction accuracy of various Adaboost-based models and the LR-LSSVM models over the test samples. For the LR-LSSVM-SBF model, the parameters are set to $M=2,\mu=0.2,T=100,\eta=0.05,\gamma=50000$ , while for the proposed models with absolute value, squared and targeted objective functions, the parameters are set to $M=2$ , $\mu=0.03$ , $\gamma=50000$ , $\eta=0.0005$ , $T=100$ ; $M=3$ , $\mu=0.001$ , $\gamma=500000$ , $\eta=0.0001$ , $T=100$ and $M=2$ , $\mu=0.001$ , $\gamma=50000$ , $\eta=0.0001$ , $T=100$ respectively.

The results of the proposed models are shown in TABLE II (columns 2 & 3), together with the first six results quoted from [26] and the seventh result quoted from [15]. Generally, LR-LSSVM-SBF and the proposed LR-LSSVM models with the Robust RBF kernel outperform the other models, and all the LR-LSSVM models are sparse with only 2 terms (except for the model with the squared objective function, which uses 3). The LR-LSSVM models with absolute value and targeted objective functions give similar prediction results. Overall, the proposed models with absolute value and targeted objective functions perform the best, with the lowest misclassification rate and standard deviation, since the final model size of the Robust RBF kernels is only 2, which makes it easy for the models to explain the data.

IV-C Example 3: Diabetes Dataset

The diabetes data set in [26] has 100 realizations of training and test samples, each with 468 training samples and 300 test samples. The input dimension of this example is $D=8$ . As with the Titanic data set, we compare ten different models using the average misclassification rate as the metric. For the LR-LSSVM-SBF model, the parameters are set to $M=5,\mu=0.2,T=100,\eta=0.05,\gamma=50000$ , while for the proposed models with absolute value, squared and targeted objective functions, the parameters are set to $M=5,\mu=0.01,T=100,\eta=0.001,\gamma=50000$ ; $M=4$ , $\mu=0.001$ , $\gamma=50000$ , $\eta=0.001$ , $T=100$ ; and $M=5$ , $\mu=0.0001$ , $\gamma=50000$ , $\eta=0.001$ , $T=100$ , respectively.

The modeling results in TABLE II (columns 4 & 5) show that the proposed LR-LSSVM-Robust RBF models with absolute value and squared objective functions are competitive among the ten models, with classification accuracy ranking near the top. Moreover, the SBF kernel and the proposed Robust RBF kernel bring sparsity to the LR-LSSVM models, which considerably increases the speed of computation.

IV-D Example 4: German Credit Dataset

Similarly, the German credit data set in [26] has 100 realizations of training and test sets. Each realization contains 700 training samples and 300 test samples. The original data has 20 features. We evaluate the misclassification rate of our proposed models with various objective functions and of the LR-LSSVM-SBF model, along with six other models. For the LR-LSSVM-SBF model, we set $M=2$ , $\mu=0.005$ , $\gamma=200000$ , $\eta=0.003$ , $T=100$ , while for the proposed LR-LSSVM-Robust RBF models with absolute value, squared and targeted objective functions, the parameters are set to $M=2,\mu=0.005,\gamma=200000,\eta=0.003,T=100$ in all three cases.

The results of the four models are listed in TABLE II (columns 6 & 7), together with the first six comparison results quoted from [26]. For this data set, neither LR-LSSVM-SBF nor LR-LSSVM-Robust RBF performs as well as on the previous data sets. However, the prediction accuracy and the standard deviation are still comparable to those of the other models. Additionally, it can be seen that the model size of the four models is relatively small compared to the other models.

IV-E Summary

Overall, the proposed squared objective model performs well on the high dimensional data sets, which are the diabetes and German credit examples in our demonstration, whereas the proposed absolute value and targeted objective models are more suitable for low dimensional input, which are the synthetic and Titanic data sets in our case. Moreover, we observe no clear relation between the input dimension and the chosen model size: across the four examples, the finally selected $M$ does not vary systematically with $D$ .

V Conclusions

In this paper we have generalized a widely applied framework for the fast LR-LSSVM algorithm and then extended this idea to the novel robust RBF kernel. After the proposed kernel parameters are initialised with k-medoids clustering, the training algorithm alternates between two sub-algorithms: a fast closed-form least squares solution for $\boldsymbol{a},b$ and gradient descent for $\boldsymbol{c},\boldsymbol{\mu}$ . For the gradient descent step, three criteria are offered: two non-differentiable objective functions (absolute value and targeted) and one differentiable objective function (squared). The squared objective works better for high dimensional input, while the other two are better suited to low dimensional data. Finally, to demonstrate the effectiveness of the proposed algorithm, it is validated on a simple synthetic data set and several real-world data sets in comparison with other known approaches.
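The alternating structure summarised above can be illustrated with a short sketch. It is not the algorithm of this paper: it assumes a plain Gaussian feature map in place of the robust RBF kernel, a random choice of training points in place of the k-medoids initialisation, the squared objective only, and it updates the centres while keeping a fixed width `sigma` rather than learning $\boldsymbol{\mu}$ . All function and parameter names are hypothetical.

```python
import numpy as np

def features(X, C, sigma=1.0):
    """Placeholder feature map (assumption): Gaussian bumps at the M centres in C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # (N, M) squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def solve_weights(Phi, y, gamma):
    """Closed-form minimiser of 0.5*||w||^2 + 0.5*gamma*||y - Phi w - b||^2."""
    N, M = Phi.shape
    A = np.hstack([Phi, np.ones((N, 1))])  # last column carries the bias b
    R = np.eye(M + 1)
    R[-1, -1] = 0.0                        # the bias is not regularised
    theta = np.linalg.solve(A.T @ A + R / gamma, A.T @ y)
    return theta[:M], theta[M]

def squared_loss(X, y, C, w, b, sigma=1.0):
    """Mean squared objective 0.5 * mean((f(x_i) - y_i)^2)."""
    r = features(X, C, sigma) @ w + b - y
    return 0.5 * np.mean(r ** 2)

def centre_gradient(X, y, C, w, b, sigma=1.0):
    """Gradient of squared_loss with respect to the centres C, shape (M, D)."""
    Phi = features(X, C, sigma)
    r = Phi @ w + b - y                    # residuals, shape (N,)
    G = r[:, None] * Phi * w[None, :]      # (N, M)
    diff = X[:, None, :] - C[None, :, :]   # (N, M, D)
    return (G[:, :, None] * diff).sum(axis=0) / (sigma ** 2 * len(X))

def fit(X, y, M=3, gamma=100.0, eta=0.05, T=100, sigma=1.0, seed=0):
    """Alternate a closed-form solve for (w, b) with a gradient step on C."""
    rng = np.random.default_rng(seed)
    # Stand-in for the k-medoids initialisation: M distinct training points.
    C = X[rng.choice(len(X), size=M, replace=False)].astype(float)
    for _ in range(T):
        w, b = solve_weights(features(X, C, sigma), y, gamma)
        C = C - eta * centre_gradient(X, y, C, w, b, sigma)
    w, b = solve_weights(features(X, C, sigma), y, gamma)
    return C, w, b

def predict(X, C, w, b, sigma=1.0):
    """Class labels in {-1, +1} from the sign of the model output."""
    return np.where(features(X, C, sigma) @ w + b >= 0.0, 1, -1)
```

Because the model has only $M$ feature terms, the least squares step solves an $(M+1)\times(M+1)$ linear system rather than one whose size grows with the number of training samples, which is where a controlled model size pays off computationally.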

Bibliography

[1] B. Schölkopf and A. J. Smola, Learning with Kernels. MIT Press, 2002.
[2] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[3] T. S. Jaakkola and D. Haussler, “Exploiting generative models in discriminative classifiers,” Advances in Neural Information Processing Systems, pp. 487–493, 1998.
[4] F. Lotte, M. Congedo, A. Lécuyer, F. Lamarche, and B. Arnaldi, “A review of classification algorithms for EEG-based brain-computer interfaces,” J. of Neural Engineering, vol. 4, no. 2, pp. R1–13, 2007.
[5] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in Lecture Notes in Computer Science. Springer, 1998, vol. 1398, pp. 137–142.
[6] E. Gumus, N. Kilic, A. Sertbas, and O. N. Ucan, “Evaluation of face recognition techniques using PCA, Wavelets and SVM,” Expert Systems with Applications, vol. 37, no. 9, pp. 6404–6408, 2010.
[7] M. Pirooznia, J. Y. Yang, M. Q. Yang, and Y. Deng, “A comparative study of different machine learning methods on microarray gene expression data,” BMC Genomics, vol. 9, no. Suppl 1, 2008.
[8] J. Min and Y. Lee, “Bankruptcy prediction using support vector machine with optimal choice of kernel function parameters,” Expert Systems with Applications, vol. 28, no. 4, pp. 603–614, 2005.