Maximum Correntropy Criterion with Variable Center

Badong Chen; Xin Wang; Yingsong Li; Jose C. Principe

arXiv:1904.06501·stat.ML·July 24, 2019

TL;DR

This paper introduces a novel extension of the maximum correntropy criterion that allows the kernel center to vary, improving flexibility and performance in signal processing tasks.

Contribution

The paper proposes MCC-VC, an extended correntropy measure with a variable kernel center, along with an optimization approach for kernel parameters.

Findings

01. Enhanced regression performance in simulations

02. Flexible kernel positioning improves robustness

03. Efficient optimization of kernel parameters

Abstract

Correntropy is a local similarity measure defined in kernel space, and the maximum correntropy criterion (MCC) has been successfully applied in many areas of signal processing and machine learning in recent years. The kernel function in correntropy is usually restricted to the Gaussian function with its center located at zero. However, a zero-mean Gaussian function may not be a good choice for many practical applications. In this study, we propose an extended version of correntropy whose center can be located at any position. Accordingly, we propose a new optimization criterion called the maximum correntropy criterion with variable center (MCC-VC). We also propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear-in-parameters (LIP) models confirm the desirable performance of the new method.

Tables

Table I. RMSE and computing time (sec) of different criteria

Case	Metric	MMSE	MCC	MCC-VC
case 1)	RMSE	$1.2374 \pm 0.6840$	$0.0765 \pm 0.0422$	$0.0902 \pm 0.0547$
case 1)	Time (sec)	N/A	$1.2217 \pm 0.0269$	$0.0962 \pm 0.0023$
case 2)	RMSE	$1.2214 \pm 0.6441$	$0.1375 \pm 0.0737$	$0.0505 \pm 0.0272$
case 2)	Time (sec)	N/A	$1.2214 \pm 0.0253$	$0.0976 \pm 0.0024$
case 3)	RMSE	$1.2435 \pm 0.6218$	$0.0337 \pm 0.0168$	$0.0332 \pm 0.0165$
case 3)	Time (sec)	N/A	$1.2805 \pm 0.0613$	$0.0957 \pm 0.0032$
case 4)	RMSE	$1.1317 \pm 0.5763$	$0.1546 \pm 0.0762$	$0.0910 \pm 0.0441$
case 4)	Time (sec)	N/A	$1.2157 \pm 0.0249$	$0.0978 \pm 0.0022$

Table II. Training and testing RMSEs of three algorithms

Dataset	RELM training RMSE	RELM testing RMSE	ELM-RCC training RMSE	ELM-RCC testing RMSE	ELM-MCC-VC training RMSE	ELM-MCC-VC testing RMSE
Servo	$0.0600 \pm 0.0095$	$0.1088 \pm 0.0171$	$0.0831 \pm 0.0219$	$0.1064 \pm 0.0165$	$0.0835 \pm 0.0225$	$0.1029 \pm 0.0179$
Airfoil	$0.0974 \pm 0.0074$	$0.1031 \pm 0.0077$	$0.0942 \pm 0.0022$	$0.0997 \pm 0.0028$	$0.0812 \pm 0.0038$	$0.0923 \pm 0.0054$
Concrete	$0.0738 \pm 0.0021$	$0.0965 \pm 0.0055$	$0.0823 \pm 0.0025$	$0.0945 \pm 0.0034$	$0.0642 \pm 0.0033$	$0.0927 \pm 0.0049$
Housing	$0.0439 \pm 0.0042$	$0.0921 \pm 0.0137$	$0.0442 \pm 0.0042$	$0.0907 \pm 0.0138$	$0.0455 \pm 0.0040$	$0.0903 \pm 0.0137$
Yacht	$0.0366 \pm 0.0093$	$0.0823 \pm 0.0090$	$0.0575 \pm 0.0023$	$0.0769 \pm 0.0053$	$0.0041 \pm 0.0003$	$0.0232 \pm 0.0105$
Wine-red	$0.1205 \pm 0.0036$	$0.1350 \pm 0.0044$	$0.1171 \pm 0.0027$	$0.1309 \pm 0.0035$	$0.1209 \pm 0.0025$	$0.1299 \pm 0.0031$
Slump	$0.0081 \pm 0.0011$	$0.0461 \pm 0.0095$	$0.0000 \pm 0.0000$	$0.0433 \pm 0.0102$	$0.0000 \pm 0.0000$	$0.0412 \pm 0.0106$

Table III. Specification of the datasets

Dataset	Features	Training observations	Testing observations
Servo	5	83	83
Airfoil	5	751	751
Concrete	9	515	515
Housing	14	253	253
Yacht	6	154	154
Wine-red	12	799	799
Slump	10	52	51

Equations

$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,E[G_{\sigma}(e)] \tag{1}$$

$$G_{\sigma}(e)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{e^{2}}{2\sigma^{2}}\right) \tag{2}$$

$$V_{\sigma,c}(T,Y)=E[G_{\sigma}(e-c)]=E\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(e-c)^{2}}{2\sigma^{2}}\right)\right] \tag{3}$$

$$V_{\sigma,c}(T,Y)=\frac{1}{\sqrt{2\pi}\sigma}\sum_{n=0}^{\infty}\frac{(-1)^{n}}{2^{n}n!}E\left[\frac{(e-c)^{2n}}{\sigma^{2n}}\right] \tag{4}$$

$$\lim_{\sigma\to 0}V_{\sigma,c}(T,Y)=\iint\delta(t-y-c)\,p_{TY}(t,y)\,dt\,dy=\int_{-\infty}^{\infty}p_{TY}(t,t-c)\,dt \tag{5}$$

$$\lim_{\sigma\to 0}V_{\sigma,c}(T,Y)=\int\delta(\varepsilon-c)\,p_{e}(\varepsilon)\,d\varepsilon=p_{e}(c) \tag{6}$$

$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma,c}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,E[G_{\sigma}(e-c)] \tag{7}$$

$$y_{i}=\bm{h}_{i}\bm{\beta},\quad i=1,2,\cdots,N \tag{8}$$

$$J_{MMSE}(\bm{\beta})=\|\mathbf{T}-\mathbf{Y}\|^{2}+\lambda\|\bm{\beta}\|^{2} \tag{9}$$

$$\bm{\beta}^{*}=(\mathbf{H}^{T}\mathbf{H}+\lambda\mathbf{I})^{-1}\mathbf{H}^{T}\mathbf{T} \tag{10}$$

$$J_{MCC-VC}(\bm{\beta})=-\frac{1}{N}\sum_{i=1}^{N}G_{\sigma}(e_{i}-c)+\lambda\|\bm{\beta}\|^{2} \tag{11}$$

$$\bm{\beta}^{*}=\left[\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{H}+\lambda^{\prime}\mathbf{I}\right]^{-1}\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{T}^{\prime} \tag{12}$$

$$\bm{\beta}_{k}=\left(\left[\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{H}+\lambda^{\prime}\mathbf{I}\right]^{-1}\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{T}^{\prime}\right)_{\bm{\beta}_{k-1}} \tag{13}$$

$$V_{\sigma,c}(T,Y)=\frac{1}{2}\int\left[G_{\sigma}(\varepsilon-c)\right]^{2}d\varepsilon+\frac{1}{2}\int\left[p_{e}(\varepsilon)\right]^{2}d\varepsilon-\frac{1}{2}\int\left[G_{\sigma}(\varepsilon-c)-p_{e}(\varepsilon)\right]^{2}d\varepsilon \tag{14}$$

$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma,c}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,U_{\sigma,c}(T,Y) \tag{15}$$

$$(M^{*},\sigma^{*},c^{*})=\underset{M\in\mathcal{M},\,\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\max}\,U_{\sigma,c}(T,Y) \tag{16}$$

$$(\sigma^{*},c^{*})=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\int\left[G_{\sigma}(\varepsilon-c)\right]^{2}d\varepsilon-2E[G_{\sigma}(e-c)]\right\}=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\frac{1}{2\sigma\sqrt{\pi}}-2E[G_{\sigma}(e-c)]\right\} \tag{17}$$

$$(\sigma^{*},c^{*})=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\frac{1}{2\sigma\sqrt{\pi}}-\frac{2}{N}\sum_{i=1}^{N}G_{\sigma}(e_{i}-c)\right\} \tag{18}$$


Full text

Maximum Correntropy Criterion with Variable Center

Badong Chen, Xin Wang, Yingsong Li, and Jose C. Principe

This work was supported by National Key R&D Program of China (No. 2017YFB1002501), 973 Program (No. 2015CB351703) and National NSF of China (No. 91648208, No. U1613219).

Badong Chen and Xin Wang ([email protected] and wangxin0420@stu.xjtu.edu.cn) are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, Shaanxi, China.

Yingsong Li ([email protected]) is with the College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China and also with the National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China.

Jose C. Principe ([email protected]) is with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, Shaanxi, China and also with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA.

Index Terms:

Correntropy, maximum correntropy criterion (MCC), maximum correntropy criterion with variable center (MCC-VC), robust learning.

I Introduction

One of the most important problems in machine learning is how to approximate a target random variable ( $T$ ) given another ( $Y$ ). This is a central problem in supervised learning, where we design a model ( $M$ ) that receives a random variable $X$ and outputs $Y$ that should approximate $T$ in some sense. This requires defining a loss function (or a similarity measure) to compare $Y$ with $T$ . The minimum mean square error (MMSE) criterion is widely used, where the loss function is $E\left[{{e^{2}}}\right]$ , with $e=T-Y$ being the error variable and $E[\cdot]$ the expectation operator. The MMSE criterion is generally computationally simple and mathematically tractable, but its learning performance may degrade seriously when non-Gaussian noise is present in the variables [1].

To improve the learning performance under non-Gaussian noise, a variety of non-MMSE criteria have been proposed in the literature [1, 2, 3, 4, 5, 6, 7, 8]. In recent years, the maximum correntropy criterion (MCC) has found many successful applications in signal processing and machine learning, and it is particularly useful when the signals are contaminated by heavy-tailed impulsive noise [9, 10, 11, 12, 13, 14, 15]. Under the MCC, an optimal model can be obtained by maximizing the correntropy between the target variable $T$ and the output $Y$ [4]:

$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,E[G_{\sigma}(e)] \tag{1}$$

where ${M^{*}}$ is the optimal model, $\mathcal{M}$ stands for the model's hypothesis space, and ${V_{\sigma}}(T,Y)=E[{G_{\sigma}}(e)]$ denotes the correntropy between $T$ and $Y$ , with ${G_{\sigma}}(e)$ being the Gaussian kernel function:

$$G_{\sigma}(e)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{e^{2}}{2\sigma^{2}}\right) \tag{2}$$

where $\sigma$ is the kernel bandwidth. Since the Gaussian kernel function ${G_{\sigma}}(e)$ is a local function of the error variable $e$ , the correntropy can be used as an outlier-robust error measure in signal processing and machine learning [1]. However, the center of the Gaussian kernel in correntropy is always located at zero, which may not be a good choice for many practical situations. In particular, when the error distribution is non-zero-mean, the original correntropy may perform poorly, because in this case the zero-mean Gaussian function usually cannot match the error distribution well. The goal of the present paper is thus to extend the correntropy to the case where the center can be located anywhere, an extension that can potentially improve the learning performance significantly but is still not fully appreciated in the community.

The rest of the paper is organized as follows. In section II, we define the correntropy with variable center and propose the maximum correntropy criterion with variable center (MCC-VC). In section III, we propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear-in-parameters (LIP) models are then presented in section IV. Finally, the conclusion is given in section V.

II Maximum Correntropy Criterion with Variable Center

In this work, we define the correntropy with variable center between $T$ and $Y$ as follows:

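$$V_{\sigma,c}(T,Y)=E[G_{\sigma}(e-c)]=E\left[\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(e-c)^{2}}{2\sigma^{2}}\right)\right] \tag{3}$$

where $c\in\mathbb{R}$ is the center location. The above definition reduces to the original correntropy ${V_{\sigma}}(T,Y)$ when $c=0$ .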
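As a rough illustration of definition (3), the expectation can be replaced by a sample mean over error samples. The sketch below is not the authors' code: the function names `gaussian_kernel` and `correntropy_vc` and the synthetic data are assumptions made for this example. It shows that a kernel centered near the bulk of the errors yields a much larger value than the zero-centered kernel when the errors are non-zero-mean.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel G_sigma(u) with bandwidth sigma."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def correntropy_vc(t, y, sigma, c=0.0):
    """Sample estimate of V_{sigma,c}(T, Y) = E[G_sigma(e - c)] with e = t - y."""
    e = np.asarray(t, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(e - c, sigma)))

# Synthetic data (assumed for illustration): errors concentrated near e = 3.
rng = np.random.default_rng(0)
t = rng.normal(size=1000)
y = t - 3.0 + 0.1 * rng.normal(size=1000)

v_zero = correntropy_vc(t, y, sigma=1.0, c=0.0)   # original correntropy (c = 0)
v_shift = correntropy_vc(t, y, sigma=1.0, c=3.0)  # kernel centered on the errors
```

Calling `correntropy_vc` without `c` gives the original correntropy, matching the reduction to $c=0$ stated above.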

Similar to the original correntropy [4], the correntropy with center $c$ also involves all the even moments of the error $e=T-Y$ about the center $c$ , that is

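$$V_{\sigma,c}(T,Y)=\frac{1}{\sqrt{2\pi}\sigma}\sum_{n=0}^{\infty}\frac{(-1)^{n}}{2^{n}n!}E\left[\frac{(e-c)^{2n}}{\sigma^{2n}}\right] \tag{4}$$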

As $\sigma$ increases, the high-order moments about the center $c$ will decay faster, hence the second-order moment tends to dominate the value. Particularly, when $c=E[e]$ and $\sigma\to\infty$ , maximizing the correntropy with center $c$ will be equivalent to minimizing the error’s variance.
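The large-bandwidth claim can be made explicit by keeping only the $n=0$ and $n=1$ terms of the series expansion. The following worked step is a sketch that assumes the error has finite moments, so that the remaining terms are of higher order in $1/\sigma$:

```latex
V_{\sigma,c}(T,Y)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \left(1 - \frac{E\left[(e-c)^{2}\right]}{2\sigma^{2}}\right)
    + O\!\left(\sigma^{-5}\right),
  \qquad \sigma \to \infty .
```

For a fixed large $\sigma$, maximizing this expression over the model is therefore approximately the same as minimizing $E[(e-c)^{2}]$, and with $c=E[e]$ this quantity is exactly the variance of the error.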

In addition, when the Gaussian kernel shrinks to zero ( $\sigma\to 0$ ), the correntropy with center $c$ approaches the value of $\int_{-\infty}^{\infty}{{p_{TY}}(t,t-c)dt}$ , where ${p_{TY}}(t,y)$ is the joint probability density function (PDF) of $(T,Y)$ . This can easily be proved as follows

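$$\lim_{\sigma\to 0}V_{\sigma,c}(T,Y)=\iint\delta(t-y-c)\,p_{TY}(t,y)\,dt\,dy=\int_{-\infty}^{\infty}p_{TY}(t,t-c)\,dt \tag{5}$$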

where $\delta(.)$ denotes the Dirac delta function. In this case, we also have

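$$\lim_{\sigma\to 0}V_{\sigma,c}(T,Y)=\int\delta(\varepsilon-c)\,p_{e}(\varepsilon)\,d\varepsilon=p_{e}(c) \tag{6}$$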

Therefore, when $\sigma\to 0$ , the correntropy with center $c$ will also approach the value of ${p_{e}}(\varepsilon)$ evaluated at $\varepsilon=c$ , where ${p_{e}}(.)$ denotes the error’s PDF.
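This limit can be checked numerically: the sample mean of $G_{\sigma}(e_{i}-c)$ is a Parzen (kernel density) estimate of the error PDF at $c$, so for a small bandwidth and many samples it should be close to $p_{e}(c)$. The sketch below assumes errors drawn from $\mathcal{N}(3,1)$ purely for illustration; the function name and the sample size are not from the paper.

```python
import numpy as np

def correntropy_vc_from_errors(e, sigma, c):
    """Sample mean of G_sigma(e_i - c), i.e. a Parzen estimate of p_e at c."""
    u = np.asarray(e, dtype=float) - c
    k = np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.mean(k))

rng = np.random.default_rng(1)
errors = rng.normal(loc=3.0, scale=1.0, size=200_000)  # assumed error law N(3, 1)

true_pdf_at_3 = 1.0 / np.sqrt(2.0 * np.pi)  # density of N(3, 1) evaluated at 3
estimate = correntropy_vc_from_errors(errors, sigma=0.05, c=3.0)
```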

The optimal model under the maximum correntropy criterion with variable center (MCC-VC) is defined by

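$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma,c}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,E[G_{\sigma}(e-c)] \tag{7}$$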

To demonstrate how to obtain the optimal solution with finite training samples (by optimizing an empirical risk function), we consider the following linear-in-parameters (LIP) model:

$$y_{i}=\bm{h}_{i}\bm{\beta},\quad i=1,2,\cdots,N \tag{8}$$

where $\{{\bm{x}_{i}},{y_{i}}\}_{i=1}^{N}$ are the $N$ input-output samples, ${\bm{h_{i}}}=\left[{{\phi_{1}}({\bm{x}_{i}}),{\phi_{2}}({\bm{x}_{i}}),\cdots,{\phi_{\tilde{N}}}({\bm{x}_{i}})}\right]\in{\textbf{R}^{\tilde{N}}}$ is the $i$ -th nonlinearly mapped input vector (a row vector), with ${\phi_{j}}(.)$ being the $j$ -th nonlinear mapping function $(j=1,2,\cdots\tilde{N})$ , and $\bm{\beta}={\left[{{\beta_{1}},{\beta_{2}},\cdots,{\beta_{\tilde{N}}}}\right]^{T}}\in{\textbf{R}^{\tilde{N}}}$ is the output weight vector that needs to be learned. Given $N$ target samples $\{{t_{i}}\}_{i=1}^{N}$ , the output weight vector $\bm{\beta}$ can be trained by minimizing the following regularized MMSE cost:

$$J_{MMSE}(\bm{\beta})=\|\mathbf{T}-\mathbf{Y}\|^{2}+\lambda\|\bm{\beta}\|^{2} \tag{9}$$

where $\textbf{Y}={\left[{{y_{1}},{y_{2}},\cdots,{y_{N}}}\right]^{T}}$ , $\textbf{T}={\left[{{t_{1}},{t_{2}},\cdots,{t_{N}}}\right]^{T}}$ , and $\lambda\geq 0$ is the regularization parameter. In this case, the optimal solution can easily be obtained as

$$\bm{\beta}^{*}=(\mathbf{H}^{T}\mathbf{H}+\lambda\mathbf{I})^{-1}\mathbf{H}^{T}\mathbf{T} \tag{10}$$
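The closed-form regularized MMSE solution (10) can be sketched in a few lines of NumPy. The function name `ridge_solution` and the toy data are assumptions for this example; a linear solve is used instead of an explicit matrix inverse, which is a common numerical choice rather than something stated in the paper.

```python
import numpy as np

def ridge_solution(H, T, lam):
    """Solve (H^T H + lam I) beta = H^T T for the output weight vector beta."""
    n_features = H.shape[1]
    A = H.T @ H + lam * np.eye(n_features)
    return np.linalg.solve(A, H.T @ T)

# Toy check (assumed data): noise-free targets generated by known weights.
rng = np.random.default_rng(0)
H_demo = rng.normal(size=(50, 2))
w_demo = np.array([1.0, 2.0])
beta_demo = ridge_solution(H_demo, H_demo @ w_demo, lam=0.0)
```

With `lam=0.0` and noise-free targets the known weights are recovered; a positive `lam` shrinks the solution toward zero.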

where $\mathbf{H}=[h_{ij}]$ is an $N\times\tilde{N}$ dimensional matrix with ${h_{ij}}={\phi_{j}}({\bm{x}_{i}})$ . Similarly, one can solve $\bm{\beta}$ by minimizing the following regularized MCC-VC cost:

$$J_{MCC-VC}(\bm{\beta})=-\frac{1}{N}\sum_{i=1}^{N}G_{\sigma}(e_{i}-c)+\lambda\|\bm{\beta}\|^{2} \tag{11}$$

where ${e_{i}}={t_{i}}-{y_{i}}={t_{i}}-\bm{{h_{i}}\beta}$ is the $i$ -th error sample. Setting $\frac{\partial}{{\partial\bm{\beta}}}{J_{MCC{\rm{-}}VC}}(\bm{\beta})=0$ , one can derive

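$$\bm{\beta}^{*}=\left[\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{H}+\lambda^{\prime}\mathbf{I}\right]^{-1}\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{T}^{\prime} \tag{12}$$

where $\lambda^{\prime}=2N\lambda$ , $\mathbf{T}^{\prime}={[{t_{1}}-c,{t_{2}}-c,\ldots,{t_{N}}-c]^{T}}$ , and $\mathbf{\Lambda}$ is a diagonal matrix with diagonal elements ${\mathbf{\Lambda}_{ii}}={G_{\sigma}}({e_{i}}-c)$ .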

The solution (12) is a fixed-point equation since the diagonal matrix ${\bf{\Lambda}}$ on the right-hand side depends on the weight vector $\bm{\beta}$ via ${e_{i}}={t_{i}}-\bm{{h_{i}}\beta}$ . Therefore, the optimal solution under MCC-VC can be solved by using the following fixed-point iteration:

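$$\bm{\beta}_{k}=\left(\left[\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{H}+\lambda^{\prime}\mathbf{I}\right]^{-1}\mathbf{H}^{T}\mathbf{\Lambda}\mathbf{T}^{\prime}\right)_{\bm{\beta}_{k-1}} \tag{13}$$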

where ${\bm{\beta}_{k}}$ is the estimated weight vector at the $k$ -th iteration.
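A minimal sketch of the fixed-point iteration for fixed $\sigma$ and $c$ is given below. It is an illustration under stated assumptions, not the authors' implementation: the function name `mcc_vc_fixed_point`, the default `lam_prime`, and the synthetic data (a two-dimensional linear model with $\mathcal{N}(3,1)$ inner noise and 10% large-variance outliers) are chosen for this example, and `lam_prime` is passed in directly as the regularization constant of the fixed-point equation.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel G_sigma(u) with bandwidth sigma."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def mcc_vc_fixed_point(H, T, sigma, c, lam_prime=1e-6, n_iter=100, beta0=None):
    """Iterate beta_k = [H^T L H + lam' I]^{-1} H^T L T', with L built from beta_{k-1}."""
    n_features = H.shape[1]
    beta = np.zeros(n_features) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    for _ in range(n_iter):
        e = T - H @ beta                    # current error samples
        w = gaussian_kernel(e - c, sigma)   # diagonal entries of Lambda
        HtL = H.T * w                       # H^T Lambda without forming the diagonal matrix
        A = HtL @ H + lam_prime * np.eye(n_features)
        beta = np.linalg.solve(A, HtL @ (T - c))  # T' = T - c
    return beta

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
N = 400
X = rng.uniform(-2.0, 2.0, size=(N, 2))
w_true = np.array([1.0, 2.0])
is_outlier = rng.random(N) < 0.1
noise = np.where(is_outlier, rng.normal(0.0, 100.0, N), rng.normal(3.0, 1.0, N))
T = X @ w_true + noise

beta = mcc_vc_fixed_point(X, T, sigma=1.0, c=3.0)
```

Because outliers receive kernel weights close to zero, each iteration is a weighted least-squares fit dominated by the samples whose errors lie near the center $c$.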

III Optimization of the Free Parameters in MCC-VC

There are two free parameters in MCC-VC, namely the kernel width $\sigma$ and the center location $c$ , whose values have significant influence on the learning performance. In this section, we propose an efficient approach to optimize the two parameters. First, we divide the correntropy with center $c$ into three terms:

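$$V_{\sigma,c}(T,Y)=\frac{1}{2}\int\left[G_{\sigma}(\varepsilon-c)\right]^{2}d\varepsilon+\frac{1}{2}\int\left[p_{e}(\varepsilon)\right]^{2}d\varepsilon-\frac{1}{2}\int\left[G_{\sigma}(\varepsilon-c)-p_{e}(\varepsilon)\right]^{2}d\varepsilon \tag{14}$$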

Since the first term is independent of the model $M$ , we have

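$$M^{*}=\underset{M\in\mathcal{M}}{\arg\max}\,V_{\sigma,c}(T,Y)=\underset{M\in\mathcal{M}}{\arg\max}\,U_{\sigma,c}(T,Y) \tag{15}$$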

where ${U_{\sigma,c}}(T,Y)=\int{{{\left[{{p_{e}}(\varepsilon)}\right]}^{2}}d\varepsilon}-\int{{{\left[{{G_{\sigma}}(\varepsilon-c)-{p_{e}}(\varepsilon)}\right]}^{2}}d\varepsilon}$ . Then we propose the following optimization:

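$$(M^{*},\sigma^{*},c^{*})=\underset{M\in\mathcal{M},\,\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\max}\,U_{\sigma,c}(T,Y) \tag{16}$$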

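where $\mathcal{S}$ and $\mathcal{C}$ denote the admissible sets of parameters $\sigma$ and $c$ . Thus, the model $M$ , the kernel width $\sigma$ and the center location $c$ are jointly optimized to maximize the function ${U_{\sigma,c}}(T,Y)$ . To simplify the optimization, we adopt an alternating optimization approach:

i) When the model is fixed (hence the error’s distribution is fixed), the term $\int{{{\left({{p_{e}}(\varepsilon)}\right)}^{2}}d\varepsilon}$ is independent of $\sigma$ and $c$ . In this case the two free parameters can simply be optimized by

$$(\sigma^{*},c^{*})=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\int\left[G_{\sigma}(\varepsilon-c)\right]^{2}d\varepsilon-2E[G_{\sigma}(e-c)]\right\}=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\frac{1}{2\sigma\sqrt{\pi}}-2E[G_{\sigma}(e-c)]\right\} \tag{17}$$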

ii) After the parameters have been determined, the model $M$ can then be optimized by maximizing the function (16) or (14) with $\sigma=\sigma^{*}$ and $c=c^{*}$ .

The above procedure can be repeated until convergence.
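The two steps behind (17) can be written out as a short worked derivation, using only the Gaussian kernel definition and the identity $\int G_{\sigma}(\varepsilon-c)\,p_{e}(\varepsilon)\,d\varepsilon=E[G_{\sigma}(e-c)]$. First the squared distance is expanded, then the integral of the squared kernel is evaluated:

```latex
\int \left[G_{\sigma}(\varepsilon-c)-p_{e}(\varepsilon)\right]^{2} d\varepsilon
  = \int \left[G_{\sigma}(\varepsilon-c)\right]^{2} d\varepsilon
    - 2\,E\left[G_{\sigma}(e-c)\right]
    + \int \left[p_{e}(\varepsilon)\right]^{2} d\varepsilon ,

\int \left[G_{\sigma}(\varepsilon-c)\right]^{2} d\varepsilon
  = \frac{1}{2\pi\sigma^{2}} \int \exp\left(-\frac{(\varepsilon-c)^{2}}{\sigma^{2}}\right) d\varepsilon
  = \frac{1}{2\pi\sigma^{2}} \cdot \sigma\sqrt{\pi}
  = \frac{1}{2\sigma\sqrt{\pi}} .
```

Since $\int [p_{e}(\varepsilon)]^{2} d\varepsilon$ does not depend on $\sigma$ or $c$ when the model is fixed, minimizing the distance between the kernel and the error PDF leaves only the first two terms, and the first of them depends on $\sigma$ alone.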

From (17), one can see that the parameters $\sigma$ and $c$ are optimized such that the Gaussian kernel function ${G_{\sigma}}(\varepsilon-c)$ matches the error’s PDF ${p_{e}}(\varepsilon)$ as closely as possible. This is in principle consistent with our intuition. The idea of PDF matching has been explored with great success in the literature of information theoretic learning (ITL) [1, 16, 17, 18]. Given $N$ error samples $\{{e_{i}}\}_{i=1}^{N}$ , we have $E\left[{{G_{\sigma}}(e-c)}\right]\approx\frac{1}{N}\sum\limits_{i=1}^{N}{{G_{\sigma}}({e_{i}}-c)}$ . It follows that

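$$(\sigma^{*},c^{*})=\underset{\sigma\in\mathcal{S},\,c\in\mathcal{C}}{\arg\min}\left\{\frac{1}{2\sigma\sqrt{\pi}}-\frac{2}{N}\sum_{i=1}^{N}G_{\sigma}(e_{i}-c)\right\} \tag{18}$$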

Remark: There are several approaches to solve the optimization problem in (18). For example, we can use a gradient-based method to search for the solution. In many practical situations, it is sufficient to search for the optimal solution over a given finite set. To further simplify the computation, one can just set the parameter $c$ to the mean or median value of the error samples, and only optimize the kernel width $\sigma$ .
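The finite-set search mentioned in the remark can be sketched as an exhaustive evaluation of the sample cost in (18) over a grid of bandwidths and centers. The function name `select_sigma_and_center` and the synthetic error samples are assumptions for this example; the grids mirror the ranges $[0.2,5.0]$ and $[-5.0,5.0]$ used later in the linear regression experiment.

```python
import numpy as np

def select_sigma_and_center(errors, sigma_grid, center_grid):
    """Return the (sigma, c) pair on the grids that minimizes the sample cost in (18)."""
    e = np.asarray(errors, dtype=float)
    cen = np.asarray(center_grid, dtype=float)
    u = e[None, :] - cen[:, None]          # u[j, i] = e_i - c_j, shape (C, N)
    best_cost, best_sigma, best_c = np.inf, None, None
    for s in np.asarray(sigma_grid, dtype=float):
        # (1/N) sum_i G_s(e_i - c_j) for every candidate center c_j
        kernel_mean = np.mean(np.exp(-u**2 / (2.0 * s**2)), axis=1) / (np.sqrt(2.0 * np.pi) * s)
        cost = 1.0 / (2.0 * s * np.sqrt(np.pi)) - 2.0 * kernel_mean
        j = int(np.argmin(cost))
        if cost[j] < best_cost:
            best_cost, best_sigma, best_c = float(cost[j]), float(s), float(cen[j])
    return best_sigma, best_c

# Error samples assumed to follow N(3, 1) for illustration.
rng = np.random.default_rng(0)
errors = rng.normal(3.0, 1.0, size=2000)
sigma_grid = np.linspace(0.2, 5.0, 25)     # step 0.2
center_grid = np.linspace(-5.0, 5.0, 101)  # step 0.1

sigma_star, c_star = select_sigma_and_center(errors, sigma_grid, center_grid)
```

For Gaussian errors the selected kernel should roughly reproduce the error PDF, so `c_star` is expected near 3 and `sigma_star` near 1 in this setup.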

Based on the above parameter optimization strategy, a robust regression algorithm with LIP models under MCC-VC can be obtained, which is referred to as the LIP-MCC-VC and is described in Algorithm 1.
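One possible reading of the alternating procedure is sketched below. It is not a reproduction of Algorithm 1: it uses the simplification from the remark (center set to the median of the current errors, bandwidth picked from a finite grid via the sample cost in (18)), followed by one fixed-point update of the weights per pass. The function names, the default `lam_prime`, the iteration count, and the synthetic data are all assumptions for this example.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel G_sigma(u) with bandwidth sigma."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def select_sigma(errors, c, sigma_grid):
    """Pick sigma from a finite grid by minimizing the sample cost in (18) for fixed c."""
    costs = [1.0 / (2.0 * s * np.sqrt(np.pi)) - 2.0 * np.mean(gaussian_kernel(errors - c, s))
             for s in sigma_grid]
    return float(sigma_grid[int(np.argmin(costs))])

def lip_mcc_vc(H, T, sigma_grid, lam_prime=1e-6, n_iter=50):
    """Alternate between updating (sigma, c) from the errors and updating beta."""
    n_features = H.shape[1]
    beta = np.zeros(n_features)
    sigma, c = None, None
    for _ in range(n_iter):
        e = T - H @ beta
        c = float(np.median(e))                 # simplified center choice
        sigma = select_sigma(e, c, sigma_grid)  # bandwidth from the finite set
        w = gaussian_kernel(e - c, sigma)       # diagonal entries of Lambda
        HtL = H.T * w
        beta = np.linalg.solve(HtL @ H + lam_prime * np.eye(n_features), HtL @ (T - c))
    return beta, sigma, c

# Synthetic data (assumed): linear model, N(3, 1) inner noise, 10% outliers.
rng = np.random.default_rng(0)
N = 400
X = rng.uniform(-2.0, 2.0, size=(N, 2))
w_true = np.array([1.0, 2.0])
is_outlier = rng.random(N) < 0.1
noise = np.where(is_outlier, rng.normal(0.0, 100.0, N), rng.normal(3.0, 1.0, N))
T = X @ w_true + noise

beta_hat, sigma_hat, c_hat = lip_mcc_vc(X, T, np.linspace(0.2, 5.0, 25))
```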

IV Simulation Results

In this section, we present simulation results of regression with LIP models to demonstrate the performance of the proposed method. We consider two LIP models: one is the linear regression model and the other is the extreme learning machine (ELM) [19, 20, 21, 22], a kind of single-hidden-layer feedforward neural network (SLFN) in which the input weights and biases of the hidden layer are randomly generated, and only the weights of the output layer need to be trained.
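To make the link between an ELM and the LIP form concrete, the sketch below builds a hidden-layer output matrix $\mathbf{H}$ from randomly generated input weights and biases; only the output weights would then be trained on $\mathbf{H}$. The sigmoid activation, the uniform range $[-1,1]$ for the random parameters, and the function name `elm_hidden_layer` are assumptions for this example and are not specified in the text.

```python
import numpy as np

def elm_hidden_layer(X, n_hidden, rng):
    """Map inputs X (N x d) to hidden-layer outputs H (N x n_hidden) with random parameters."""
    n_inputs = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))  # random input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)              # random hidden biases (assumed range)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                 # sigmoid activation (assumed)
    return H, W, b

rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(20, 5))  # inputs normalized into [0, 1]
H_demo, W_demo, b_demo = elm_hidden_layer(X_demo, n_hidden=10, rng=rng)
```

Each row of `H_demo` plays the role of the mapped input vector $\bm{h}_{i}$, so any of the criteria above can be applied to it without change.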

IV-A Linear Regression

Consider a simple example in which the data are generated by a two-dimensional linear system ${y_{i}}={\bm{w}^{*}}^{T}{\bm{x}_{i}}+{\rho_{i}}$ , where ${\bm{w}^{*}}={[1,2]^{T}}$ and ${\rho_{i}}$ is an additive noise. The input samples $\{{\bm{x}_{i}}\}$ are uniformly distributed over $[-2,2]\times[-2,2]$ . The noise ${\rho_{i}}$ combines two mutually independent noises, namely the inner noise ${B_{i}}$ and the outlier noise ${O_{i}}$ . Specifically, ${\rho_{i}}$ is given by $\rho_{i}=(1-g_{i})B_{i}+g_{i}O_{i}$ , where ${g_{i}}$ is a binary variable with probability mass $\Pr\{g_{i}=1\}=p$ , $\Pr\{g_{i}=0\}=1-p$ $(0\leq p\leq 1)$ , which is assumed to be independent of both $B_{i}$ and $O_{i}$ . In this example, $p$ is set to $0.1$ , and the outlier $O_{i}$ is drawn from a zero-mean Gaussian distribution with variance $10000$ . For the inner noise $B_{i}$ , we consider four distributions, some zero-mean and some non-zero-mean: 1) $\mathcal{N}(0,2)$ , where $\mathcal{N}(u,\sigma^{2})$ denotes the Gaussian PDF with mean $u$ and variance ${\sigma^{2}}$ ; 2) $\mathcal{N}(3,1)$ ; 3) the Laplace distribution with zero mean and variance 1; 4) the chi-square distribution with three degrees of freedom. The root mean squared error (RMSE) is employed to measure the performance, computed by $RMSE=\sqrt{\frac{1}{2}{{\left\|{{\bm{w}_{k}}-{\bm{w}^{*}}}\right\|}^{2}}}$ , where $\bm{w}_{k}$ and $\bm{w}^{*}$ denote the estimated and the target weight vectors, respectively.
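The data-generating setup described above can be sketched as follows. The function names `inner_noise`, `make_dataset` and `weight_rmse` are hypothetical helpers introduced for this example; the Laplace scale $1/\sqrt{2}$ follows from the variance of a Laplace distribution being $2b^{2}$, and the outlier standard deviation 100 corresponds to the stated variance of 10000.

```python
import numpy as np

def inner_noise(case, size, rng):
    """Draw inner-noise samples B_i for the four cases in the text."""
    if case == 1:
        return rng.normal(0.0, np.sqrt(2.0), size)         # N(0, 2): variance 2
    if case == 2:
        return rng.normal(3.0, 1.0, size)                  # N(3, 1)
    if case == 3:
        return rng.laplace(0.0, 1.0 / np.sqrt(2.0), size)  # Laplace, zero mean, variance 1
    if case == 4:
        return rng.chisquare(3, size)                      # chi-square, 3 degrees of freedom
    raise ValueError("case must be 1, 2, 3 or 4")

def make_dataset(case, n_samples=400, p_outlier=0.1, rng=None):
    """Generate (X, y, w_true) with rho_i = (1 - g_i) B_i + g_i O_i."""
    rng = np.random.default_rng() if rng is None else rng
    w_true = np.array([1.0, 2.0])
    X = rng.uniform(-2.0, 2.0, size=(n_samples, 2))
    g = rng.random(n_samples) < p_outlier    # g_i = 1 with probability p
    B = inner_noise(case, n_samples, rng)
    O = rng.normal(0.0, 100.0, n_samples)    # zero-mean Gaussian, variance 10000
    rho = np.where(g, O, B)
    return X, X @ w_true + rho, w_true

def weight_rmse(w_est, w_true):
    """RMSE = sqrt(0.5 * ||w_est - w_true||^2) for the two-dimensional weight vector."""
    d = np.asarray(w_est, dtype=float) - np.asarray(w_true, dtype=float)
    return float(np.sqrt(0.5 * np.sum(d**2)))
```

For instance, `make_dataset(2, rng=np.random.default_rng(0))` returns 400 samples with the non-zero-mean inner noise of case 2).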

We compare the performance of three optimization criteria, namely MMSE, MCC and MCC-VC. For MMSE, there is a closed-form solution, so no iteration is needed. For MCC and MCC-VC, a fixed-point iteration is used to solve the model (see [23] for the fixed-point algorithm under MCC). The mean $\pm$ deviation results of the RMSE and the training time averaged over 100 Monte Carlo runs are presented in Table I. In the simulation, the sample number is $N=400$ , the iteration number is $K=100$ , and the initial weight vector is set to ${\bm{w}_{0}}={[0,0]^{T}}$ . For each criterion, the parameters are selected by trial and error to achieve the best results, except that the kernel bandwidth and center location of MCC-VC are chosen by solving the optimization (18). The finite kernel bandwidth set $\mathcal{S}$ is equally spaced over $[0.2,5.0]$ with step size 0.2, and the center set $\mathcal{C}$ is equally spaced over $[-5.0,5.0]$ with step size 0.1. From Table I, we observe: i) MCC and MCC-VC significantly outperform MMSE, although neither has a closed-form solution; ii) MCC-VC achieves a lower RMSE than MCC for the non-zero-mean noises of cases 2) and 4), because the center of the cost function is set adaptively according to the error PDF, while the two criteria perform comparably for the zero-mean noises of cases 1) and 3); iii) MCC-VC saves much time by solving (18) to find the best values of the parameters $\sigma$ and $c$ , without performing trial and error to optimize the two parameters. Under the noise of case 2), the error distribution and the corresponding Gaussian kernel function $G_{\sigma^{*}}(e-c^{*})$ optimized by (18) at the first and second fixed-point iterations of MCC-VC are shown in Fig. 1. As expected, the Gaussian kernel function matches the error distribution very well.

IV-B ELM Based Regression for Benchmark Datasets

In the second example, we use seven benchmark data sets from the UCI machine learning repository [24] to compare the regression performance of the MCC-VC based ELM (ELM-MCC-VC) with that of the MCC based ELM (ELM-RCC) [22] and the regularized ELM (RELM) [21]. The descriptions of the data sets are given in Table III. In the simulation, the training and testing samples from each data set are randomly chosen and the data values are normalized into [0, 1]. The parameters of each algorithm are selected through fivefold cross-validation, except that the kernel bandwidth and center location of MCC-VC are chosen by solving (18). Specifically, we set the kernel center of MCC-VC to the median value of the error samples and optimize only the kernel width $\sigma$ by solving (18). The finite kernel bandwidth set $\mathcal{S}$ is equally spaced over $[0.1,2.0]$ with step size 0.1. The training and testing RMSEs over 100 runs are presented in Table II. ELM-MCC-VC achieves the lowest testing RMSE on all seven data sets. The gain is largest on the Yacht data set, where its testing RMSE is $0.0232$ against $0.0769$ for ELM-RCC and $0.0823$ for RELM.

V Conclusion

The kernel function in correntropy is in general a Gaussian function whose center is always located at zero. In this paper, we extended the correntropy to the case where the center can be located at any position. On this basis, the maximum correntropy criterion with variable center (MCC-VC) was proposed. In addition, we proposed an efficient method to optimize the kernel width and center location in MCC-VC. Regression results with linear-in-parameters (LIP) models have shown the desirable performance of the new method.

Bibliography

[1] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives. Springer Science & Business Media, 2010.
[2] S.-C. Pei and C.-C. Tseng, “Least mean p-power error criterion for adaptive FIR filter,” IEEE Journal on Selected Areas in Communications, vol. 12, no. 9, pp. 1540–1547, 1994.
[3] D. Erdogmus and J. C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1035–1044, 2002.
[4] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-Gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
[5] B. Chen, P. Zhu, and J. C. Principe, “Survival information potential: a new criterion for adaptive system training,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1184–1194, 2012.
[6] M. O. Sayin, N. D. Vanli, and S. S. Kozat, “A novel family of adaptive filtering algorithms based on the logarithmic cost,” IEEE Transactions on Signal Processing, vol. 62, no. 17, pp. 4411–4424, 2014.
[7] B. Chen, L. Xing, H. Zhao, N. Zheng, and J. C. Príncipe, “Generalized correntropy for robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3376–3387, 2016.
[8] B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, and J. C. Principe, “Kernel risk-sensitive loss: Definition, properties and application to robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2888–2901, 2017.