Maximum Correntropy Criterion with Variable Center
Badong Chen, Xin Wang, Yingsong Li, Jose C. Principe

TL;DR
This paper introduces a novel extension of the maximum correntropy criterion that allows the kernel center to vary, improving flexibility and performance in signal processing tasks.
Contribution
The paper proposes MCC-VC, an extended correntropy measure with a variable kernel center, along with an optimization approach for kernel parameters.
Findings
Enhanced regression performance in simulations
Flexible kernel positioning improves robustness
Efficient optimization of kernel parameters
Abstract
Correntropy is a local similarity measure defined in kernel space, and the maximum correntropy criterion (MCC) has been successfully applied in many areas of signal processing and machine learning in recent years. The kernel function in correntropy is usually restricted to the Gaussian function with its center located at zero. However, a zero-mean Gaussian function may not be a good choice for many practical applications. In this study, we propose an extended version of correntropy, whose center can be located at any position. Accordingly, we propose a new optimization criterion called the maximum correntropy criterion with variable center (MCC-VC). We also propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear in parameters (LIP) models confirm the desirable performance of the new method.

Table I: RMSE and training time of MMSE, MCC and MCC-VC in linear regression

| Case | Metric | MMSE | MCC | MCC-VC |
|---|---|---|---|---|
| case 1) | RMSE | 0.0765 | | |
| | TIME(sec) | N/A | | |
| case 2) | RMSE | | | |
| | TIME(sec) | N/A | | |
| case 3) | RMSE | | | |
| | TIME(sec) | N/A | | |
| case 4) | RMSE | | | |
| | TIME(sec) | N/A | | |


Table III: Training and testing RMSEs of RELM, ELM-RCC and ELM-MCC-VC

| Datasets | RELM Training RMSE | RELM Testing RMSE | ELM-RCC Training RMSE | ELM-RCC Testing RMSE | ELM-MCC-VC Training RMSE | ELM-MCC-VC Testing RMSE |
|---|---|---|---|---|---|---|
| Servo | | | | | | |
| Airfoil | | | | | | |
| Concrete | | | | | | |
| Housing | | | | | | |
| Yacht | | | | | | |
| Wine-red | | | | | | |
| Slump | | | | | | |


Table II: Description of the benchmark data sets

| Datasets | Features | Training observations | Testing observations |
|---|---|---|---|
| Servo | 5 | 83 | 83 |
| Airfoil | 5 | 751 | 751 |
| Concrete | 9 | 515 | 515 |
| Housing | 14 | 253 | 253 |
| Yacht | 6 | 154 | 154 |
| Wine-red | 12 | 799 | 799 |
| Slump | 10 | 52 | 51 |
Badong Chen, Xin Wang, Yingsong Li, and Jose C. Principe
This work was supported by the National Key R&D Program of China (No. 2017YFB1002501), the 973 Program (No. 2015CB351703) and the National NSF of China (No. 91648208, No. U1613219).
Badong Chen and Xin Wang ([email protected] and wangxin0420@stu.xjtu.edu.cn) are with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, Shaanxi, China.
Yingsong Li ([email protected]) is with the College of Information and Communication Engineering, Harbin Engineering University, Harbin 150001, China, and also with the National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China.
Jose C. Principe ([email protected]) is with the School of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an 710049, Shaanxi, China, and also with the Department of Electrical and Computer Engineering, University of Florida, Gainesville, USA.
Index Terms:
Correntropy, maximum correntropy criterion (MCC), maximum correntropy criterion with variable center (MCC-VC), robust learning.
I Introduction
One of the most important problems in machine learning is how to approximate a target random variable ($T$) knowing another ($X$). This is a central problem in supervised learning, where we design a model ($M$) that receives a random variable $X$ and outputs $Y = M(X)$, which should approximate $T$ in some sense. This requires the definition of a loss function (or a similarity measure) to compare $Y$ with $T$. The minimum mean square error (MMSE) criterion is widely used, where the loss function is $E[e^2]$, with $e = T - Y$ being the error variable and $E[\cdot]$ the expectation operator. The MMSE is generally computationally simple and mathematically tractable, but its learning performance may degrade seriously when non-Gaussian noises are present in the variables [1].
To improve the learning performance in non-Gaussian noises, a variety of non-MMSE criteria have been proposed in the literature [1, 2, 3, 4, 5, 6, 7, 8]. In recent years in particular, the maximum correntropy criterion (MCC) has found many successful applications in signal processing and machine learning, and it is especially useful when the signals are contaminated by heavy-tailed impulsive noises [9, 10, 11, 12, 13, 14, 15]. Under the MCC, an optimal model can be obtained by maximizing the correntropy between the target variable $T$ and the output $Y$ [4]:
$$M^{*} = \arg\max_{M \in \mathcal{M}} V_{\sigma}(T, Y) = \arg\max_{M \in \mathcal{M}} E[\kappa_{\sigma}(e)] \tag{1}$$
where $M^{*}$ is the optimal model, $\mathcal{M}$ stands for the model's hypothesis space, and $V_{\sigma}(T, Y) = E[\kappa_{\sigma}(e)]$ denotes the correntropy between $T$ and $Y$, with $\kappa_{\sigma}(\cdot)$ being the Gaussian kernel function:
$$\kappa_{\sigma}(e) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{e^{2}}{2\sigma^{2}}\right) \tag{2}$$
where $\sigma > 0$ is the kernel bandwidth. Since the Gaussian kernel function is a local function of the error variable $e$, the correntropy can be used as an outlier-robust error measure in signal processing and machine learning [1]. However, the center of the Gaussian kernel in correntropy is always located at zero, which may not be a good choice in many practical situations. In particular, when the error distribution is non-zero-mean, the original correntropy may perform poorly, because in this case the zero-mean Gaussian function usually cannot match the error distribution well. The goal of the present paper is thus to extend the correntropy to the case where the center can be located anywhere, which can potentially improve the learning performance significantly but is still not fully appreciated in the community.
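As a rough illustration of why a Gaussian-kernel similarity is called outlier-robust, the following sketch compares the sample mean square error with a sample-mean estimate of the correntropy on synthetic data. The data, the 1% outlier rate, the outlier size and the bandwidth of 1.0 are all assumptions chosen for illustration, and the function names are hypothetical; none of them come from the paper.

```python
import numpy as np

def gaussian_kernel(e, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated elementwise."""
    e = np.asarray(e, dtype=float)
    return np.exp(-e**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def empirical_correntropy(t, y, sigma):
    """Sample-mean estimate of the correntropy E[kappa_sigma(T - Y)]."""
    e = np.asarray(t, dtype=float) - np.asarray(y, dtype=float)
    return float(np.mean(gaussian_kernel(e, sigma)))

# Illustrative data (assumed values, not taken from the paper).
rng = np.random.default_rng(0)
t = rng.normal(size=1000)
y_clean = t + 0.1 * rng.normal(size=1000)
y_dirty = y_clean.copy()
y_dirty[:10] += 100.0  # corrupt 1% of the outputs with gross outliers

mse_clean = float(np.mean((t - y_clean) ** 2))
mse_dirty = float(np.mean((t - y_dirty) ** 2))
v_clean = empirical_correntropy(t, y_clean, sigma=1.0)
v_dirty = empirical_correntropy(t, y_dirty, sigma=1.0)

print(f"MSE:         clean={mse_clean:.4f}  with outliers={mse_dirty:.4f}")
print(f"Correntropy: clean={v_clean:.4f}  with outliers={v_dirty:.4f}")
```

In this setup the few outliers dominate the mean square error, while each of them can lower the correntropy estimate by at most its own kernel value divided by the sample size, so the estimate changes only slightly.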
The rest of the paper is organized as follows. In Section II, we define the correntropy with variable center and propose the maximum correntropy criterion with variable center (MCC-VC). In Section III, we propose an efficient approach to optimize the kernel width and center location in MCC-VC. Simulation results of regression with linear in parameters (LIP) models are then presented in Section IV. Finally, the conclusion is given in Section V.
II Maximum Correntropy Criterion with Variable Center
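In this work, we define the correntropy with variable center between $T$ and $Y$ as follows:
$$V_{\sigma,c}(T, Y) = E[\kappa_{\sigma}(e - c)] = E[\kappa_{\sigma}(T - Y - c)] \tag{3}$$
where $c \in \mathbb{R}$ is the center location. The above definition reduces to the original correntropy when $c = 0$.
Similar to the original correntropy [4], the correntropy with center $c$ also involves all the even moments of the error $e$ about the center $c$, that is
$$V_{\sigma,c}(T, Y) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{n=0}^{\infty} \frac{(-1)^{n}}{2^{n} n!} E\left[\frac{(e - c)^{2n}}{\sigma^{2n}}\right] \tag{4}$$
As $\sigma$ increases, the high-order moments about the center $c$ decay faster, hence the second-order moment tends to dominate the value. In particular, when $c = E[e]$ and $\sigma \to \infty$, maximizing the correntropy with center $c$ is equivalent to minimizing the error’s variance.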
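The large-bandwidth claim can be made explicit with a short worked step. This is a sketch that assumes the moments of the error exist, writes $\sigma$ for the kernel bandwidth and $c$ for the center, and keeps only the first two terms of the Taylor expansion of the Gaussian kernel about the center:

```latex
V_{\sigma,c}(T,Y)
  = \frac{1}{\sqrt{2\pi}\,\sigma}\,
    E\!\left[\exp\!\left(-\frac{(e-c)^{2}}{2\sigma^{2}}\right)\right]
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \left(1 - \frac{E\big[(e-c)^{2}\big]}{2\sigma^{2}} + O\!\left(\sigma^{-4}\right)\right),
\\[4pt]
c = E[e] \;\Longrightarrow\; E\big[(e-c)^{2}\big] = \operatorname{Var}(e)
\;\Longrightarrow\;
\arg\max_{M} V_{\sigma,c}(T,Y) \;\approx\; \arg\min_{M} \operatorname{Var}(e)
\quad (\sigma \to \infty).
```

For a fixed bandwidth the prefactor does not depend on the model, so the model only enters through the second-order term once the higher-order terms are negligible.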
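In addition, when the bandwidth of the Gaussian kernel shrinks to zero ($\sigma \to 0^{+}$), the correntropy with center $c$ approaches the value of $\int p_{TY}(t, t - c)\,dt$, where $p_{TY}(t, y)$ is the joint probability density function (PDF) of $(T, Y)$. This can easily be proved as follows
$$\lim_{\sigma \to 0^{+}} V_{\sigma,c}(T, Y) = \lim_{\sigma \to 0^{+}} \iint \kappa_{\sigma}(t - y - c)\, p_{TY}(t, y)\, dt\, dy = \iint \delta(t - y - c)\, p_{TY}(t, y)\, dt\, dy = \int p_{TY}(t, t - c)\, dt \tag{5}$$
where $\delta(\cdot)$ denotes the Dirac delta function. In this case, we also have
$$\lim_{\sigma \to 0^{+}} V_{\sigma,c}(T, Y) = \lim_{\sigma \to 0^{+}} \int \kappa_{\sigma}(e - c)\, p_{e}(e)\, de = \int \delta(e - c)\, p_{e}(e)\, de = p_{e}(c) \tag{6}$$
Therefore, when $\sigma \to 0^{+}$, the correntropy with center $c$ also approaches the value of $p_{e}(e)$ evaluated at $e = c$, where $p_{e}(\cdot)$ denotes the error’s PDF.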
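The effect of moving the center, and the small-bandwidth limit, can be checked numerically. The sketch below is a minimal illustration under assumed conditions: the errors are drawn from a Gaussian with mean 3 and unit variance (an assumption for this example), `correntropy_vc` is a hypothetical helper name, and the correntropy with center is estimated by a sample mean.

```python
import numpy as np

def correntropy_vc(errors, sigma, c):
    """Sample-mean estimate of E[kappa_sigma(e - c)] for bandwidth sigma and center c."""
    u = np.asarray(errors, dtype=float) - c
    return float(np.mean(np.exp(-u**2 / (2.0 * sigma**2))
                         / (np.sqrt(2.0 * np.pi) * sigma)))

# Assumed error distribution for illustration: e ~ N(3, 1), so p_e(3) = 1/sqrt(2*pi).
rng = np.random.default_rng(1)
errors = rng.normal(loc=3.0, scale=1.0, size=200_000)

v_zero_center = correntropy_vc(errors, sigma=1.0, c=0.0)  # center at zero (original correntropy)
v_shifted = correntropy_vc(errors, sigma=1.0, c=3.0)      # center at the error mean
v_narrow = correntropy_vc(errors, sigma=0.1, c=3.0)       # small bandwidth
p_e_at_c = 1.0 / np.sqrt(2.0 * np.pi)                     # true error PDF at c = 3

print(f"c=0: {v_zero_center:.4f}   c=3: {v_shifted:.4f}")
print(f"small-bandwidth estimate: {v_narrow:.4f}   p_e(3): {p_e_at_c:.4f}")
```

With these assumed errors, the estimate is much larger when the center sits at the error mean than at zero, and the small-bandwidth estimate is close to the error PDF evaluated at the center.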
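The optimal model under the maximum correntropy criterion with variable center (MCC-VC) is defined by
$$M^{*} = \arg\max_{M \in \mathcal{M}} V_{\sigma,c}(T, Y) = \arg\max_{M \in \mathcal{M}} E[\kappa_{\sigma}(e - c)] \tag{7}$$
To demonstrate how to obtain the optimal solution with finite training samples (by optimizing an empirical risk function), we consider the following linear in parameter (LIP) model:
$$y_{i} = \mathbf{h}_{i} \boldsymbol{\beta}, \quad i = 1, \dots, N \tag{8}$$
where $\{x_{i}, y_{i}\}_{i=1}^{N}$ are the input-output samples, $\mathbf{h}_{i} = [\varphi_{1}(x_{i}), \dots, \varphi_{L}(x_{i})]$ is the $i$-th nonlinearly mapped input vector (a row vector), with $\varphi_{j}(\cdot)$ being the $j$-th nonlinear mapping function ($j = 1, \dots, L$), and $\boldsymbol{\beta} = [\beta_{1}, \dots, \beta_{L}]^{T}$ is the output weight vector that needs to be learned. Given $N$ target samples $\{t_{i}\}_{i=1}^{N}$, the output weight vector $\boldsymbol{\beta}$ can be trained by minimizing the following regularized MMSE cost:
$$J_{\mathrm{MMSE}}(\boldsymbol{\beta}) = \|\mathbf{T} - \mathbf{Y}\|^{2} + \lambda \|\boldsymbol{\beta}\|^{2} \tag{9}$$
where $\mathbf{T} = [t_{1}, \dots, t_{N}]^{T}$, $\mathbf{Y} = [y_{1}, \dots, y_{N}]^{T}$, and $\lambda \geq 0$ is the regularization parameter. In this case, the optimal solution can easily be obtained as
$$\boldsymbol{\beta} = \left(\mathbf{H}^{T}\mathbf{H} + \lambda \mathbf{I}\right)^{-1} \mathbf{H}^{T} \mathbf{T} \tag{10}$$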
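where $\mathbf{H}$ is an $N \times L$ dimensional matrix with $\mathbf{H}_{ij} = \varphi_{j}(x_{i})$. Similarly, one can solve for $\boldsymbol{\beta}$ by minimizing the following regularized MCC-VC cost:
$$J_{\mathrm{MCC\text{-}VC}}(\boldsymbol{\beta}) = -\frac{1}{N} \sum_{i=1}^{N} \kappa_{\sigma}(e_{i} - c) + \lambda \|\boldsymbol{\beta}\|^{2} \tag{11}$$
where $e_{i} = t_{i} - \mathbf{h}_{i}\boldsymbol{\beta}$ is the $i$-th error sample. Setting $\partial J_{\mathrm{MCC\text{-}VC}}(\boldsymbol{\beta}) / \partial \boldsymbol{\beta} = 0$, one can derive
$$\boldsymbol{\beta} = \left(\mathbf{H}^{T} \boldsymbol{\Lambda} \mathbf{H} + \lambda' \mathbf{I}\right)^{-1} \mathbf{H}^{T} \boldsymbol{\Lambda} \mathbf{T}_{c} \tag{12}$$
where $\lambda' = 2N\sigma^{2}\lambda$, $\mathbf{T}_{c} = [t_{1} - c, \dots, t_{N} - c]^{T}$, and $\boldsymbol{\Lambda}$ is a diagonal matrix with diagonal elements $\boldsymbol{\Lambda}_{ii} = \kappa_{\sigma}(e_{i} - c)$.
The solution (12) is a fixed-point equation since the diagonal matrix $\boldsymbol{\Lambda}$ on the right-hand side depends on the weight vector $\boldsymbol{\beta}$ via $e_{i} = t_{i} - \mathbf{h}_{i}\boldsymbol{\beta}$. Therefore, the optimal solution under MCC-VC can be obtained by using the following fixed-point iteration:
$$\boldsymbol{\beta}_{k} = \left(\mathbf{H}^{T} \boldsymbol{\Lambda}(\boldsymbol{\beta}_{k-1}) \mathbf{H} + \lambda' \mathbf{I}\right)^{-1} \mathbf{H}^{T} \boldsymbol{\Lambda}(\boldsymbol{\beta}_{k-1}) \mathbf{T}_{c} \tag{13}$$
where $\boldsymbol{\beta}_{k}$ is the estimated weight vector at the $k$-th iteration and $\boldsymbol{\Lambda}(\boldsymbol{\beta}_{k-1})$ denotes $\boldsymbol{\Lambda}$ evaluated with the errors produced by $\boldsymbol{\beta}_{k-1}$.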
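The fixed-point update can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes a correntropy cost scaled by 1/N (so the effective regularizer is 2Nσ²λ), keeps the kernel width and center fixed, and uses synthetic data whose sizes, noise levels and outlier rate are assumptions for illustration. The function name `mcc_vc_fixed_point` is hypothetical.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated elementwise."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def mcc_vc_fixed_point(H, T, sigma, c, lam=1e-6, n_iter=50, beta0=None):
    """Fixed-point iteration for regularized MCC-VC weights with sigma and c held fixed."""
    N, L = H.shape
    beta = np.zeros(L) if beta0 is None else np.asarray(beta0, dtype=float)
    lam_eff = 2.0 * N * sigma**2 * lam   # effective regularizer (assumes a 1/N-scaled cost)
    T_c = T - c                          # targets shifted by the kernel center
    for _ in range(n_iter):
        w = gaussian_kernel(T - H @ beta - c, sigma)  # diagonal of the weighting matrix
        HtW = H.T * w                                 # H^T Lambda without forming Lambda
        beta = np.linalg.solve(HtW @ H + lam_eff * np.eye(L), HtW @ T_c)
    return beta

# Synthetic example (all values below are illustrative assumptions).
rng = np.random.default_rng(3)
N = 400
H = rng.uniform(-2.0, 2.0, size=(N, 2))
beta_true = np.array([2.0, 1.0])
inner = 3.0 + 0.5 * rng.normal(size=N)     # non-zero-mean inner noise
outlier = 100.0 * rng.normal(size=N)       # large-variance outlier noise
is_outlier = rng.random(N) < 0.1
T = H @ beta_true + np.where(is_outlier, outlier, inner)

beta_hat = mcc_vc_fixed_point(H, T, sigma=2.0, c=3.0)
print(beta_hat)
```

Each pass is a weighted least-squares solve in which samples whose error lies far from the center receive a weight close to zero, which is how the outliers are suppressed.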
III Optimization of the Free Parameters in MCC-VC
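There are two free parameters in MCC-VC, namely the kernel width $\sigma$ and the center location $c$, whose values have significant influence on the learning performance. In this section, we propose an efficient approach to optimize the two parameters. First, we divide the correntropy with center $c$ into three terms:
$$V_{\sigma,c}(T, Y) = \int \kappa_{\sigma}(e - c)\, p_{e}(e)\, de = \frac{1}{2} \int \kappa_{\sigma}^{2}(e - c)\, de + \frac{1}{2} \int p_{e}^{2}(e)\, de - \frac{1}{2} \int \left(\kappa_{\sigma}(e - c) - p_{e}(e)\right)^{2} de \tag{14}$$
Since the first term is independent of the model $M$, we have
$$M^{*} = \arg\max_{M \in \mathcal{M}} V_{\sigma,c}(T, Y) = \arg\max_{M \in \mathcal{M}} U_{\sigma,c}(T, Y) \tag{15}$$
where $U_{\sigma,c}(T, Y) = \int p_{e}^{2}(e)\, de - \int \left(\kappa_{\sigma}(e - c) - p_{e}(e)\right)^{2} de$. Then we propose the following optimization:
$$(M^{*}, \sigma^{*}, c^{*}) = \arg\max_{M \in \mathcal{M},\, \sigma \in \Omega_{\sigma},\, c \in \Omega_{c}} U_{\sigma,c}(T, Y) \tag{16}$$
where $\Omega_{\sigma}$ and $\Omega_{c}$ denote the admissible sets of parameters $\sigma$ and $c$. Thus, the model $M$, the kernel width $\sigma$ and the center location $c$ are jointly optimized to maximize the function $U_{\sigma,c}(T, Y)$. To simplify the optimization, we adopt an alternating optimization approach: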
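i) When the model $M$ is fixed (hence the error’s distribution is fixed), the term $\int p_{e}^{2}(e)\, de$ is independent of $\sigma$ and $c$; in this case the two free parameters can simply be optimized by
$$(\sigma^{*}, c^{*}) = \arg\min_{\sigma \in \Omega_{\sigma},\, c \in \Omega_{c}} \int \left(\kappa_{\sigma}(e - c) - p_{e}(e)\right)^{2} de \tag{17}$$
ii) After the parameters $\sigma^{*}$ and $c^{*}$ have been determined, the model can then be optimized by maximizing the function (16) or (14) with $\sigma = \sigma^{*}$ and $c = c^{*}$.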
The above procedure can be repeated until convergence.
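As a worked check of the three-term split used in this section, here is a short sketch that assumes the error PDF $p_e$ is square-integrable, with $\sigma$ the kernel width and $c$ the center. Expanding the squared difference between the kernel and the error PDF recovers the correntropy with center $c$, and the kernel-only term has a closed form that depends on $\sigma$ alone:

```latex
\int \big(\kappa_\sigma(e-c) - p_e(e)\big)^{2}\,de
  = \int \kappa_\sigma^{2}(e-c)\,de
    - 2\int \kappa_\sigma(e-c)\,p_e(e)\,de
    + \int p_e^{2}(e)\,de
\\[4pt]
\Longrightarrow\quad
E\big[\kappa_\sigma(e-c)\big]
  = \tfrac{1}{2}\int \kappa_\sigma^{2}(e-c)\,de
    + \tfrac{1}{2}\int p_e^{2}(e)\,de
    - \tfrac{1}{2}\int \big(\kappa_\sigma(e-c) - p_e(e)\big)^{2}\,de,
\\[4pt]
\int \kappa_\sigma^{2}(e-c)\,de
  = \frac{1}{2\pi\sigma^{2}} \int \exp\!\left(-\frac{(e-c)^{2}}{\sigma^{2}}\right) de
  = \frac{1}{2\sigma\sqrt{\pi}}.
```

The last line shows that the kernel-only term does not involve the model or the center, which is why it can be dropped when only the model is optimized but must be kept when the kernel width is optimized.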
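From (17), one can see that the parameters $\sigma$ and $c$ are optimized such that the Gaussian kernel function $\kappa_{\sigma}(e - c)$ matches the error’s PDF $p_{e}(e)$ as closely as possible. This is consistent with intuition. The idea of PDF matching has been explored with great success in the literature of information theoretic learning (ITL) [1, 16, 17, 18]. Given $N$ error samples $\{e_{i}\}_{i=1}^{N}$, we have $E[\kappa_{\sigma}(e - c)] \approx \frac{1}{N} \sum_{i=1}^{N} \kappa_{\sigma}(e_{i} - c)$. It follows that
$$(\sigma^{*}, c^{*}) = \arg\min_{\sigma \in \Omega_{\sigma},\, c \in \Omega_{c}} \left\{ \int \kappa_{\sigma}^{2}(e - c)\, de - 2E[\kappa_{\sigma}(e - c)] \right\} \approx \arg\min_{\sigma \in \Omega_{\sigma},\, c \in \Omega_{c}} \left\{ \frac{1}{2\sigma\sqrt{\pi}} - \frac{2}{N} \sum_{i=1}^{N} \kappa_{\sigma}(e_{i} - c) \right\} \tag{18}$$
Remark: There are several approaches to solving the optimization problem in (18). For example, one can use a gradient-based method to search for the solution. In many practical situations, the optimal solution is searched for over a given finite set. To further simplify the computation, one can just set the parameter $c$ to the mean or median value of the error samples, and only optimize the kernel width $\sigma$.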
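The finite-set search mentioned in the Remark can be sketched as an exhaustive grid search. The code below is a minimal illustration, assuming the empirical objective 1/(2σ√π) minus twice the sample mean of the kernel evaluated at the centered errors; the grids and the Gaussian test errors (mean 3, unit variance) are assumptions for this example, and `select_sigma_c` is a hypothetical helper name.

```python
import numpy as np

def select_sigma_c(errors, sigmas, centers):
    """Grid search for the (sigma, c) pair minimizing the empirical PDF-matching objective."""
    e = np.asarray(errors, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # u[k, j, i] = (e_i - c_j) / sigma_k
    u = (e[None, None, :] - centers[None, :, None]) / sigmas[:, None, None]
    kernel_mean = (np.mean(np.exp(-0.5 * u**2), axis=2)
                   / (np.sqrt(2.0 * np.pi) * sigmas[:, None]))
    objective = 1.0 / (2.0 * np.sqrt(np.pi) * sigmas[:, None]) - 2.0 * kernel_mean
    k, j = np.unravel_index(np.argmin(objective), objective.shape)
    return float(sigmas[k]), float(centers[j])

# Assumed test errors: Gaussian with mean 3 and unit variance.
rng = np.random.default_rng(4)
errors = rng.normal(loc=3.0, scale=1.0, size=2000)
sigma_grid = np.arange(0.2, 3.01, 0.2)    # illustrative admissible set for sigma
center_grid = np.arange(0.0, 6.01, 0.1)   # illustrative admissible set for c

sigma_star, c_star = select_sigma_c(errors, sigma_grid, center_grid)
print(sigma_star, c_star)
```

For Gaussian errors the best-matching kernel has the same mean and standard deviation as the errors, so the selected pair should land near a width of 1 and a center of 3 in this example.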
Based on the above parameter optimization strategy, a robust regression algorithm with LIP models under MCC-VC can be obtained, which is referred to as LIP-MCC-VC and is described in Algorithm 1.
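The alternating procedure can be sketched as follows. This is a hedged illustration rather than a reproduction of Algorithm 1: it assumes the simplified variant from the Remark (center set to the median of the current errors, kernel width picked from a finite set), a 1/N-scaled cost for the regularizer, and synthetic data with assumed noise settings. The names `pick_sigma` and `lip_mcc_vc` are hypothetical.

```python
import numpy as np

def kernel(u, sigma):
    """Gaussian kernel with bandwidth sigma, evaluated elementwise."""
    return np.exp(-u**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def pick_sigma(errors, c, sigmas):
    """Kernel width from a finite set, minimizing the empirical PDF-matching objective."""
    scores = [1.0 / (2.0 * s * np.sqrt(np.pi)) - 2.0 * np.mean(kernel(errors - c, s))
              for s in sigmas]
    return float(sigmas[int(np.argmin(scores))])

def lip_mcc_vc(H, T, sigmas, lam=1e-6, n_iter=30):
    """Alternate between choosing (sigma, c) and one fixed-point weight update."""
    N, L = H.shape
    beta = np.zeros(L)
    for _ in range(n_iter):
        e = T - H @ beta
        c = float(np.median(e))            # simplified center choice (median of errors)
        sigma = pick_sigma(e, c, sigmas)   # kernel width matched to the current errors
        w = kernel(e - c, sigma)           # per-sample weights
        HtW = H.T * w
        reg = 2.0 * N * sigma**2 * lam     # assumes a 1/N-scaled cost
        beta = np.linalg.solve(HtW @ H + reg * np.eye(L), HtW @ (T - c))
    return beta, sigma, c

# Synthetic example (illustrative assumptions only).
rng = np.random.default_rng(5)
N = 400
H = rng.uniform(-2.0, 2.0, size=(N, 2))
beta_true = np.array([2.0, 1.0])
inner = rng.normal(loc=3.0, scale=1.0, size=N)
outlier = 100.0 * rng.normal(size=N)
T = H @ beta_true + np.where(rng.random(N) < 0.1, outlier, inner)

beta_hat, sigma_hat, c_hat = lip_mcc_vc(H, T, sigmas=np.arange(0.2, 5.01, 0.2))
print(beta_hat, sigma_hat, c_hat)
```

Re-selecting the kernel width and center at every pass lets the kernel track the error distribution as the weights improve, which is the design idea behind the alternating scheme.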
IV Simulation Results
In this section, we present simulation results of regression with LIP models to demonstrate the performance of the proposed method. We consider two LIP models: one is the linear regression model and the other is the extreme learning machine (ELM) [19, 20, 21, 22], a kind of single hidden layer feedforward neural network (SLFN) in which the input weights and biases of the hidden layer are randomly generated, and only the weights of the output layer need to be trained.
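To make the link between an ELM and a LIP model concrete, the sketch below builds the hidden-layer output matrix from randomly drawn input weights and biases. The sigmoid activation, the uniform sampling range and the layer size are assumptions for illustration (the text does not specify them), and `elm_hidden_layer` is a hypothetical helper name.

```python
import numpy as np

def elm_hidden_layer(X, n_hidden, rng):
    """Random hidden-layer mapping; returns the feature matrix H and the drawn (W, b)."""
    n_features = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # random input weights, never trained
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases, never trained
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # sigmoid activation (assumed)
    return H, W, b

rng = np.random.default_rng(6)
X = rng.uniform(0.0, 1.0, size=(100, 5))   # inputs normalized into [0, 1]
H, W, b = elm_hidden_layer(X, n_hidden=20, rng=rng)
print(H.shape)
```

Each row of `H` plays the role of a nonlinearly mapped input vector, so only the output weights remain to be learned, for example with a regularized least-squares or an MCC-VC solver.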
IV-A Linear Regression
Consider a simple example in which the data are generated by a two-dimensional linear system , where and is an additive noise. The input samples are uniformly distributed over . The noise comprises two mutually independent noises, namely the inner noise and the outlier noise . Specifically, is given by , where is a binary variable with probability mass , , , which is assumed to be independent of both and . In this example, is set at , and the outlier is drawn from a zero-mean Gaussian distribution with variance . For the inner noise , we consider four distributions, some zero-mean and some non-zero-mean: 1) $\mathcal{N}(0, 2)$, where $\mathcal{N}(\mu, \sigma^{2})$ denotes the Gaussian PDF with mean $\mu$ and variance $\sigma^{2}$; 2) $\mathcal{N}(3, 1)$; 3) the Laplace distribution with zero mean and unit variance; 4) the chi-square distribution with three degrees of freedom. The root mean squared error (RMSE) is employed to measure the performance, computed by , where and denote the estimated and the target weight vectors respectively.
We compare the performance of three optimization criteria, namely MMSE, MCC and MCC-VC. For MMSE, there is a closed-form solution, so no iteration is needed. For MCC and MCC-VC, a fixed-point iteration is used to solve the model (see [23] for the fixed-point algorithm under MCC). The mean and deviation results of the RMSE, together with the training time, averaged over 100 Monte Carlo runs, are presented in Table I. In the simulation, the sample number is $N$ = 400, the iteration number is 100, and the initial weight vector is set to . For each criterion, the parameters are selected by trial and error to achieve the best results, except that the kernel bandwidth $\sigma$ and center location $c$ of MCC-VC are chosen by solving the optimization (18). The finite kernel bandwidth set is equally spaced over with step size 0.2, and the center set is equally spaced over with step size 0.1. From Table I, we observe: i) MCC and MCC-VC can significantly outperform MMSE, although neither has a closed-form solution; ii) MCC-VC can achieve better performance than MCC, especially for non-zero-mean noises, because the center of the cost function can be set adaptively to a proper value according to the error PDF; iii) MCC-VC can save much time by solving (18) to find the best values of the parameters $\sigma$ and $c$, without resorting to trial and error to tune the two parameters. Under the noise of case 2), the error distribution and the corresponding Gaussian kernel function optimized by (18) at the first and second fixed-point iterations of MCC-VC are shown in Fig. 1. As expected, the Gaussian kernel function matches the error distribution very well.
IV-B ELM Based Regression for Benchmark Datasets
In the second example, we use seven benchmark data sets from the UCI machine learning repository [24] to confirm the superior regression performance of the MCC-VC based ELM (ELM-MCC-VC) compared with the MCC based ELM (ELM-RCC) [22] and the regularized ELM (RELM) [21]. The descriptions of the data sets are given in Table II. In the simulation, the training and testing samples from each data set are randomly chosen and the data values are normalized into [0, 1]. The parameters of each algorithm are selected through fivefold cross-validation, except that the kernel bandwidth and center location of MCC-VC are chosen by solving (18). We set the kernel center $c$ of MCC-VC to the median value of the error samples, and only optimize the kernel width $\sigma$ by solving (18). The finite kernel bandwidth set is equally spaced over with step size 0.1. The training and testing RMSEs over 100 runs are presented in Table III. Evidently, the ELM-MCC-VC outperforms the ELM-RCC and RELM on all the data sets. On the Yacht data set in particular, MCC-VC significantly outperforms the other methods.
V Conclusion
The kernel function in correntropy is in general a Gaussian function whose center is always located at zero. In this paper, we extended the correntropy to the case where the center can be located at any position. On this basis, the maximum correntropy criterion with variable center (MCC-VC) was proposed. In addition, we proposed an efficient method to optimize the kernel width and center location in MCC-VC. Regression results with linear in parameters (LIP) models have shown the desirable performance of the new method.
References
- [1] J. C. Principe, Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer Science & Business Media, 2010.
- [2] S.-C. Pei and C.-C. Tseng, “Least mean p-power error criterion for adaptive FIR filter,” IEEE Journal on Selected Areas in Communications, vol. 12, no. 9, pp. 1540–1547, 1994.
- [3] D. Erdogmus and J. C. Principe, “Generalized information potential criterion for adaptive system training,” IEEE Transactions on Neural Networks, vol. 13, no. 5, pp. 1035–1044, 2002.
- [4] W. Liu, P. P. Pokharel, and J. C. Príncipe, “Correntropy: Properties and applications in non-Gaussian signal processing,” IEEE Transactions on Signal Processing, vol. 55, no. 11, pp. 5286–5298, 2007.
- [5] B. Chen, P. Zhu, and J. C. Principe, “Survival information potential: a new criterion for adaptive system training,” IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1184–1194, 2012.
- [6] M. O. Sayin, N. D. Vanli, and S. S. Kozat, “A novel family of adaptive filtering algorithms based on the logarithmic cost,” IEEE Transactions on Signal Processing, vol. 62, no. 17, pp. 4411–4424, 2014.
- [7] B. Chen, L. Xing, H. Zhao, N. Zheng, and J. C. Príncipe, “Generalized correntropy for robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 64, no. 13, pp. 3376–3387, 2016.
- [8] B. Chen, L. Xing, B. Xu, H. Zhao, N. Zheng, and J. C. Principe, “Kernel risk-sensitive loss: Definition, properties and application to robust adaptive filtering,” IEEE Transactions on Signal Processing, vol. 65, no. 11, pp. 2888–2901, 2017.
