Robustness analysis of a Maximum Correntropy framework for linear regression
Laurent Bako

TL;DR
This paper presents a unified framework for robust linear regression using correntropy maximization, analyzing its robustness properties and providing bounds on estimation errors, with numerical illustrations of special cases.
Contribution
It introduces a general correntropy-based regression framework that encompasses Gaussian and Laplacian kernels, and analyzes its robustness and stability properties.
Findings
Bounded estimation error under certain conditions
Explicit error bounds derived and discussed
Numerical studies of special cases included
Abstract
In this paper we formulate a solution of the robust linear regression problem in a general framework of correntropy maximization. Our formulation yields a unified class of estimators which includes the Gaussian and Laplacian kernel-based correntropy estimators as special cases. An analysis of the robustness properties is then provided. The analysis includes a quantitative characterization of the informativity degree of the regression which is appropriate for studying the stability of the estimator. Using this tool, a sufficient condition is expressed under which the parametric estimation error is shown to be bounded. An explicit expression of the bound is given and its numerical computation is discussed. For illustration purposes, two special cases are studied numerically.
This paper was not presented at any IFAC meeting. Corresponding author: L. Bako. Tel.: +33 472 186 452.
Laurent Bako, Laboratoire Ampère, Ecole Centrale de Lyon, Université de Lyon, France
keywords:
robust estimation, system identification, maximum correntropy, outliers.
1 Introduction
Given a set of empirical observations generated by a system along with a class of parameterized candidate models, a parameter estimator is a function which maps the available data to the parameter space associated with the model class. A very desirable property for an estimator is robustness, which characterizes a relative insensitivity of the estimator to deviations of the observed data from the assumed model. More specifically, this property is central in situations where the data are prone to non-Gaussian noise or disturbances of possibly arbitrarily large amplitude (often called outliers). The quest for robust estimators has led to the development of many estimators such as the Least Absolute Deviation (LAD) [17, 14, 4, 2], the least median of squares [16], the least trimmed squares [17], and the class of M-estimators [11]. Evaluating formally to what extent a given estimator is robust requires setting a quantitative measure of robustness. Incidentally, such a measure can serve as a comparison criterion between different robust estimators. Generally, robustness is assessed in terms of the maximum proportion of outliers in the total data set that the estimator can handle while remaining stable (see for example the concept of breakdown point [17]). More recently the maximum correntropy [18, 12, 15] has emerged as an information-theoretic estimation framework which induces some robustness properties with respect to outliers. Although maximum correntropy estimation is closely related to M-estimation, its discovery has broadened the horizon of possibilities for designing robust identification schemes. As a matter of fact, it has been successfully applied to a variety of estimation problems such as linear/nonlinear regression, filtering, and face recognition in computer vision [9, 7, 10].
Contribution
Although maximum correntropy based estimators have enjoyed increasing success, the formal analysis of their robustness properties is still a largely open research question. In this paper we propose such an analysis for a class of maximum correntropy based estimators applying to linear regression problems. More precisely, the contribution of the current paper is articulated around the following three questions:
- To what extent is the maximum correntropy estimation framework robust to outliers? By robustness, we mean here a certain insensitivity of the estimator to errors of possibly arbitrarily large magnitude. To address this question, we derive parametric estimation error bounds induced by the estimator as a function of both the degree of richness of the regression data and the fraction of outliers. In summary, we show that if the regression data enjoy some richness properties and if the number of outliers is reasonably small, then the parametric estimation error remains stable. Indeed the proportion of outliers that the estimator is able to correct depends on how rich the regressor matrix is. Moreover, the estimation error appears to be a decreasing function of the richness measure.
- How does the richness of the training data set influence the robustness of the estimator, and how can it be characterized? We provide an appropriate characterization of the richness in terms of the number of regressor vectors which are strongly correlated with any given vector of the regression space. As such however, this quantitative measure of richness is not computable at an affordable price. To alleviate this difficulty the paper proposes some estimates of this measure, thus allowing for the approximation of the parametric estimation error bounds.
- Does the maximum correntropy estimator (MCE) possess the exact recovery property? We show that unlike the LAD estimator, the MCE is not able to return exactly the true parameter vector once the measurement is affected by a single arbitrary nonzero error. The proof is given for the Gaussian kernel based estimator.
We note that an analysis of robustness of the maximum correntropy has been presented recently in [5, 6]. However the analysis there is limited to the Gaussian kernel based correntropy and to a single parameter estimation problem. Moreover these works do not make clear how the properties of the data contribute to the robustness of the estimator.
Outline
The rest of this paper is organized as follows. Section 2 presents the robust regression problem and defines the class of maximum correntropy estimators whose properties are to be studied in the paper. It also introduces the general setting of the paper. The main analysis results are developed in Section 3. In Section 4 we run numerical experiments to illustrate the richness measure and the evolution of the derived error bounds with respect to the amount of noise. Finally, Section 5 contains concluding remarks concerning this work.
Notations
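$\mathbb{R}$ is the set of real numbers; $\mathbb{R}_+$ is the set of real nonnegative numbers; $\mathbb{N}$ is the set of natural integers; $\mathbb{C}$ denotes the set of complex numbers. $N$ will denote the number of data points and $\mathbb{I}=\{1,\ldots,N\}$ the associated index set. For any finite set $\mathcal{S}$, $|\mathcal{S}|$ refers to the cardinality of $\mathcal{S}$. However, whenever $x$ is a real (respectively complex) number, $|x|$ will refer to the absolute value (respectively modulus) of $x$. For $x=[x_1\ \cdots\ x_n]^\top\in\mathbb{R}^n$, $\|x\|_p$ will denote the $p$-norm of $x$ defined by $\|x\|_p=(|x_1|^p+\cdots+|x_n|^p)^{1/p}$ for $p\geq 1$, and $\|x\|_\infty=\max_{i}|x_i|$. The exponential of a real number $x$ will be denoted $e^{x}$ or $\exp(x)$ according to visual convenience; $\ln$ is the natural logarithm function. For a square and positive semi-definite matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote respectively the minimal and maximal eigenvalues of $A$.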
2 Robust regression problem
2.1 The data-generating system
Let $\{x_t\}$ and $\{y_t\}$ be some stochastic processes taking values respectively in $\mathbb{R}^n$ and $\mathbb{R}$. They are assumed to be related by an equation of the form
$$y_t = x_t^\top\theta^o + v_t, \qquad (1)$$
where $\{v_t\}$ represents an unobserved error sequence and $\theta^o\in\mathbb{R}^n$ is an unknown parameter vector. Eq. (1) may describe a static (memoryless) system or a dynamic one. In the latter case, we will conveniently assume that the so-called regressor (or explanatory vector) $x_t$ has the structure $x_t=[u_t\ u_{t-1}\ \cdots\ u_{t-n+1}]^\top$, i.e., (1) is an FIR-type (Finite Impulse Response) system, with $u_t$ then denoting its input signal at time $t$.
Assumption 1.
The joint stochastic process $\{(x_t,y_t)\}$ is independently and identically distributed.
While this assumption can hold naturally for a static system, it might not be satisfied in some practical situations. For example, if (1) is a dynamic system (for instance, of FIR-type), this assumption is not satisfied. (Footnote 1: Indeed this assumption can be relaxed to an appropriate notion of stationarity and ergodicity for the joint process $\{(x_t,y_t)\}$.) But as will be seen, its only role is to highlight the correntropic origin of the estimation framework considered in this paper.
Assumption 2.
The noise sequence $\{v_t\}$ satisfies the following: there is $\varepsilon\geq 0$ such that if we define the index sets $I_\varepsilon^0=\{t\in\mathbb{I}: |v_t|\leq\varepsilon\}$ and $I_\varepsilon^c=\{t\in\mathbb{I}: |v_t|>\varepsilon\}$, then the cardinality of $I_\varepsilon^0$ is "much larger" than that of $I_\varepsilon^c$.
We will formalize later in the paper what "much larger" can mean. As in [2], we can assume that $v_t$ is of the form $v_t=f_t+e_t$, where $\{f_t\}$ is a sparse noise sequence in the sense that only a few of its elements are different from zero. However, its nonzero elements are allowed to take on arbitrarily large values (called, in this case, outliers). As to $\{e_t\}$, it is assumed to be a bounded and dense (i.e., not necessarily sparse) noise sequence of rather moderate amplitude.
Problem
Given a finite collection $\{(x_t,y_t)\}_{t=1}^{N}$ of measurements obeying the system equation (1), the robust regression problem of interest here is the one of finding a reliable estimate of the parameter vector $\theta^o$ despite the effect of arbitrarily large errors.
Let $\theta\in\mathbb{R}^n$ denote a candidate parameter vector (PV) which we would like, ideally, to coincide with the true PV $\theta^o$. Given $\theta$ and $x_t$, the prediction we can make of $y_t$ is $\hat{y}_t(\theta)=x_t^\top\theta$. It is then the goal of the estimation method to select $\theta$ such that $y_t$ and $\hat{y}_t(\theta)$ are close in some sense for any $t\in\mathbb{I}$. Closeness will be measured in terms of the so-called maximum correntropy between the measured output $y_t$ and the predicted value $\hat{y}_t(\theta)$.
2.2 Maximum correntropy estimation
The correntropy is an information-theoretic measure of similarity between two arbitrary random variables [18, 12]. More specifically, consider two random variables $X$ and $Y$ defined on the same probability space, and taking values in $\mathbb{R}$. Let $\kappa:\mathbb{R}\times\mathbb{R}\to\mathbb{R}$ be a positive-definite kernel function on $\mathbb{R}$ (see e.g., [19, Chap. 2, p. 30] for a definition). The correntropy between $X$ and $Y$ with respect to a kernel function $\kappa$ is defined by
$$V_\kappa(X,Y)=\mathbb{E}\big[\kappa(X,Y)\big],$$
where $\mathbb{E}$ refers to the expected value with respect to the joint distribution of $(X,Y)$. In a more explicit form, we have
$$V_\kappa(X,Y)=\iint \kappa(x,y)\,p_{X,Y}(x,y)\,dx\,dy,$$
with $p_{X,Y}$ being the joint probability density function of $(X,Y)$. The correntropy constitutes a similarity measure between $X$ and $Y$ through the kernel $\kappa$. Although the original definition of correntropy in [18] fixes $\kappa$ to be the Gaussian kernel, it is indeed possible to extend it to any positive definite kernel function.
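We consider in this paper a kernel function of the form
$$\kappa(a,b)=\exp\big[-\gamma\,\ell(a-b)\big], \qquad (3)$$
where $\gamma>0$ is a user-specified parameter and $\ell:\mathbb{R}\to\mathbb{R}_+$ is a function which satisfies the following properties:
- P1. $\ell$ is positive-definite: $\ell(a)\geq 0$ and $\ell(a)=0$ if and only if $a=0$.
- P2. $\ell$ is symmetric: $\ell(-a)=\ell(a)$.
- P3. $\ell$ is nondecreasing on $\mathbb{R}_+$: $\ell(a)\leq\ell(b)$ whenever $|a|\leq|b|$.
- P4. There exists $\alpha_\ell>0$ such that $\ell(a-b)\geq\alpha_\ell\,\ell(a)-\ell(b)$ for all $a,b\in\mathbb{R}$.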
Property P4 can be interpreted as a relaxed version of the triangle inequality property for . We can characterize a family of functions satisfying P1-P4 as follows.
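Lemma 3 (Examples of functions obeying P1-P4).
For any real number $p\geq 1$, the function $\ell_p:\mathbb{R}\to\mathbb{R}_+$ defined by $\ell_p(a)=|a|^p$ satisfies the properties P1-P4. In particular, P4 is satisfied with $\alpha_\ell=2^{1-p}$.
Proof.
That $\ell_p$ satisfies P1-P3 is an obvious fact. As to Property P4, it follows from convexity. In effect the convexity of $\ell_p$ implies that for all $(a,b)\in\mathbb{R}^2$, $\ell_p\big(\tfrac{1}{2}a+\tfrac{1}{2}b\big)\leq\tfrac{1}{2}\ell_p(a)+\tfrac{1}{2}\ell_p(b)$. Multiplying by $2$ gives $2^{1-p}\ell_p(a+b)\leq\ell_p(a)+\ell_p(b)$, which by replacing $a$ with $a-b$ can be seen to be equivalent to P4 with $\alpha_\ell=2^{1-p}$. ∎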
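As a quick numerical sanity check of Lemma 3, the sketch below tests the relaxed triangle inequality on random pairs for a few exponents. It assumes that P4 reads $\ell(a-b)\geq\alpha_\ell\,\ell(a)-\ell(b)$, which is the form the convexity argument in the proof produces; the function names `ell` and `p4_holds` are illustrative and not taken from the paper.

```python
import random


def ell(a: float, p: float) -> float:
    """Loss ell_p(a) = |a|**p for an exponent p >= 1."""
    return abs(a) ** p


def p4_holds(a: float, b: float, p: float) -> bool:
    """Check ell(a - b) >= alpha * ell(a) - ell(b) with alpha = 2**(1 - p).

    A small relative slack absorbs floating-point rounding.
    """
    alpha = 2.0 ** (1.0 - p)
    slack = 1e-12 * (1.0 + ell(a, p) + ell(b, p))
    return ell(a - b, p) >= alpha * ell(a, p) - ell(b, p) - slack


rng = random.Random(0)
pairs = [(rng.uniform(-10, 10), rng.uniform(-10, 10)) for _ in range(10_000)]

# p = 1 is the Laplacian case (alpha = 1), p = 2 the Gaussian case (alpha = 1/2).
results = {p: all(p4_holds(a, b, p) for a, b in pairs) for p in (1.0, 1.5, 2.0, 3.0)}
print(results)
```

For $p=1$ the check reduces to the reverse triangle inequality, and for $p=2$ to $(a-b)^2+b^2\geq a^2/2$.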
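The correntropy maximization is an estimation framework where one tries to maximize the correntropy. In the regression problem stated above, we aim to find the parameter vector $\theta$ that maximizes $V_\kappa(y_t,x_t^\top\theta)$, the correntropy between $y_t$ and $\hat{y}_t(\theta)=x_t^\top\theta$ with respect to the kernel $\kappa$. (Footnote 2: By Assumption 1, $V_\kappa(y_t,x_t^\top\theta)$ is indeed constant, i.e., independent of $t$. Hence $t$ refers here to an arbitrary time index.) In practice however the joint distribution of $(x_t,y_t)$ is generally unknown, so that one cannot evaluate the exact correntropy. (Footnote 3: To be precise, the interest is in the joint distribution of $(y_t,x_t^\top\theta)$, but this follows from that of $(x_t,y_t)$.) As a consequence of this difficulty one would be content in practice with maximizing a sample estimate of the correntropy. Assume that we are given a set $\{(x_t,y_t)\}_{t=1}^{N}$ of data points sampled independently from the joint distribution of $(x_t,y_t)$. Then by virtue of Assumption 1, an estimate of the correntropy is given by
$$\hat{V}_N(\theta)=\frac{1}{N}\sum_{t=1}^{N}\exp\big[-\gamma\,\ell(y_t-x_t^\top\theta)\big] \qquad (4)$$
for all $\theta\in\mathbb{R}^n$. Hence the maximum correntropy estimator (MCE) studied in this paper is the possibly set-valued map $\Psi$ which maps the data $(y,X)$ to the parameter space,
$$\Psi(y,X)=\operatorname*{arg\,max}_{\theta\in\mathbb{R}^n}\hat{V}_N(\theta). \qquad (5)$$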
In the form (4)-(5) the MCE can be viewed as a particular instance of the prediction error estimation scheme [13, Chap. 7] with prediction error measured by . Also, the performance index (4) is reminiscent of the risk-sensitive estimation cost which is used in control, adaptive filtering and parameter estimation [3, 8]. But this latter approach, which roughly consists in the minimization of a sum of exponential of positive error terms, is not suitable for handling the effects of impulsive noise such as outliers.
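To make the performance index concrete, here is a minimal sketch of the sample correntropy, assuming it has the form $\frac{1}{N}\sum_t\exp[-\gamma\,\ell(y_t-x_t^\top\theta)]$ suggested by the kernel structure in (3). The function names and the three-point data set are illustrative only.

```python
import math


def gaussian_ell(e: float) -> float:
    """Square loss, which yields the Gaussian kernel."""
    return e * e


def laplacian_ell(e: float) -> float:
    """Absolute loss, which yields the Laplacian kernel."""
    return abs(e)


def sample_correntropy(theta, xs, ys, gamma, ell):
    """Average of exp(-gamma * ell(y_t - x_t^T theta)) over the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        residual = y - sum(xi * ti for xi, ti in zip(x, theta))
        total += math.exp(-gamma * ell(residual))
    return total / len(ys)


# Three noise-free points from theta_true, the last one hit by a gross outlier.
theta_true = (2.0, -1.0)
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [2.0, -1.0, 1000.0]

value_gauss = sample_correntropy(theta_true, xs, ys, 0.5, gaussian_ell)
value_laplace = sample_correntropy(theta_true, xs, ys, 0.5, laplacian_ell)
print(value_gauss, value_laplace)
```

Each term lies in $(0,1]$, so a single corrupted sample can lower the index by at most $1/N$ no matter how large the error is. This is the intuition behind the robustness analysed in Section 3.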
Although the focus of this paper is the analysis of the properties of the estimator (5), let us mention in passing that the underlying optimization problem in (5) is nonconvex. This implies that solving (5) numerically can be challenging. However, it can be interpreted as an iteratively reweighted least squares problem in the case, for example, where the kernel is taken to be the Gaussian kernel. We will get back to this in Section 4.
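The reweighted least squares interpretation can be sketched as follows for the Gaussian kernel. Setting the gradient of $\sum_t\exp[-\gamma(y_t-x_t^\top\theta)^2]$ to zero gives weighted normal equations with weights $w_t=\exp(-\gamma r_t^2)$, where $r_t$ is the current residual; iterating on them is a common heuristic. This is an illustrative sketch, not the implementation used for the experiments of the paper. The starting point (ordinary least squares), the iteration count and the test data are assumptions made for the example.

```python
import math
import random


def solve_linear(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(xs, ys, weights):
    """Minimize sum_t w_t * (y_t - x_t^T theta)^2 via the weighted normal equations."""
    n = len(xs[0])
    gram = [[0.0] * n for _ in range(n)]
    rhs = [0.0] * n
    for x, y, w in zip(xs, ys, weights):
        for i in range(n):
            rhs[i] += w * x[i] * y
            for j in range(n):
                gram[i][j] += w * x[i] * x[j]
    return solve_linear(gram, rhs)


def mce_gaussian_irls(xs, ys, gamma, iterations=50):
    """Heuristic fixed-point iteration for the Gaussian-kernel correntropy cost."""
    theta = weighted_least_squares(xs, ys, [1.0] * len(ys))  # ordinary LS start
    for _ in range(iterations):
        residuals = [y - sum(a * b for a, b in zip(x, theta)) for x, y in zip(xs, ys)]
        weights = [math.exp(-gamma * r * r) for r in residuals]  # large residual, tiny weight
        theta = weighted_least_squares(xs, ys, weights)
    return theta


# Hypothetical test problem: 2 parameters, 10% large outliers, small dense noise.
rng = random.Random(1)
theta_true = [1.0, -2.0]
xs = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
ys = []
for t, x in enumerate(xs):
    noise = rng.uniform(-0.1, 0.1)
    if t % 10 == 0:  # every tenth sample is an outlier of magnitude in [5, 20]
        noise += rng.choice([-1.0, 1.0]) * rng.uniform(5.0, 20.0)
    ys.append(sum(a * b for a, b in zip(x, theta_true)) + noise)

theta_ls = weighted_least_squares(xs, ys, [1.0] * len(ys))
theta_mce = mce_gaussian_irls(xs, ys, gamma=1.0)
print("least squares:", theta_ls)
print("MCE-G (IRLS): ", theta_mce)
```

Because the problem is nonconvex, the iteration only reaches a stationary point that depends on the initial guess. If every weight underflows to zero, the normal equations become singular, so a practical implementation would need a safeguard that this sketch omits.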
3 Robustness properties of the MCE
As discussed in the introduction, an estimator of the form (5) is intuitively thought (and empirically shown) to be endowed with some robustness properties. By this, we mean that it keeps behaving reasonably well when a certain fraction of the available data points are affected by noise components of possibly arbitrarily large magnitude. The question of main interest in this paper is to characterize quantitatively to what extent the estimator defined in (5) can be insensitive to outliers.
3.1 Data informativity
As will be seen, the robustness property is inherited from both the structure of the estimator and the richness of the regression data. We are therefore interested in formalizing as well that richness and how it contributes to the robustness properties of the estimator.
To proceed with the analysis, let us introduce some notations. For convenience we make the following assumption.
Assumption 4.
The regressor sequence $\{x_t\}$ satisfies: $x_t\neq 0$ for all $t\in\mathbb{I}$.
Note that Assumption 4 is without loss of generality. Under this assumption, let us pose
[TABLE]
with denoting Euclidean norm. Upon dividing the system equation (1) by , we can even assume that . Let be a real number. For any , define the index set
[TABLE]
with a matrix formed with all the regressors. Finally, let be the ratio between the minimum cardinality that can attain over all possible values of , and the number of columns in , i.e.,
[TABLE]
The number measures somehow the richness (or informativity/genericity) of the regression data. Intuitively, reflects a dense spanning of all directions of the vector space by the vectors . For a given , it is desired that be as large as possible. We will refer to it as the correlation measure of the matrix at the level .
It appears intuitively that is a decreasing function of . Clearly, we get for for finite while for . For a given matrix, it would be interesting to be able to evaluate numerically the quantitative measure of richness. Indeed, this value will be required for numerical assessment of the error bound to be derived in Section 3.2. However computing exactly the value of is a hard combinatorial problem.
We therefore discuss how to reach estimates of at an affordable cost. To this end, let be the matrix obtained from by normalizing its columns to unit -norm, i.e., for all . Then introduce the number
[TABLE]
which is solely a function of the matrix , hence the notation. Note that the so-defined lies necessarily in the real interval . Moreover, it can be usefully observed that , with referring to the square root of the minimum eigenvalue. Now for any consider the following index set
[TABLE]
where . It is assumed in the definition (10) that so that . For a given , collects the indices of the regressors which are the most correlated to in the sense that the cosine of the angle they form with is larger than . Finally, let
[TABLE]
be the ratio between the minimum cardinality of the finite set over all living in and the number of columns in . Then we can estimate as follows.
Proposition 5.
Let be a real matrix. Then, for all with ,
[TABLE]
with denoting the minimum eigenvalue of the matrix .
The proof of this proposition uses the following lemma.
Lemma 6 ([20, Thm 5.14]).
Let be such that with denoting the conjugate transpose of . Then
[TABLE]
Equality holds if and only if there exists such that either or .
Proof of Proposition 5.
The upper bound is immediate. To see this, let be the eigenvector associated with the smallest eigenvalue of . Then
[TABLE]
It follows that . The upper inequality in (12) follows by additionally taking into consideration the obvious fact that .
We now prove the inequality . To begin with, note from (9) that for any satisfying , there exists such that . Consider an index , such that for some . Then observe that is equivalent to
[TABLE]
On the other hand, by applying Lemma 6, we can write
[TABLE]
It follows that for to hold, it is sufficient that
[TABLE]
which in turn is equivalent to with by the assumption that . Hence, for to be greater than or equal to , it is enough that . This means that for a given , being in the index set defined in (10) is a sufficient condition for for all such that . Therefore hence implying that . Taking now the infimum produces . ∎
The key benefit of Proposition 5 is that it provides a method for estimating the measure defined in (8) at an affordable cost. Note however that while the upper bound in (12) can be computed easily, obtaining the lower bound is still challenging. The reason is that this bound involves the number in (9) whose numerical evaluation requires solving a nonconvex optimization problem. Nevertheless, it can be approximated through some heuristics, e.g. by solving a sequence of linear programs.
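A crude way to get a feel for such a measure is to sample directions at random. The sketch below does this for the cosine-based index set described around (10): for each sampled direction it counts the regressors whose absolute cosine with that direction is at least a threshold `c`, and keeps the smallest fraction found. Both the use of the absolute cosine and the free threshold `c`, which stands in for the level appearing in (10), are assumptions of this example. Since only finitely many directions are visited, the result can only overestimate the true infimum.

```python
import math
import random


def cosine_fraction(xs, direction, c):
    """Fraction of regressors x_t with |cos(angle(x_t, direction))| >= c."""
    dir_norm = math.sqrt(sum(d * d for d in direction))
    count = 0
    for x in xs:
        x_norm = math.sqrt(sum(v * v for v in x))
        cosine = abs(sum(a * b for a, b in zip(x, direction))) / (x_norm * dir_norm)
        if cosine >= c:
            count += 1
    return count / len(xs)


def sampled_richness(xs, c, num_directions=2000, seed=0):
    """Smallest cosine_fraction over randomly sampled directions (an upper estimate)."""
    rng = random.Random(seed)
    n = len(xs[0])
    best = 1.0
    for _ in range(num_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(n)]  # isotropic random direction
        best = min(best, cosine_fraction(xs, direction, c))
    return best


# Hypothetical data: 100 nonzero Gaussian regressors in dimension 2.
data_rng = random.Random(42)
regressors = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(100)]

for level in (0.0, 0.2, 0.6):
    print(level, sampled_richness(regressors, level))
```

With a fixed seed the same directions are reused for every threshold, so the sampled value is nonincreasing in `c`. This matches the intuition stated in Section 3.1 that the measure decreases as the required level grows.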
Remark 7.
In comparison to the classical concept of persistence of excitation (PE) in system identification, the richness property requiring that the correlation measure be large is a stronger property. In finite time, the quantitative persistence of excitation (called specifically sufficiency of excitation in this case) asks for the condition number of $XX^\top$ to be as close to $1$ as possible. The PE condition appears to be a global property of the matrix $X$, while the richness condition introduced here is a somewhat local property, as it basically counts the number of regressor vectors pointing in any direction of the regression space.
3.2 Main results
Equipped with the measure of informativity introduced above, we can now state the main result of this paper, which stands as follows.
Theorem 8.
Let and with denoting the noise sequence in (1). Let be a function obeying P1-P4. Assume that the following condition is satisfied for some ,
[TABLE]
Then for any with being generated by system (1), it holds that
[TABLE]
where
[TABLE]
If in addition, is strictly increasing on , then
[TABLE]
Proof.
Let
[TABLE]
Then for any , it holds that
[TABLE]
Taking in particular and invoking the system equation (1), it follows that
[TABLE]
where we have posed . This implies that
[TABLE]
With $|v_t|\leq\varepsilon$ for any $t\in I_{\varepsilon}^{0}$, we have $\ell(v_t)\leq\ell(\varepsilon)$ by the symmetry and nondecreasing properties of $\ell$. As a consequence, $\sum_{t\in I_{\varepsilon}^{0}}\exp\big[-\gamma\ell(v_{t})\big]\geq\sum_{t\in I_{\varepsilon}^{0}}\exp\big[-\gamma\ell(\varepsilon)\big]$. On the other hand, by the fourth property of the function , hence implying that . Combining these observations allows us to write
[TABLE]
In the last equality we have partitioned the set into and with being the complement of in . Note from (7) that for all , so that . Plugging these observations into the above inequality yields
[TABLE]
By observing that , we can rearrange the above inequality in the form
[TABLE]
Now by exploiting the definition of , we can observe that
[TABLE]
Moreover since
[TABLE]
the assumption (13) guarantees that . Therefore since the term on the right hand side of (17) is negative, it holds that
[TABLE]
Then direct algebraic calculations lead to
[TABLE]
where is defined as in (15). Indeed, in virtue of the assumption (13), is positive. Hence we have
[TABLE]
Of course, if is monotonically increasing on then it is invertible and the error bound in (16) follows. ∎
A few comments follow from this result. A key assumption of the theorem is condition (13). What it requires is, on the one hand, that the proportion of outliers be somehow small and, on the other hand, that the regression data be rich in the sense that the richness measure be large enough for a given nonzero level. An important lesson of this condition is that the richer the data matrix $X$, the larger the number of outliers that can be corrected by the estimator. We can interpret (13) as a sufficient condition for the stability of the estimator since it guarantees a bounded estimation error.
A second comment concerns the amplitude of the error bound given in (16). For the purpose of making this bound small, we need the constant to be close to one. Again we see that this is favored by a small number of outliers and a rich data set. An interesting special case is when , which occurs when the data are only affected by some outliers () and no dense noise. In this case the number defined in (15) reduces to
[TABLE]
which tends to suggest, since , that no exact recovery might be achieved once the data are affected by a single outlier unless we consider in (16) the limit case when . A similar observation was made in [6] in a comparable context. We will prove below that the MCE does not possess the exact recovery property, at least in the case when . In contrast, a robust estimator such as the LAD estimator (see, e.g., [1, 2]) is able to achieve exact recovery under a relatively significant proportion of nonzero errors.
Proposition 9.
Let Assumption 4 hold and assume that for any , . Take the function in (4) to be the square function, . For all , if then there exists a sequence such that when is generated from (1) under the action of .
Proof.
We start by observing that with being the square function, the cost is differentiable. Therefore, a necessary condition for to be in is that , where refers to the gradient. This, by using the system equation (1) and exploiting the assumption that for , can be translated into
[TABLE]
where $w_{t}(\theta)=\exp\big(-\gamma\,\ell(v_{t}-x_{t}^{\top}(\theta-\theta^{o}))\big)$. Note that the matrix on the left hand side of the equation above is finite regardless of the value of . Hence, in the event that , we would have
[TABLE]
with . Clearly, it is possible to find a nonzero sequence which does not meet this condition. Hence cannot be in for an arbitrary no matter how small the cardinality of is. ∎
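The stationarity argument can be checked numerically on a toy example. With the square loss, the gradient of $\sum_t\exp[-\gamma(y_t-x_t^\top\theta)^2]$ with respect to $\theta$ is $\sum_t 2\gamma\exp(-\gamma r_t^2)\,r_t\,x_t$, where $r_t=y_t-x_t^\top\theta$. The sketch below evaluates it at the true parameter vector, first on noise-free data and then with one corrupted sample. The data, the value of `gamma` and the function names are illustrative assumptions.

```python
import math


def dot(x, theta):
    """Inner product x^T theta."""
    return sum(a * b for a, b in zip(x, theta))


def correntropy_gradient(theta, xs, ys, gamma):
    """Gradient of sum_t exp(-gamma * (y_t - x_t^T theta)^2) with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        r = y - dot(x, theta)
        coeff = 2.0 * gamma * math.exp(-gamma * r * r) * r
        for i, xi in enumerate(x):
            grad[i] += coeff * xi
    return grad


theta_true = [1.0, -2.0]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
ys_clean = [dot(x, theta_true) for x in xs]

# Corrupt a single measurement by an error of size 1.
ys_outlier = list(ys_clean)
ys_outlier[2] += 1.0

grad_clean = correntropy_gradient(theta_true, xs, ys_clean, gamma=0.5)
grad_outlier = correntropy_gradient(theta_true, xs, ys_outlier, gamma=0.5)
print(grad_clean, grad_outlier)
```

On clean data the gradient vanishes at the true parameter vector. With a single nonzero error $v_s$ it equals $2\gamma\exp(-\gamma v_s^2)\,v_s\,x_s$, which is nonzero because $x_s\neq 0$, so the true parameter vector is no longer a stationary point.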
We now discuss some special instances of Theorem 8 corresponding to two kernels which are frequently used for estimation. For convenience of the discussion, let us introduce the following notation. Let
[TABLE]
whenever and otherwise.
Remark 10.
The bound (16) allows for some degree of freedom in the choice of the parameter . For a given function assumed to be invertible on , a better bound can, in principle, be obtained as
[TABLE]
subject to and condition (13). Although such a minimum might not be easy to compute exactly, one can make the error bound a little tighter by performing for example some grid search. In the same manner one can envision optimizing the parameter of the estimator.
3.3 Laplacian kernel
The Maximum Laplacian Correntropy estimator (MCE-L) corresponds to the case where the function $\ell$ in (3) is taken to be such that $\ell(a)=|a|$. As a result, the function in (4) takes the form
[TABLE]
It is straightforward to see that the properties P1-P4 are satisfied by $\ell$ with $\alpha_\ell=1$. Theorem 8 can be specialized to this case as follows.
Corollary 11 (Laplacian kernel).
Let be defined as in Theorem 8. Assume that the following condition is satisfied
[TABLE]
for some .
Then for any with defined as in (18), it holds that
[TABLE]
3.4 Gaussian kernel
The most used form of correntropy is the one based on the Gaussian kernel which, by omitting the normalizing factor, can be written in the form
[TABLE]
with . We will refer to the associated estimator as the maximum Gaussian correntropy estimator (MCE-G). Here, the function $\ell$ is defined by $\ell(a)=a^2$ and according to Lemma 3, it satisfies the properties P1-P4. In particular, P4 is satisfied with $\alpha_\ell=1/2$. Moreover $\ell$ is clearly monotonic on $\mathbb{R}_+$. As a consequence, we get a corollary of Theorem 8 as follows.
Corollary 12 (Gaussian kernel).
Let be defined as in Theorem 8. Assume that the following condition is satisfied
[TABLE]
for some .
Then for any , it holds that
[TABLE]
3.5 A remark on the error-in-variables scenario
We now consider the situation where only a noisy observation of the regressor vector in (1) is available for prediction. This scenario is referred to as the robust error-in-variable (EIV) regression problem. Then the predictor output is given by
[TABLE]
Indeed Theorem 8 remains valid for this case. To see this note that the system equation (1) can be rewritten as
[TABLE]
where . Then clearly the theorem applies to the EIV scenario with and , replaced respectively by and . One limitation however in this case is that for a given , the cardinality of the set is likely to be much smaller than in the situation where the regressors are noise-free.
4 Numerical experiments
The purpose of this section is to provide a numerical illustration of the richness measure (8) and of the estimation error bound (16). The system example considered for the experiment is of an FIR-type and is given by
[TABLE]
which can be written in the form (1) with and . For the data-generation experiment, assume that i.e., is sampled independently and identically from a zero-mean Gaussian distribution of unit variance. As for the noise signal , it is defined as with where refers to the uniform distribution and is a sequence of sparse noise with only a few nonzero elements (which are otherwise not constrained in magnitude); the nonzero elements of are here sampled from .
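A data-generation procedure of this kind can be sketched as follows. The FIR coefficients, noise amplitude, outlier proportion and outlier standard deviation used below are hypothetical placeholders, not the values of the paper, and the function name `generate_fir_data` is illustrative.

```python
import random


def generate_fir_data(theta_true, num_samples, dense_amp, outlier_fraction,
                      outlier_std, seed=0):
    """Simulate y_t = x_t^T theta_true + e_t + f_t with an FIR-type regressor.

    e_t is dense noise, uniform on [-dense_amp, dense_amp]; f_t is sparse and
    nonzero only on a random subset of the samples (the outliers).
    """
    rng = random.Random(seed)
    n = len(theta_true)
    u = [rng.gauss(0.0, 1.0) for _ in range(num_samples + n - 1)]  # white Gaussian input
    num_outliers = round(outlier_fraction * num_samples)
    outlier_idx = set(rng.sample(range(num_samples), num_outliers))
    xs, ys = [], []
    for t in range(num_samples):
        # x_t = [u_t, u_{t-1}, ..., u_{t-n+1}], with indices shifted by n - 1
        x = [u[t + n - 1 - k] for k in range(n)]
        dense = rng.uniform(-dense_amp, dense_amp)
        sparse = rng.gauss(0.0, outlier_std) if t in outlier_idx else 0.0
        xs.append(x)
        ys.append(sum(a * b for a, b in zip(x, theta_true)) + dense + sparse)
    return xs, ys, sorted(outlier_idx)


# Hypothetical settings: third-order FIR system, 10% outliers.
theta_true = [0.5, -1.0, 0.25]
xs, ys, outlier_idx = generate_fir_data(
    theta_true, num_samples=200, dense_amp=0.1, outlier_fraction=0.1, outlier_std=10.0
)
print(len(xs), len(outlier_idx))
```

Returning the outlier indices alongside the data makes it easy to check afterwards which samples an estimator effectively down-weighted.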
4.1 Illustration of estimates
We generate data pairs and carry out a comparison between three estimators: on the one hand, the maximum Laplacian correntropy estimator (MCE-L) and the maximum Gaussian correntropy estimator (MCE-G) and on the other hand, the Least Absolute Deviation (LAD) estimator (which is also called the $\ell_1$ estimator). Recall that MCE-G and MCE-L involve nonconvex optimization. Here they are heuristically implemented as an iteratively reweighted least squares estimator and as a reweighted $\ell_1$ estimator respectively. The results are represented in Figure 1 in terms of average estimation error. What this suggests is that for fixed values of the design parameters and (see Eqs (18) and (21) for the roles of these parameters), LAD and MCE-L enjoy a similar performance for small amounts of noise. But as the noise level increases, LAD shows better stability than MCE-L. Note that overall MCE-G tends to perform best in the setting of this experiment as long as the magnitude of the dense noise is reasonable (SNR larger than dB). A possible justification for this is that squaring errors that contain outliers as in (21) cancels out their influence more forcefully than just taking their absolute value as in (18).
4.2 Estimation of richness measure
We provide a graphical representation of how the informativity measure may, for a given data matrix , evolve with respect to the dimensions of and the demanded degree of richness (see Figure 2). The estimated range for is based on Eq. (12). Here is formed from FIR-type regressors with an input sampled from a zero-mean and unit variance Gaussian distribution. Our experiments in this specific study tend to suggest that is a nondecreasing function of the ratio and a decreasing function of . Moreover, the estimated range (gray regions in Fig. 2) gets wider when is large.
4.3 Estimates of error bounds
The goal here is twofold: (i) illustrate the variation of the estimation error bounds with respect to the magnitude of the dense noise in the special cases (20) and (23); (ii) assess how conservative the derived theoretical error bounds may be with respect to the empirical errors.
Increasing rates of the bounds
If for each level of noise, we select the parameter such that the product is kept constant, then the error bounds corresponding to both MCE-L and MCE-G have a linear rate of change with respect to as depicted in Figure 3. The increasing rate of the bound corresponding to MCE-L is larger than that of MCE-G for the current setting. Note that the computation of bounds made here is not connected to the experiment of Section 4.1.
Comparing theoretical bounds and empirical errors
It might be instructive to see how far away the theoretical error bounds may be from the empirical values. To study this aspect, let us consider a numerical experiment with a similar data-generating process as described in the beginning of Section 4. The dense noise level is set to which gives an SNR of about dB and the proportion of outliers is set to (which is small enough to enforce condition (13)). One difficulty in evaluating the theoretical bounds is that this requires evaluating which, as already discussed in Section 3.1, is a hard problem. Hence, is replaced here with the mean value of the lower and upper estimates displayed in (12). We then let the number of data vary from to and plot the empirical errors along with the bounds from (20) and (23) in Figure 4.
It is fair to observe that the theoretical bounds are conservative in the sense that they are generally higher than the true empirical errors. Here the ratio between the bounds and the true errors is about . Conservativeness is indeed a common feature for these types of results due to the various inequalities employed for the derivation. Nevertheless, the main interest of Theorem 8 is that it provides a sufficient condition for the robustness of the maximum correntropy estimator, a condition that depends explicitly on the degree of informativity of the regression data and on the proportion of outliers. Moreover, by expressing error bounds which involve explicitly the design parameters, the theorem gives insights into how to tune those parameters with the aim to improve estimation performance.
A further remark one can make is that the general formula for the error bound in (16) has a kind of universal feature in the following sense: since the bound does not involve the magnitude of the true parameter vector (for an FIR-type system for example), it is in principle valid regardless of that magnitude. Hence the larger the norm of the parameter vector to be estimated, the smaller the relative error.
5 Conclusion
In this paper we have proposed an analysis of the robustness properties of a correntropy maximization framework for regression problems. The class of estimators considered is quite general and includes the Gaussian and Laplacian kernels as special cases. The contribution of the work consists in (i) deriving an appropriate notion of richness for the regression data; (ii) proving stability of the considered class of estimators under the derived richness condition when the data are subject to dense and sparse noise (outliers). Our main result states that if the regression data are rich enough and if the number of outliers is small in some sense, then the parametric estimation error is bounded. The results come with explicit bounds which, although not exactly computable, can be approximated by computable estimates.
Acknowledgements
The author is grateful to the Associate Editor and the anonymous reviewers for constructive feedback.
References
- [1] L. Bako. On a class of optimization-based robust estimators. IEEE Transactions on Automatic Control (to appear), 2017.
- [2] L. Bako and H. Ohlsson. Analysis of a nonsmooth optimization approach to robust estimation. Automatica, 66:132–145, 2016.
- [3] R. K. Boel, M. R. James, and I. R. Petersen. Robustness and risk-sensitive filtering. IEEE Transactions on Automatic Control, 47:451–461, 2002.
- [4] E. Candès and P. A. Randall. Highly robust error correction by convex programming. IEEE Transactions on Information Theory, 54:2829–2840, 2006.
- [5] B. Chen, X. Liu, H. Zhao, N. Zheng, and J. Principe. Insights into the robustness of minimum error entropy estimation. IEEE Transactions on Neural Networks and Learning Systems (to appear), 2016.
- [6] B. Chen, L. Xing, H. Zhao, B. Xu, and J. C. Principe. Robustness of maximum correntropy estimation against large outliers. https://arxiv.org/abs/1703.08065, 2017.
- [7] B. Chen, L. Xing, J. Liang, N. Zheng, and J. C. Principe. Steady-state mean-square error analysis for adaptive filtering under the maximum correntropy criterion. IEEE Signal Processing Letters, 21:880–884, 2014.
- [8] S. Dey and J. B. Moore. Risk-sensitive filtering and smoothing via reference probability methods. IEEE Transactions on Automatic Control, 42:1587–1591, 1997.
