Learning Rates for Kernel-Based Expectile Regression
Muhammad Farooq, Ingo Steinwart

TL;DR
This paper analyzes a support vector machine approach for estimating conditional expectiles, establishing minimax optimal learning rates with Gaussian RBF kernels, improving upon previous kernel regression results.
Contribution
It introduces a new analysis of kernel-based expectile regression with optimal learning rates, leveraging advanced entropy bounds and calibration inequalities.
Findings
Achieves minimax optimal learning rates for kernel expectile regression.
Improves existing rates for kernel-based least squares regression.
Provides new theoretical tools for analyzing asymmetric loss functions.
Abstract
Conditional expectiles are becoming an increasingly important tool in finance as well as in other areas of applications. We analyse a support vector machine type approach for estimating conditional expectiles and establish learning rates that are minimax optimal modulo a logarithmic factor if Gaussian RBF kernels are used and the desired expectile is smooth in a Besov sense. As a special case, our learning rates improve the best known rates for kernel-based least squares regression in this scenario. Key ingredients of our statistical analysis are a general calibration inequality for the asymmetric least squares loss, a corresponding variance bound as well as an improved entropy number bound for Gaussian RBF kernels.
Institute for Stochastics and Applications
Faculty 8: Mathematics and Physics
University of Stuttgart
D-70569 Stuttgart Germany
{muhammad.farooq, ingo.steinwart}@mathematik.uni-stuttgart.de
This research is supported by Higher Education Commission (HEC) Pakistan (PS/OS-I/Batch- 2012/Germany/2012/3449) and German Academic Exchange Service (DAAD) scholarship program/-ID50015451.
1 Introduction
Given i.i.d. samples $D := ((x_1,y_1),\ldots,(x_n,y_n))$ drawn from some unknown probability distribution $P$ on $X\times Y$, where $X$ is an arbitrary set and $Y\subset\mathbb{R}$, the goal of exploring the conditional distribution of $Y$ given $x\in X$ beyond the center of the distribution can be achieved by using both quantile and expectile regression. The well-known quantiles are obtained by minimizing the asymmetric least absolute deviation (ALAD) loss function proposed by [26], whereas expectiles are computed by minimizing the asymmetric least squares (ALS) loss function
$$L_\tau(y,t) := \begin{cases} (1-\tau)(y-t)^2, & \text{if } y<t,\\ \tau(y-t)^2, & \text{if } y\geq t, \end{cases} \tag{1}$$
for all $t\in\mathbb{R}$ and a fixed $\tau\in(0,1)$, see primarily [29] and also [19, 1] for further references. These expectiles have attracted considerable attention in recent years and have been applied successfully in many areas, for instance, in demography [31], in education [33] and extensively in finance [48, 23, 50, 25]. In fact, it has recently been shown (see, e.g. [6], [42]) that expectiles are the only risk measures that enjoy the properties of coherence and elicitability, see [21], and therefore they have been suggested as a potentially better alternative to both Value at Risk (VaR) and Expected Shortfall (ES), see e.g. [46, 53, 6]. For more applications of expectiles, we refer the interested reader to, e.g. [3, 34, 22].
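As a small illustration of the ALS loss, the following sketch evaluates the loss on a sample and finds the minimizer of the empirical risk over constants by brute force. The helper names `als_loss` and `empirical_expectile` are hypothetical and chosen only for this example; the grid search is for illustration, not an efficient solver.

```python
import numpy as np

def als_loss(y, t, tau):
    """Asymmetric least squares loss, evaluated elementwise."""
    r = np.asarray(y, dtype=float) - t
    # Weight (1 - tau) where y < t and tau where y >= t.
    w = np.where(r < 0, 1.0 - tau, tau)
    return w * r ** 2

def empirical_expectile(y, tau, grid_size=2001):
    """Grid minimizer of the empirical ALS risk over constant predictions."""
    y = np.asarray(y, dtype=float)
    grid = np.linspace(y.min(), y.max(), grid_size)
    risks = np.array([als_loss(y, t, tau).mean() for t in grid])
    return grid[np.argmin(risks)]

rng = np.random.default_rng(0)
y = rng.normal(size=500)
e_half = empirical_expectile(y, 0.5)   # tau = 1/2 recovers the sample mean
e_high = empirical_expectile(y, 0.9)   # larger tau moves the expectile upward
```

For $\tau = 1/2$ the loss is half the squared error, so the minimizer agrees with the sample mean up to the grid resolution.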
Both quantiles and expectiles are special cases of so-called asymmetric $M$-estimators (see [8]) and there exists a one-to-one mapping between them (see, e.g. [19], [1] and [52]). In general, however, expectiles do not coincide with quantiles. Hence, the choice between expectiles and quantiles mainly depends on the application at hand, as is the case in the duality between the mean and the median. For example, if the goal is to estimate the (conditional) threshold for which a $\tau$-fraction of (conditional) observations lie below that threshold, then $\tau$-quantile regression is the right choice. On the other hand, if one is interested in estimating the (conditional) threshold for which the average of below-threshold excess information (deviations of observations from the threshold) is $k$ times larger than that above the threshold, then $\tau$-expectile regression is a preferable choice with $\tau = \frac{k}{1+k}$, see [29, p. 823]. In other words, quantiles focus on the ordering of observations while expectiles account for the magnitude of the observations, which makes expectiles sensitive to the extreme values of the distribution, and this sensitivity thus plays a key role in computing the ES in finance. Since estimating expectiles is computationally more efficient than estimating quantiles, one can however use expectiles as a promising surrogate for quantiles in situations where one is only interested in exploring the conditional distribution.
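The threshold interpretation can be checked numerically. The sketch below computes a sample expectile with a weighted-mean fixed-point iteration (a standard device, not a method taken from this paper) and compares the ratio of total below-threshold to total above-threshold deviations with $k$, under the assumption that $\tau = k/(1+k)$. The function name `expectile` is hypothetical.

```python
import numpy as np

def expectile(y, tau, max_iter=200):
    """Sample tau-expectile via the fixed point t = sum(w * y) / sum(w)."""
    y = np.asarray(y, dtype=float)
    t = y.mean()
    for _ in range(max_iter):
        # Weights of the ALS loss at the current threshold t.
        w = np.where(y < t, 1.0 - tau, tau)
        t_new = np.sum(w * y) / np.sum(w)
        if abs(t_new - t) < 1e-14:
            t = t_new
            break
        t = t_new
    return t

k = 3.0
tau = k / (1.0 + k)                      # assumed relation between k and tau
rng = np.random.default_rng(1)
y = rng.exponential(size=2000)
t = expectile(y, tau)
below = np.sum(np.maximum(t - y, 0.0))   # total deviation below the threshold
above = np.sum(np.maximum(y - t, 0.0))   # total deviation above the threshold
ratio = below / above
```

At the fixed point the first-order condition $(1-\tau)\sum_i (t-y_i)_+ = \tau\sum_i (y_i-t)_+$ holds, so `ratio` matches $\tau/(1-\tau) = k$.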
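As already mentioned above, $\tau$-expectiles can be computed with the help of asymmetric risks
$$\mathcal{R}_{L_\tau,P}(f) := \int_{X\times Y} L_\tau(y,f(x))\,dP(x,y) = \int_X\int_Y L_\tau(y,f(x))\,dP(y|x)\,dP_X(x), \tag{2}$$
where $P$ is the data generating distribution on $X\times Y$ and $f: X\to\mathbb{R}$ is some predictor. To be more precise, there exists a $P_X$-almost surely unique function $f^*_{L_\tau,P}$ satisfying
$$\mathcal{R}_{L_\tau,P}(f^*_{L_\tau,P}) = \mathcal{R}^*_{L_\tau,P} := \inf\big\{\mathcal{R}_{L_\tau,P}(f) \;\big|\; f: X\to\mathbb{R} \text{ measurable}\big\},$$
and $f^*_{L_\tau,P}(x)$ equals the $\tau$-expectile of the conditional distribution $P(\cdot|x)$ for $P_X$-almost all $x\in X$.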
Some semiparametric and nonparametric methods for estimating conditional $\tau$-expectiles with the help of the empirical $L_\tau$-risk have already been proposed in the literature, see e.g. [32, 52, 51] for further details. Recently, [20] proposed another nonparametric estimation method that belongs to the family of so-called kernel-based regularized empirical risk minimization, which solves an optimization problem of the form
$$f_{D,\lambda} = \arg\min_{f\in H}\; \lambda\|f\|_H^2 + \mathcal{R}_{L_\tau,D}(f). \tag{3}$$
Here, $\lambda>0$ is a user specified regularization parameter, $H$ is a reproducing kernel Hilbert space (RKHS) over $X$ with reproducing kernel $k$ (see, e.g. [4] and [37, Chapter 4.2]) and $\mathcal{R}_{L_\tau,D}(f)$ denotes the empirical risk of $f$, that is
$$\mathcal{R}_{L_\tau,D}(f) = \frac{1}{n}\sum_{i=1}^n L_\tau(y_i, f(x_i)).$$
Since the ALS loss is convex, so is the optimization problem (3) and by [37, Lemma 5.1, Theorem 5.2] there always exists a unique $f_{D,\lambda}$ that satisfies (3). Moreover, the solution $f_{D,\lambda}$ is of the form
$$f_{D,\lambda} = \sum_{i=1}^n (\alpha_i^* - \beta_i^*)\, k(x_i,\cdot),$$
where $\alpha_i^*\geq 0$, $\beta_i^*\geq 0$ for all $i=1,\ldots,n$, see [20] for further details. Learning methods of the form (3) but with different loss functions have attracted many theoretical and algorithmic considerations, see for instance [49, 5, 9, 41, 17, 44] for least squares regression, [38, 17] for quantile regression and [24, 40] for classification with hinge loss. In addition, [20] recently proposed an algorithm for solving (3), which is now a part of [43], and compared its performance to ER-Boost, see [51], which is another algorithm minimizing an empirical $L_\tau$-risk. The main goal of this article is to complement the empirical findings of [20] with a detailed statistical analysis.
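To make the optimization problem (3) concrete, here is a minimal sketch that solves it for a Gaussian kernel by iteratively reweighted least squares. This is a swapped-in technique chosen for brevity and is not the solver of [20] or of the package [43]; the names `gaussian_gram` and `fit_kernel_expectile`, the kernel parametrization $\exp(-\|x-x'\|^2/\gamma^2)$, and all parameter values are assumptions made for the example.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix of the assumed kernel exp(-||x - x'||^2 / gamma^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / gamma ** 2)

def fit_kernel_expectile(X, y, tau, lam, gamma, max_iter=100):
    """IRLS sketch for min_f lam*||f||_H^2 + (1/n) sum_i L_tau(y_i, f(x_i)).

    Returns coefficients c with f = sum_i c_i k(x_i, .) and the Gram matrix.
    """
    n = len(y)
    K = gaussian_gram(X, X, gamma)
    w = np.full(n, 0.5)                  # start from symmetric weights
    c = np.zeros(n)
    for _ in range(max_iter):
        # For fixed weights W this is weighted kernel ridge regression,
        # whose stationarity condition reads (n*lam*I + W K) c = W y.
        c = np.linalg.solve(n * lam * np.eye(n) + w[:, None] * K, w * y)
        w_new = np.where(y < K @ c, 1.0 - tau, tau)
        if np.array_equal(w_new, w):     # weights stable: stationary point
            break
        w = w_new
    return c, K

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(80, 1))
y = np.sin(3.0 * X[:, 0]) + 0.3 * rng.normal(size=80)
c_lo, K = fit_kernel_expectile(X, y, tau=0.1, lam=1e-3, gamma=0.5)
c_hi, _ = fit_kernel_expectile(X, y, tau=0.9, lam=1e-3, gamma=0.5)
c_half, _ = fit_kernel_expectile(X, y, tau=0.5, lam=1e-3, gamma=0.5)
```

For $\tau = 1/2$ the weights never change, and the solution coincides with kernel ridge regression with regularization $2n\lambda$. The iteration is capped because finite termination is typical but not guaranteed in this simple form.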
A typical way to assess the quality of an estimator $f_D$ is to measure its distance to the target function $f^*_{L_\tau,P}$, e.g. in terms of $\|f_D - f^*_{L_\tau,P}\|_{L_2(P_X)}$. For estimators obtained by some empirical risk minimization scheme, however, one can hardly ever estimate this $L_2$-norm directly. Instead, the standard tools of statistical learning theory give bounds on the excess risk $\mathcal{R}_{L_\tau,P}(f_D) - \mathcal{R}^*_{L_\tau,P}$. Therefore, the first goal of this paper is to establish a so-called calibration inequality that relates both quantities. To be more precise, we will show in Theorem 3 that
$$\|f - f^*_{L_\tau,P}\|_{L_2(P_X)} \leq c_\tau^{-1/2}\sqrt{\mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}} \tag{4}$$
holds for all $f\in L_2(P_X)$ and some constant $c_\tau$ only depending on $\tau$. In particular, (4) provides rates for $\|f_D - f^*_{L_\tau,P}\|_{L_2(P_X)}$ as soon as we have established rates for $\mathcal{R}_{L_\tau,P}(f_D) - \mathcal{R}^*_{L_\tau,P}$. Furthermore, it is common knowledge in statistical learning theory that bounds on $\mathcal{R}_{L_\tau,P}(f_D) - \mathcal{R}^*_{L_\tau,P}$ can be improved if so-called variance bounds are available. We will see in Lemma 4 that (4) leads to an optimal variance bound for $L_\tau$ whenever $Y$ is bounded. Note that both (4) and the variance bound are independent of the considered expectile estimation method. In fact, both results are key ingredients for the statistical analysis of any expectile estimation method based on some form of empirical risk minimization.
As already indicated above, however, the main goal of this paper is to provide a statistical analysis of the SVM-type estimator $f_{D,\lambda}$ given by (3). Since $2L_{1/2}$ equals the least squares loss, any statistical analysis of (3) also provides results for SVMs using the least squares loss. The latter have already been extensively investigated in the literature. For example, learning rates for generic kernels can be found in [12, 13, 9, 41, 28] and the references therein. Among these articles, only [12, 41, 28] obtain learning rates in the minimax sense under some specific assumptions. For example, [12] assumes that the target function satisfies $f^*_{L_{1/2},P}\in H$, while [41, 28] establish optimal learning rates for the case in which $H$ does not contain the target function. In addition, [17] has recently established (essentially) asymptotically optimal learning rates for least squares SVMs using Gaussian RBF kernels under the assumption that the target function $f^*_{L_{1/2},P}$ is contained in some Sobolev or Besov space with smoothness index $\alpha$. A key ingredient of this work is to control the capacity of the RKHS $H_\gamma(X)$ for the Gaussian RBF kernel $k_\gamma$ on the closed unit Euclidean ball $X\subset\mathbb{R}^d$ by an entropy number bound
$$e_i(\mathrm{id}: H_\gamma(X)\to l_\infty(X)) \leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\, i^{-\frac{1}{p}},$$
see [37, Theorem 6.27], which holds for all $\gamma\in(0,1]$ and $p\in(0,1]$. Unfortunately, the constant $c_{p,d}(X)$ derived from [37, Theorem 6.27] depends on $p$ in an unknown manner. As a consequence, [17] were only able to show learning rates of the form
$$n^{-\frac{2\alpha}{2\alpha+d}+\xi}$$
for all $\xi>0$. To address this issue, we use [47, Lemma 4.5] to derive the following new entropy number bound
$$e_i(\mathrm{id}: H_\gamma(X)\to l_\infty(X)) \leq (3K)^{\frac{1}{p}}\Big(\frac{d+1}{ep}\Big)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\, i^{-\frac{1}{p}},$$
which holds for all $p\in(0,1]$ and $\gamma\in(0,1]$ and some constant $K$ only depending on $d$. In other words, we establish an upper bound for $c_{p,d}(X)$ whose dependence on $p$ is explicitly known. Using this new bound, we are then able to find improved learning rates of the form
$$(\log n)^{d+1}\, n^{-\frac{2\alpha}{2\alpha+d}}.$$
Clearly these new rates replace the nuisance factor $n^\xi$ of [17] by a logarithmic term, and up to this logarithmic factor our new rates are minimax optimal, see [17] for details. In addition, our new rates also hold for $\tau\neq 1/2$, that is, for general expectiles.
The rest of this paper is organized as follows. In Section 2, some properties of the ALS loss function are established including the self-calibration inequality and variance bound. Section 3 presents oracle inequalities and learning rates for (3) and Gaussian RBF kernels. The proofs of our results can be found in Section 4.
2 Properties of the ALS Loss Function: Self-Calibration and Variance Bounds
This section contains some properties of the ALS loss function, namely convexity, local Lipschitz continuity, a self-calibration inequality, a supremum bound and a variance bound. Throughout this section, we assume that $X$ is an arbitrary, non-empty set equipped with a $\sigma$-algebra, and $Y\subset\mathbb{R}$ denotes a closed non-empty set. In addition, we assume that $P$ is a probability distribution on $X\times Y$, $P(\cdot|x)$ is a regular conditional probability distribution on $Y$ given $x\in X$ and $Q$ is some distribution on $Y$. Furthermore, $L_\tau: Y\times\mathbb{R}\to[0,\infty)$ is the ALS loss defined by (1) and $f: X\to\mathbb{R}$ is a measurable function. It is trivial to prove that $L_\tau$ is convex in $t$, and this convexity ensures that the optimization problem (3) is efficiently solvable. Moreover, by [37, Lemma 2.13] convexity of $L_\tau$ implies convexity of the corresponding risks. In the following, we present the idea of clipping to restrict the prediction $t$ to the domain $Y=[-M,M]$ where $M>0$, see e.g. [37, Definition 2.22].
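Definition 1.
We say that a loss $L: Y\times\mathbb{R}\to[0,\infty)$ can be clipped at $M>0$, if, for all $(y,t)\in Y\times\mathbb{R}$, we have
$$L(y,\wideparen{t}\,) \leq L(y,t),$$
where $\wideparen{t}$ denotes the clipped value of $t$ at $\pm M$, that is
$$\wideparen{t} := \begin{cases} -M, & \text{if } t<-M,\\ t, & \text{if } t\in[-M,M],\\ M, & \text{if } t>M. \end{cases}$$
Moreover, we say that $L$ can be clipped if $L$ can be clipped at some $M>0$.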
Recall that this clipping assumption has already been utilized while establishing learning rates for SVMs, see for instance [10, 39, 40] for the hinge loss and [11, 38] for the pinball loss. It is trivial to show by convexity of $L_\tau$ together with [37, Lemma 2.23] that $L_\tau$ can be clipped at $M$ and has at least one global minimizer in $[-M,M]$. This also implies that $\mathcal{R}_{L_\tau,P}(\wideparen{f}\,)\leq\mathcal{R}_{L_\tau,P}(f)$ for every $f: X\to\mathbb{R}$. In other words, the clipping operation potentially reduces the risks. We therefore bound the risk $\mathcal{R}_{L_\tau,P}(\wideparen{f}_{D,\lambda})$ of the clipped decision function rather than the risk $\mathcal{R}_{L_\tau,P}(f_{D,\lambda})$, which we will see in detail in Section 3. From a practical point of view, this means that the training algorithm for (3) remains unchanged and the evaluation of the resulting decision function requires only a slight change. For further details on algorithmic advantages of clipping for SVMs using the hinge loss and the ALS loss, we refer the reader to [40] and [20] respectively. It is also observed in [37, 41, 17] that $\|\cdot\|_\infty$-bounds, see Section 3, can be made smaller by clipping the decision function for some loss functions.
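The effect of clipping can be illustrated on synthetic data. In the sketch below the responses lie in $[-M,M]$, the predictions are deliberately allowed to leave this interval, and the clipped predictions never incur a larger ALS loss. The data, the value of $M$ and the helper `als_loss` are assumptions made for the example.

```python
import numpy as np

def als_loss(y, t, tau):
    """Asymmetric least squares loss, evaluated elementwise."""
    r = y - t
    return np.where(r < 0, 1.0 - tau, tau) * r ** 2

M, tau = 1.0, 0.3
rng = np.random.default_rng(2)
y = rng.uniform(-M, M, size=1000)        # responses inside [-M, M]
f = 3.0 * rng.normal(size=1000)          # raw predictions, often outside [-M, M]
f_clipped = np.clip(f, -M, M)            # clipping at +/- M after training

loss_raw = als_loss(y, f, tau)
loss_clipped = als_loss(y, f_clipped, tau)
risk_raw, risk_clipped = loss_raw.mean(), loss_clipped.mean()
```

Only the evaluation step changes: the predictor is trained as usual and `np.clip` is applied to its outputs.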
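Let us further recall from [37, Definition 2.18] that a loss function $L$ is called locally Lipschitz continuous if for all $a\geq 0$ there exists a constant $c_a\geq 0$ such that
$$\sup_{y\in Y}|L(y,t)-L(y,t')| \leq c_a|t-t'|, \qquad t,t'\in[-a,a].$$
In the following we denote for a given $a\geq 0$ the smallest such constant by $|L|_{a,1}$. The following lemma, which we will need for our proofs, shows that the ALS loss is locally Lipschitz continuous.
Lemma 2.
Let $Y\subset[-M,M]$ and $t\in[-M,M]$, then the loss function $L_\tau: Y\times\mathbb{R}\to[0,\infty)$ is locally Lipschitz continuous with Lipschitz constant
$$|L_\tau|_{M,1} = 4\,C_\tau M,$$
where $C_\tau := \max\{\tau, 1-\tau\}$.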
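A quick numerical sanity check of local Lipschitz continuity is sketched below. It assumes a Lipschitz constant of the form $4\max\{\tau,1-\tau\}M$ on $[-M,M]$, which follows from bounding the derivative $2w\,|y-t|$ of the loss in $t$ by $2\max\{\tau,1-\tau\}\cdot 2M$; the sample sizes and parameter values are arbitrary.

```python
import numpy as np

def als_loss(y, t, tau):
    """Asymmetric least squares loss, evaluated elementwise."""
    r = y - t
    return np.where(r < 0, 1.0 - tau, tau) * r ** 2

M, tau = 2.0, 0.8
C_tau = max(tau, 1.0 - tau)
rng = np.random.default_rng(3)
y = rng.uniform(-M, M, size=100_000)
t1 = rng.uniform(-M, M, size=100_000)
t2 = rng.uniform(-M, M, size=100_000)

lhs = np.abs(als_loss(y, t1, tau) - als_loss(y, t2, tau))
rhs = 4.0 * C_tau * M * np.abs(t1 - t2)   # assumed Lipschitz bound
max_ratio = np.max(lhs / np.maximum(np.abs(t1 - t2), 1e-300))
```

The observed ratio `max_ratio` stays below $4C_\tau M$ and approaches it when $y$ and $t$ sit at opposite ends of $[-M,M]$.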
For later use note that $L_\tau$ being locally Lipschitz continuous implies that $L_\tau$ is also a Nemitski loss in the sense of [37, Definition 18], and by [37, Lemma 2.13 and 2.19], this further implies that the corresponding risk $\mathcal{R}_{L_\tau,P}(f)$ is convex and locally Lipschitz continuous.
Empirical methods for estimating expectiles using the $L_\tau$ loss typically lead to a function $f_D$ for which $\mathcal{R}_{L_\tau,P}(f_D)$ is close to $\mathcal{R}^*_{L_\tau,P}$ with high probability. The convexity of $L_\tau$ then ensures that $f_D$ approximates $f^*_{L_\tau,P}$ in a weak sense, namely in probability $P_X$, see [35, Remark 3.18]. However, no guarantee on the speed of this convergence can be given, even if we know the convergence rate of $\mathcal{R}_{L_\tau,P}(f_D)\to\mathcal{R}^*_{L_\tau,P}$. The following theorem addresses this issue by establishing a so-called calibration inequality for the excess $L_\tau$-risk.
Theorem 3.
Let $L_\tau$ be the ALS loss function defined by (1) and $P$ be the distribution on $X\times\mathbb{R}$. Moreover, assume that $f^*_{L_\tau,P}(x)$ is the conditional $\tau$-expectile for fixed $\tau\in(0,1)$. Then, for all measurable $f: X\to\mathbb{R}$, we have
$$C_\tau^{-1}\big(\mathcal{R}_{L_\tau,P}(f)-\mathcal{R}^*_{L_\tau,P}\big) \leq \|f-f^*_{L_\tau,P}\|^2_{L_2(P_X)} \leq c_\tau^{-1}\big(\mathcal{R}_{L_\tau,P}(f)-\mathcal{R}^*_{L_\tau,P}\big),$$
where $c_\tau := \min\{\tau,1-\tau\}$ and $C_\tau$ is defined in Lemma 2.
Note that the calibration inequality, that is, the right-hand inequality above, in particular ensures that $f_n\to f^*_{L_\tau,P}$ in $L_2(P_X)$ whenever $\mathcal{R}_{L_\tau,P}(f_n)\to\mathcal{R}^*_{L_\tau,P}$. In addition, the convergence rates can be directly translated. The inequality on the left shows that, modulo constants, the calibration inequality is sharp. We will use this left inequality when bounding the approximation error for Gaussian RBF kernels in the proof of Theorem 6.
At the end of this section, we present supremum and variance bounds for the $L_\tau$-loss. Like the calibration inequality of Theorem 3, these two bounds are useful for analyzing the statistical properties of any $L_\tau$-based empirical risk minimization scheme. In Section 3 we will illustrate this when establishing an oracle inequality for the SVM-type learning algorithm (3).
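Lemma 4.
Let $X$ be a non-empty set, $Y\subset[-M,M]$ be a closed subset where $M>0$, and $P$ be a distribution on $X\times Y$. Additionally, we assume that $L_\tau: Y\times\mathbb{R}\to[0,\infty)$ is the ALS loss and $f^*_{L_\tau,P}(x)$ is the conditional $\tau$-expectile for fixed $\tau\in(0,1)$. Then for all $f: X\to[-M,M]$ we have
- i) $\|L_\tau\circ f - L_\tau\circ f^*_{L_\tau,P}\|_\infty \leq 4\,C_\tau M^2$.
- ii) $\mathbb{E}_P\big(L_\tau\circ f - L_\tau\circ f^*_{L_\tau,P}\big)^2 \leq 16\,C_\tau^2 c_\tau^{-1} M^2\big(\mathcal{R}_{L_\tau,P}(f)-\mathcal{R}^*_{L_\tau,P}\big)$.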
3 Oracle Inequalities and Learning Rates
In this section, we first introduce some notions related to kernels. We assume that $k: X\times X\to\mathbb{R}$ is a measurable, symmetric and positive definite kernel with associated RKHS $H$. Additionally, we assume that $k$ is bounded, that is, $\|k\|_\infty := \sup_{x\in X}\sqrt{k(x,x)}\leq 1$, which implies that $H$ consists of bounded functions with $\|f\|_\infty\leq\|k\|_\infty\|f\|_H$ for all $f\in H$. In practice, we often consider SVMs that are equipped with the well-known Gaussian RBF kernels for input domain $X\subset\mathbb{R}^d$, see [40, 20]. Recall that the latter are defined by
$$k_\gamma(x,x') := \exp\big(-\gamma^{-2}\|x-x'\|_2^2\big),$$
where $\gamma>0$ is called the width parameter that is usually determined in a data dependent way, e.g. by cross validation. By [37, Corollary 4.58] the kernel $k_\gamma$ is universal on every compact set $X\subset\mathbb{R}^d$ and in particular strictly positive definite. In addition, the RKHS $H_\gamma$ of the kernel $k_\gamma$ is dense in $L_p(\mu)$ for all $p\in[1,\infty)$ and all distributions $\mu$ on $\mathbb{R}^d$, see [37, Proposition 4.60].
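The kernel properties used here (symmetry, positive definiteness, boundedness by one) can be inspected on a small sample. The sketch assumes the parametrization $k_\gamma(x,x') = \exp(-\|x-x'\|_2^2/\gamma^2)$ with width $\gamma$; the function name `gaussian_gram` and the sample are hypothetical.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / gamma^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    # Guard against tiny negative values caused by floating point cancellation.
    return np.exp(-np.maximum(sq, 0.0) / gamma ** 2)

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
K = gaussian_gram(X, X, gamma=0.7)
eigenvalues = np.linalg.eigvalsh((K + K.T) / 2.0)   # symmetric eigen-solver
```

Smaller values of `gamma` make the Gram matrix closer to the identity, which corresponds to a richer but harder to control hypothesis space.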
One requirement to establish learning rates is to control the capacity of RKHS . One way to do this is to estimate eigenvalues of a linear operator induced by kernel . To be more precise, given a kernel and a distribution on , we define the integral operator by
[TABLE]
for -almost all . In the following, we assume that . Recall [37, Theorem 4.27] that is compact, positive, self-adjoint and nuclear, and thus has at most countably many non-zero (and non-negative) eigenvalues . Ordering these eigenvalues (with geometric multiplicities) and extending the corresponding sequence by zeros, if there are only finitely many non-zero eigenvalues, we obtain the extended sequence of eigenvalues that satisfies [37, Theorem 7.29]. This summability implies that for some constant and , we have . By [41], this eigenvalues assumption can converge even faster to zero, that is, for , we have
[TABLE]
It turns out that the speed of convergence of influences learning rates for SVMs. For instance, [7] used (9) to establish learning rates for SVMs using hinge loss and [9, 28] for SVMs using least square loss.
Another way to control the capacity of RKHS is based on the concept of covering numbers or the inverse of covering numbers, namely, entropy numbers. To recall the latter, see [37, Definition A.5.26], let be a bounded, linear operator between the Banach spaces and , and be an integer. Then the -th (dyadic) entropy number of is defined by
[TABLE]
In the Hilbert space case, the eigenvalues and entropy number decay are closely related. For example, [36] showed that (9) is equivalent (modulo a constant only depending on ) to
[TABLE]
It is further shown in [36] that (10) implies a bound on average entropy numbers, that is, for empirical distribution associated to the data set , the average entropy number is
[TABLE]
which is used in [37, Theorem 7.24] to establish the general oracle inequality for SVMs. A bound of the form (10) was also established by [37, Theorem 6.27] for Gaussian RBF kernels and certain distributions $P_X$ having unbounded support. To be more precise, let $X\subset\mathbb{R}^d$ be a closed unit Euclidean ball. Then for all $\gamma\in(0,1]$ and $p\in(0,1]$, there exists a constant $c_{p,d}(X)$ such that
$$e_i(\mathrm{id}: H_\gamma(X)\to l_\infty(X)) \leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\, i^{-\frac{1}{p}}, \tag{11}$$
which has been used by [17] to establish learning rates for least squares SVMs. Note that the constant $c_{p,d}(X)$ depends on $p$ in an unknown manner. To address this issue, we use [47, Lemma 4.5] and derive an improved entropy number bound in the following theorem by establishing an upper bound for $c_{p,d}(X)$ whose dependence on $p$ is explicitly known. We will further see in Corollary 8 that this improved bound leads to better learning rates than the ones obtained by [17].
Theorem 5.
Let $X\subset\mathbb{R}^d$ be a closed Euclidean ball. Then there exists a constant $K>0$, such that, for all $p\in(0,1]$, $\gamma\in(0,1]$ and $i\geq 1$, we have
$$e_i(\mathrm{id}: H_\gamma(X)\to l_\infty(X)) \leq (3K)^{\frac{1}{p}}\Big(\frac{d+1}{ep}\Big)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\, i^{-\frac{1}{p}}. \tag{12}$$
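Another requirement for establishing learning rates is to bound the approximation error function considering the RKHS $H_\gamma$ for the Gaussian RBF kernel $k_\gamma$. If the distribution $P$ is such that $\mathcal{R}^*_{L_\tau,P}<\infty$, then the approximation error function $A_\gamma: [0,\infty)\to[0,\infty)$ is defined by
$$A_\gamma(\lambda) := \inf_{f\in H_\gamma}\; \lambda\|f\|^2_{H_\gamma} + \mathcal{R}_{L_\tau,P}(f) - \mathcal{R}^*_{L_\tau,P}.$$
For $\lambda>0$, the approximation error function $A_\gamma(\lambda)$ quantifies how well an infinite-sample $L_\tau$-SVM with RKHS $H_\gamma$, that is, $\lambda\|f\|^2_{H_\gamma}+\mathcal{R}_{L_\tau,P}(f)$, approximates the optimal risk $\mathcal{R}^*_{L_\tau,P}$. By [37, Lemma 5.15], one can show that $\lim_{\lambda\to 0}A_\gamma(\lambda)=0$ if $H_\gamma$ is dense in $L_2(P_X)$. In general, however, the speed of convergence cannot be faster than $O(\lambda)$, and this rate is achieved if and only if there exists an $f\in H_\gamma$ such that $\mathcal{R}_{L_\tau,P}(f)=\mathcal{R}^*_{L_\tau,P}$, see [37, Lemma 5.18].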
In order to bound , we first need to know one important feature of the target function , namely, the regularity which, roughly speaking, measures the smoothness of the target function. Different function spaces norms e.g. Hölder norms, Besov norms or Triebel-Lizorkin norms can be used to capture this regularity. In this work, following [17, 27], we assume that the target function is in a Sobolev or a Besov space. Recall [45, Definition 5.1] and [2, Definition 3.1 and 3.2] that for any integer , and a subset with non-empty interior, the Sobolev space of order is defined by
[TABLE]
with the norm
[TABLE]
where is the -th weak partial derivative for multi-index of modulus . In other words, the Sobolev space is the space of functions with sufficiently many derivatives and equipped with a norm that measures both the size and the regularity of the contained functions. Note that is a Banach space, see [45, Lemma 5.2]. Moreover, by [2, Theorem 3.6], is separable if , and is uniformly convex and reflexive if . Furthermore, for , is a separable Hilbert space that we denote by . Despite the underlined advantages, Sobolev spaces can not be immediately applied when is non-integral or when , however, the smoothness spaces for these extended parameters are also needed when engaging nonlinear approximation. This shortcoming of Sobolev spaces is covered by Besov spaces that bring together all functions for which the modulus of smoothness have a common behavior. Let us first recall [16, Section 2] and [15, Section 2] that for a subset with non-empty interior, a function with for all and , the modulus of smoothness of order of a function is defined by
[TABLE]
where the -th difference given by
[TABLE]
for , is used to measure the smoothness. Note that as , which means that the faster this convergence to 0 the smoother is . For more details on properties of the modulus of smoothness, we refer the reader to [30, Chapter 4.2]. Now for , , , the Besov space based on modulus of smoothness for domain , see for instance [14, Section 4.5], [30, Chapter 4.3] and [16, Section 2], is defined by
[TABLE]
where the semi-norm is given by
[TABLE]
and for , the semi-norm is defined by
[TABLE]
In other words, Besov spaces are collections of functions with common smoothness. For more general definition of Besov-like spaces, we refer to [27, Section 4.1]. Note that is the norm of , see e.g. [16, Section 2] and [15, Section 2]. Furthermore, for different values of give equivalent norms of , which remains true for , see [16, Section 2]. It is well known, see e.g [30, Section 4.1], that for all , , where for the Besov space is the same as the Sobolev space.
In the next step, we find a function such that both the regularization term and the excess risk are small. For this, we define the function , see [17], by
[TABLE]
for all , and . Additionally, we assume that there exists a function satisfies and . Then is defined by
[TABLE]
With these preparation, we now establish an upper bound for the approximate error function .
Theorem 6.
Let be the ALS loss defined by (1), be the probability distribution on , and be the marginal distribution of onto such that and . Moreover, assume that the conditional -expectile satisfies as well as for some . In addition, assume that is the Gaussian RBF kernel over with associated RKHS . Then for all and , we have
[TABLE]
where is a constant depending on and , and the constant .
Clearly, the upper bound of the approximation error function in Theorem 6 depends on the regularization parameter $\lambda$, the kernel width $\gamma$, and the smoothness parameter $\alpha$ of the target function $f^*_{L_\tau,P}$. Note that in order to shrink the right-hand side we need to let $\gamma\to 0$. However, this would let the first term go to infinity unless we simultaneously let $\lambda\to 0$ with a sufficient speed. Now using [37, Theorem 7.24] together with Lemma 4, Theorem 6 and the entropy number bound (12), we establish an oracle inequality of SVMs for $L_\tau$ in the following theorem.
Theorem 7.
Consider the assumptions of Theorem 6 and additionally assume that for . Then, for all and , the SVM using the RKHS and the ALS loss function satisfies
[TABLE]
with probability not less than . Here is some constant independent of and .
It is well known that there exists a relationship between Sobolev spaces and the scale of Besov spaces, that is, , whenever and , see for instance [18, p.25 and p.44]. In particular, for , we have with equivalent norms. In addition, by [17, p.7] we have . Thus, Theorem 7 also holds for decision functions with and .
By assuming some suitable values for and that depends on data size , the smoothness parameter , and the dimension , we obtain learning rates for learning problem (3) in the following corollary.
Corollary 8.
Under the assumptions of Theorem 7 and with
[TABLE]
where and are user specified constants, we have, for all and ,
[TABLE]
with probability not less than .
Note that learning rates in Corollary 8 depend on the choice of and , where the kernel width requires knowing which, in practice, is not available. However, [37, Chapter 7.4], [41], [17] and [38] showed that one can achieve the same learning rates adaptively, i.e. without knowing . Let us recall [37, Definition 6.28] that describes a method to select and , which in some sense is a simplification of the cross-validation method.
Definition 9.
Let be a RKHS over and and be the sequences of finite subsets . Given a data set , we define
[TABLE]
where and . Then use as a training set to compute the SVM decision function
[TABLE]
and use to determine by choosing such that
[TABLE]
Every learning method that produces the resulting decision functions is called a training validation SVM with respect to .
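The training-validation procedure can be sketched as follows: split the data, fit one decision function per candidate pair on the first part, clip it, and keep the pair with the smallest validation risk on the second part. The split point `len(y) // 2 + 1`, the candidate grids, the IRLS solver and all function names are assumptions made for this illustration and are not taken from the definition above or from the implementation in [43].

```python
import numpy as np

def gram(A, B, gamma):
    """Gaussian Gram matrix with entries exp(-||a_i - b_j||^2 / gamma^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / gamma ** 2)

def als_risk(y, f, tau):
    """Empirical ALS risk of predictions f."""
    r = y - f
    return np.mean(np.where(r < 0, 1.0 - tau, tau) * r ** 2)

def fit(X, y, tau, lam, gamma, max_iter=50):
    """IRLS sketch for the regularized empirical ALS risk minimizer."""
    n, K, w = len(y), gram(X, X, gamma), np.full(len(y), 0.5)
    for _ in range(max_iter):
        c = np.linalg.solve(n * lam * np.eye(n) + w[:, None] * K, w * y)
        w_new = np.where(y < K @ c, 1.0 - tau, tau)
        if np.array_equal(w_new, w):
            break
        w = w_new
    return c

def tv_svm(X, y, tau, lambdas, gammas, M):
    """Pick (lambda, gamma) by validation risk of the clipped predictor."""
    m = len(y) // 2 + 1                      # assumed split point
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    best = None
    for lam in lambdas:
        for gamma in gammas:
            c = fit(X1, y1, tau, lam, gamma)                 # train on D_1
            pred = np.clip(gram(X2, X1, gamma) @ c, -M, M)   # clip at +/- M
            risk = als_risk(y2, pred, tau)                   # validate on D_2
            if best is None or risk < best["risk"]:
                best = {"risk": risk, "lam": lam, "gamma": gamma, "coef": c}
    return best

rng = np.random.default_rng(5)
X = rng.uniform(-1.0, 1.0, size=(120, 1))
y = np.clip(np.sin(3.0 * X[:, 0]) + 0.2 * rng.normal(size=120), -2.0, 2.0)
lambdas, gammas = [1e-4, 1e-3, 1e-2, 1e-1], [0.1, 0.3, 1.0]
best = tv_svm(X, y, tau=0.7, lambdas=lambdas, gammas=gammas, M=2.0)
```

The returned dictionary holds the selected pair and the coefficients trained on the first half only; refitting on the full sample is a common practical variant but is not part of this sketch.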
In the next Theorem, we use this training-validation SVM (TV-SVM) approach for suitable candidate sets and with , and establish learning rates similar to (16).
Theorem 10.
With the assumptions of Theorem 7, let and be the sequences of finite subsets such that is an -net of and is an -net of with polynomially growing cardinalities and in . Then for all , the TV-SVM produce that satisfies
[TABLE]
where is a constant independent of and .
So far we have only considered the case of bounded noise with known bounds, that is, where is known. In practice, is usually unknown and in this situation, one can still achieve the same learning rates by simply increasing slowly. However, more interesting is the case of unbounded noise. In the following we treat this case for distributions for which there exist constants and such that
[TABLE]
for all . In other words, the tails of the response variable decay sufficiently fast. It is shown in [17] by examples that such an assumption is realistic. For instance, if , the assumption (17) is satisfied for , see [17, Example 3.7], and for the case where has the density whose tails decay like , the assumption (17) holds for , see [17, Example 3.8].
With this additional assumption, we present learning rates for the case of unbounded noise in the following theorem.
Theorem 11.
Let and be a probability distribution on such that . Moreover, assume that the -expectile satisfies for -almost all , and both and for some . In addition, assume that (17) holds for all . We define
[TABLE]
where and are user-specified constants. Moreover, for some fixed and we define and . Furthermore, we consider the SVM that clips decision function at after training. Then there exists a independent of , and such that
[TABLE]
holds with probability not less than .
Note that the assumption (17) on the tail of the distribution does not influence learning rates achieved in the Corollary 8. Furthermore, we can also achieve same rates adaptively using TV-SVM approach considered in Theorem 10 provided that we have upper bound of the unknown parameter , which depends on the distribution , see [17] where this dependency is explained with some examples.
Let us now compare our results with the oracle inequalities and learning rates established by [17] for least squares SVMs. This comparison is justifiable because a) the least squares loss is a special case of the $L_\tau$-loss for $\tau=1/2$, b) the target function $f^*_{L_\tau,P}$ is assumed to be in a Sobolev or Besov space, similar to [17], and c) the supremum and the variance bounds for $L_\tau$ with $\tau=1/2$ are the same as the ones used by [17]. Furthermore, recall that [17] used the entropy number bound (11) to control the capacity of the RKHS $H_\gamma$, which contains a constant $c_{p,d}(X)$ depending on $p$ in an unknown manner. As a result, they obtained a leading constant in their oracle inequality, see [17, Theorem 3.1], for which no upper bound can be determined explicitly. We cope with this problem by establishing the improved entropy number bound (12), which not only provides an upper bound for $c_{p,d}(X)$ but also helps to determine the value of the constant in the oracle inequality (15) explicitly. As a consequence we can improve their learning rates of the form $n^{-\frac{2\alpha}{2\alpha+d}+\xi}$, where $\xi>0$, by
$$(\log n)^{d+1}\, n^{-\frac{2\alpha}{2\alpha+d}}.$$
In other words, the nuisance parameter $n^\xi$ from [17] is replaced by the logarithmic term $(\log n)^{d+1}$. Moreover, our learning rates, up to this logarithmic term, are minimax optimal, see e.g. the discussion in [17]. Finally note that unlike [17] we have not only established learning rates for the least squares case $\tau=1/2$ but actually for all $\tau\in(0,1)$.
4 Proofs
4.1 Proofs of Section 2
Proof of Lemma 2. We define by
[TABLE]
Clearly, is convex and thus [37, Lemma A.6.5] shows that is locally Lipschitz continuous. Moreover, we have
[TABLE]
where . A simple consideration shows that this estimate is also sharp. ∎
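In order to prove Theorem 3, recall that the risk in (2) uses the regular conditional probability $P(\cdot|x)$, which enables us to compute $\mathcal{R}_{L_\tau,P}(f)$ by treating the inner and the outer integrals separately. Following [37, Definition 3.3, Definition 3.4], we therefore use inner $L_\tau$-risks as a key ingredient for establishing self-calibration inequalities.
Definition 12.
Let $L_\tau: Y\times\mathbb{R}\to[0,\infty)$ be the ALS loss function defined by (1) and $Q$ be a distribution on $Y$. Then the inner $L_\tau$-risks of $Q$ are defined by
$$\mathcal{C}_{L_\tau,Q}(t) := \int_Y L_\tau(y,t)\,dQ(y), \qquad t\in\mathbb{R},$$
and the minimal inner $L_\tau$-risk is
$$\mathcal{C}^*_{L_\tau,Q} := \inf_{t\in\mathbb{R}}\mathcal{C}_{L_\tau,Q}(t).$$
In the latter definition, the inner risks $\mathcal{C}_{L_\tau,Q}(\cdot)$ for suitable classes of distributions $Q$ on $Y$ are considered as a template for $\mathcal{C}_{L_\tau,P(\cdot|x)}(\cdot)$. From this, we immediately obtain the risk of a function $f$, i.e. $\mathcal{R}_{L_\tau,P}(f) = \int_X\mathcal{C}_{L_\tau,P(\cdot|x)}(f(x))\,dP_X(x)$. Moreover, by [37, Lemma 3.4], the optimal risk $\mathcal{R}^*_{L_\tau,P}$ can be obtained by minimizing the inner $L_\tau$-risks, i.e. $\mathcal{R}^*_{L_\tau,P} = \int_X\mathcal{C}^*_{L_\tau,P(\cdot|x)}\,dP_X(x)$. Consequently, the excess $L_\tau$-risk, when $\mathcal{R}^*_{L_\tau,P}<\infty$, is obtained by
$$\mathcal{R}_{L_\tau,P}(f)-\mathcal{R}^*_{L_\tau,P} = \int_X \mathcal{C}_{L_\tau,P(\cdot|x)}(f(x)) - \mathcal{C}^*_{L_\tau,P(\cdot|x)}\,dP_X(x).$$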
Besides some technical advantages, this approach makes the analysis rather independent of the specific distribution . In the following theorem, we use this approach and establish the lower and the upper bound of excess inner -risks.
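Theorem 13.
Let $L_\tau$ be the ALS loss function defined by (1) and $Q$ be a distribution on $\mathbb{R}$ with $\mathcal{C}^*_{L_\tau,Q}<\infty$. For a fixed $\tau\in(0,1)$ and for all $t\in\mathbb{R}$, we have
$$c_\tau(t-t^*)^2 \leq \mathcal{C}_{L_\tau,Q}(t)-\mathcal{C}^*_{L_\tau,Q} \leq C_\tau(t-t^*)^2,$$
where $t^*$ denotes the $\tau$-expectile of $Q$, $c_\tau := \min\{\tau,1-\tau\}$ and $C_\tau$ is defined in Lemma 2.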
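The two-sided bound on the excess inner risk can be checked on an empirical distribution. The sketch below takes $Q$ to be the empirical measure of a sample, computes its expectile $t^*$ by a weighted-mean fixed point, and compares the excess inner risk with $c_\tau(t-t^*)^2$ and $C_\tau(t-t^*)^2$, assuming $c_\tau=\min\{\tau,1-\tau\}$ and $C_\tau=\max\{\tau,1-\tau\}$. All names and parameter values are hypothetical.

```python
import numpy as np

def inner_risk(y, t, tau):
    """Inner ALS risk of the empirical distribution of y at the point t."""
    r = y - t
    return np.mean(np.where(r < 0, 1.0 - tau, tau) * r ** 2)

def expectile(y, tau, max_iter=200):
    """Sample tau-expectile via the fixed point t = sum(w * y) / sum(w)."""
    t = y.mean()
    for _ in range(max_iter):
        w = np.where(y < t, 1.0 - tau, tau)
        t_new = np.sum(w * y) / np.sum(w)
        if abs(t_new - t) < 1e-14:
            return t_new
        t = t_new
    return t

tau = 0.85
c_tau, C_tau = min(tau, 1.0 - tau), max(tau, 1.0 - tau)
rng = np.random.default_rng(6)
y = rng.standard_t(df=5, size=3000)
t_star = expectile(y, tau)
ts = np.linspace(t_star - 3.0, t_star + 3.0, 121)
excess = np.array([inner_risk(y, t, tau) for t in ts]) - inner_risk(y, t_star, tau)
lower = c_tau * (ts - t_star) ** 2
upper = C_tau * (ts - t_star) ** 2
```

The gap between `lower` and `upper` closes as $\tau\to 1/2$, where the inner risk is exactly quadratic in $t$.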
Proof of Theorem 13. Let us fix . Then for a distribution on satisfies , the -expectile , according to [29], is the only solution of
[TABLE]
Let us now compute the excess inner risks of with respect to . To this end, we fix a . Then we have
[TABLE]
and
[TABLE]
By Definition 12 and using (22), we obtain
[TABLE]
and this leads to the following excess inner -risk
[TABLE]
Let us define , then (23) leads to the following lower bound of excess inner -risk when :
[TABLE]
Likewise, the excess inner -risk when is
[TABLE]
that also leads to the lower bound (24). Now, for the proof of upper bound of the excess inner -risks, we define . Then (23) leads to the following upper bound of excess inner -risks when :
[TABLE]
Analogously, for the case of , (25) also leads to the upper bound (26) for excess inner -risks. ∎
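Proof of Theorem 3. For a fixed $x\in X$, we write $t := f(x)$ and $t^* := f^*_{L_\tau,P}(x)$. By Theorem 13, for $Q := P(\cdot|x)$, we then immediately obtain
$$C_\tau^{-1}\big(\mathcal{C}_{L_\tau,P(\cdot|x)}(f(x))-\mathcal{C}^*_{L_\tau,P(\cdot|x)}\big) \leq |f(x)-f^*_{L_\tau,P}(x)|^2 \leq c_\tau^{-1}\big(\mathcal{C}_{L_\tau,P(\cdot|x)}(f(x))-\mathcal{C}^*_{L_\tau,P(\cdot|x)}\big).$$
Integrating with respect to $P_X$ leads to the assertion. ∎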
Proof of Lemma 4. i) Since can be clipped at and the conditional -expectile satisfies almost surely. Then
[TABLE]
for all and all .
ii) Using the locally Lipschitz continuity of the loss and Theorem 3, we obtain
[TABLE]
∎
4.2 Proofs of Section 3
Proof of Theorem 5. By [47, Lemma 4.5], the -log covering numbers of unit ball of the Gaussian RKHS for all and satisfy
[TABLE]
where is a constant depending only on . From this, we conclude that
[TABLE]
Let . In order to obtain the optimal value of , we differentiate it with respect to
[TABLE]
and set which gives
[TABLE]
By plugging into , we obtain
[TABLE]
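The optimisation step can be reproduced numerically. The sketch below assumes that the function being maximised has the form $h(\varepsilon)=\varepsilon^{p}\,(\ln(1/\varepsilon))^{d+1}$ on $(0,1)$; this form is inferred from the factor $\big(\frac{d+1}{ep}\big)^{d+1}$ appearing in the constants of Theorem 5 and is an assumption, as are the values of `d` and `p`. Under this assumption, calculus gives the maximiser $\varepsilon^*=e^{-(d+1)/p}$ and the maximum $\big(\frac{d+1}{ep}\big)^{d+1}$.

```python
import math


def h(eps, p, m):
    """Assumed objective: eps^p * (ln(1/eps))^m with m = d + 1."""
    return eps ** p * math.log(1.0 / eps) ** m


d, p = 2, 0.25          # illustrative values
m = d + 1

eps_star = math.exp(-m / p)             # stationary point from h'(eps) = 0
h_star = (m / (math.e * p)) ** m        # closed-form value h(eps_star)

# Logarithmic grid eps = exp(-u), u in (0, 50], which contains u = m / p.
grid = [math.exp(-0.01 * k) for k in range(1, 5001)]
grid_max = max(h(eps, p, m) for eps in grid)
```

A logarithmic grid is used because the maximiser is extremely close to zero for small `p`, where a uniform grid on $(0,1)$ would miss it.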
and consequently, the -log covering numbers (27) satisfy
[TABLE]
where . Now, by the inverse implication of [37, Lemma 6.21], see also [37, Exercise 6.8], the bound on the entropy numbers of the Gaussian RBF kernel is
[TABLE]
for all , . ∎
Proof of Theorem 6. The assumption and [17, Theorem 2.3] immediately yield that , i.e. is contained in the RKHS . Furthermore, [17, Theorem 2.3] leads to the following upper bound of the regularization term
[TABLE]
In the next step, we bound the excess risk. By [17, Theorem 2.2], the upper bound for the -distance between and is
[TABLE]
where , see [17, p. 27], is a constant depending only on and is the Lebesgue density. Now using Theorem 13 together with (28), we obtain
[TABLE]
where . With these results, we finally obtain
[TABLE]
where . ∎
In order to prove the main oracle inequality given in Theorem 7, we need the following lemma.
Lemma 14.
The function defined by
[TABLE]
is convex. Moreover, we have .
Proof.
By considering the linear transformation , it suffices to show that the function defined by
[TABLE]
is convex. To this end, we first compute the first and second derivatives of with respect to , that is:
[TABLE]
and
[TABLE]
Since , it is not hard to see that all terms in are strictly positive. Thus and hence is convex. Furthermore, by convexity of , it is easy to find that
[TABLE]
∎
Proof of Theorem 7. The assumption and [17, Theorem 2.3] yield that
[TABLE]
holds for all . This implies that, for all , we have
[TABLE]
and hence we conclude that . Now, by plugging the result of Theorem 6 together with $a=(3K)^{\frac{1}{2p}}\Big(\frac{d+1}{ep}\Big)^{\frac{d+1}{2p}}$ from Theorem 5 and from Lemma 4 into [37, Theorem 7.23], we obtain
[TABLE]
where and are from Theorem 6, is a constant from [37, Theorem 7.23] that depends on , , and $C_{d}:=3K\Big(\frac{d+1}{e}\Big)^{d+1}$ is a constant depending only on . Let us assume that . Since and , (4.2) becomes
[TABLE]
We now consider the constant in more detail. To this end, by using the Lipschitz constant from Lemma 2 and the supremum bound from Lemma 4, the value of is, see [37, Theorem 7.23]:
[TABLE]
where the constants and are derived in the proof of [37, Theorem 7.16], that is
[TABLE]
and by [37, Lemma 7.15], we have
[TABLE]
Here we are interested in bounding for . For this, we first need to bound the constants and . We start with and obtain the following bound for .
[TABLE]
where we used $\Big(\frac{1-p}{p}\Big)^{p}=\Big(\frac{1}{p}-1\Big)^{p}\leq e$ for all , and Lemma 14. Now the bound for is the following:
[TABLE]
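The elementary inequality $\big(\frac{1}{p}-1\big)^{p}\le e$ used above can be checked on a grid. This is a sketch only, assuming that $p$ ranges over $(0,1)$; the grid resolution is an arbitrary choice.

```python
import math

# Grid over the open interval (0, 1); the range of p is an assumption.
ps = [k / 1000 for k in range(1, 1000)]
values = [((1.0 - p) / p) ** p for p in ps]
worst = max(values)

# The inequality holds on the grid if the largest value does not exceed e.
holds_on_grid = worst <= math.e
```

On this grid the largest value stays well below $e$, so the bound is not tight; it is used only to keep the constants simple.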
Analogously, the bound for the constant is:
[TABLE]
By plugging and into (4.2), we thus obtain
[TABLE]
and by plugging this result into (31), we obtain
[TABLE]
where is a constant independent of and . ∎
Proof of Corollary 8. For all , Theorem 7 yields
[TABLE]
with probability not less than and a constant . Using the sequences and , we obtain
[TABLE]
where the positive constant is independent of . ∎
Before we can prove Theorem 10, we need the following technical lemma.
Lemma 15.
Let , be a constant, be a finite set such that there exists a with . Moreover, assume that and is a finite -net of . Then for and we have
[TABLE]
where is a constant independent of .
Proof.
Let us assume that and , and for all and for all . We thus obtain
[TABLE]
where . It is not hard to see that the function is optimal at , where is a constant depending only on and . Furthermore, with , we see that for all . In addition, there exists an index such that . Consequently, we have . Using this result in (4.2), we obtain
[TABLE]
where is a constant. ∎
Proof of Theorem 10. The proof of this theorem is a literal repetition of the proof of [17, Theorem 3.6]; however, we present it here for the sake of completeness. Let us define , then for all , Theorem 7 yields
[TABLE]
with probability not less than . Now define and , then by using [37, Theorem 7.2] and Lemma 15, we obtain
[TABLE]
with probability not less than . ∎
Proof of Theorem 11. By (17), we obtain
[TABLE]
This implies that
[TABLE]
This leads us to conclude with probability not less than that the SVM for ALS loss with belatedly clipped decision function at is actually a clipped regularized empirical risk minimization (CR-ERM) in the sense of [37, Definition 7.18]. Consequently, [37, Theorem 7.20] holds for modulo a set of probability not less than . From Theorem 7, we then obtain
[TABLE]
with probability not less than . As in the proof of Corollary 8 and by using the inequality , for and , we finally obtain
[TABLE]
for all with probability not less than . Choosing leads to the assertion. ∎
References
- [1] B. Abdous and B. Remillard. Relating quantiles and expectiles under weighted-symmetry. Ann. Inst. Statist. Math., 47:371–384, 1995. http://dx.doi.org/10.1007/bf00773468
- [2] R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Academic Press, New York, 2nd edition, 2003. https://doi.org/10.1016/s0079-8169(03)x8001-0
- [3] Y. Aragon, S. Casanova, R. Chambers, and E. Leconte. Conditional ordering using nonparametric expectiles. J. Off. Stat., 21:617–633, 2005. http://www.jos.nu/Articles/abstract.asp?article=214617
- [4] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
- [5] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23:52–72, 2007. https://doi.org/10.1016/j.jco.2006.07.001
- [6] F. Bellini, B. Klar, A. Müller, and R. E. Gianin. Generalized quantiles as risk measures. Insurance Math. Econom., 54:41–48, 2014. http://dx.doi.org/10.1016/j.insmatheco.2013.10.015
- [7] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Ann. Statist., pages 489–531, 2008. https://doi.org/10.1214/009053607000000839
- [8] J. Breckling and R. Chambers. M-quantiles. Biometrika, 75:761–771, 1988. http://dx.doi.org/10.2307/2336317
