Learning Rates for Kernel-Based Expectile Regression

Muhammad Farooq; Ingo Steinwart

arXiv:1702.07552·stat.ML·February 28, 2017


TL;DR

This paper analyzes a support vector machine approach for estimating conditional expectiles and establishes learning rates that are minimax optimal up to a logarithmic factor when Gaussian RBF kernels are used, improving on the best known rates for kernel-based least squares regression.

Contribution

It provides a statistical analysis of kernel-based expectile regression whose key ingredients are a calibration inequality for the asymmetric least squares loss, a corresponding variance bound, and an improved entropy number bound for Gaussian RBF kernels.

Findings

01 Achieves learning rates for kernel expectile regression that are minimax optimal up to a logarithmic factor.

02 Improves the best known rates for kernel-based least squares regression.

03 Provides new theoretical tools for analyzing asymmetric loss functions.

Abstract

Conditional expectiles are becoming an increasingly important tool in finance as well as in other areas of applications. We analyse a support vector machine type approach for estimating conditional expectiles and establish learning rates that are minimax optimal modulo a logarithmic factor if Gaussian RBF kernels are used and the desired expectile is smooth in a Besov sense. As a special case, our learning rates improve the best known rates for kernel-based least squares regression in this scenario. Key ingredients of our statistical analysis are a general calibration inequality for the asymmetric least squares loss, a corresponding variance bound as well as an improved entropy number bound for Gaussian RBF kernels.

Equations

$$L_{\tau}(y,t)=\begin{cases}(1-\tau)(y-t)^{2}, & \text{if } y<t,\\ \tau(y-t)^{2}, & \text{if } y\geqslant t,\end{cases}$$

$$\mathcal{R}_{L_{\tau},P}(f):=\int_{X\times Y}L_{\tau}(y,f(x))\,dP(x,y)=\int_{X}\int_{Y}L_{\tau}(y,f(x))\,dP(y|x)\,dP_{X}(x),$$

$$\mathcal{R}_{L_{\tau},P}(f_{L_{\tau},P}^{\star})=\mathcal{R}_{L_{\tau},P}^{\star}:=\inf\{\mathcal{R}_{L_{\tau},P}(f)\;|\;f:X\to\mathbb{R}\ \text{measurable}\},$$

$$f_{D,\lambda}=\arg\min_{f\in H}\ \lambda\|f\|_{H}^{2}+\mathcal{R}_{L_{\tau},D}(f).$$

$$\mathcal{R}_{L_{\tau},D}(f)=\frac{1}{n}\sum_{i=1}^{n}L_{\tau}(y_{i},f(x_{i})).$$

$$f_{D,\lambda}:=\sum_{i=1}^{n}(\alpha_{i}^{*}-\beta_{i}^{*})\,k(x_{i},\cdot),$$

$$\|f_{D}-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}\leq c_{\tau}\sqrt{\mathcal{R}_{L_{\tau},P}(f_{D})-\mathcal{R}_{L_{\tau},P}^{\star}},$$

$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$

$$n^{-\frac{2\alpha}{2\alpha+d}+\xi}$$

$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq(3K)^{\frac{1}{p}}\left(\frac{d+1}{ep}\right)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$

$$(\log n)^{d+1}\,n^{-\frac{2\alpha}{2\alpha+d}}.$$

$$L(y,\wideparen{t})\leq L(y,t),$$

$$\wideparen{t}:=\begin{cases}-M, & \text{if } t<-M,\\ t, & \text{if } t\in[-M,M],\\ M, & \text{if } t>M.\end{cases}$$

$$\sup_{y\in Y}|L(y,t)-L(y,t^{\prime})|\leq c_{a}\,|t-t^{\prime}|,\qquad t,t^{\prime}\in[-a,a].$$

$$|L_{\tau}|_{1,M}=C_{\tau}\,4M,$$

$$C_{\tau}^{-1/2}\,(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star})^{1/2}\leq\|f-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}\leq c_{\tau}^{-1/2}\,(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star})^{1/2},$$

$$k_{\gamma}(x,x^{\prime}):=\exp\left(-\gamma^{-2}\|x-x^{\prime}\|_{2}^{2}\right),$$

$$T_{k}f(\cdot):=\int_{X}k(x,\cdot)f(x)\,d\mu(x)$$

$$\lambda_{i}(T_{k})\leq a\,i^{-\frac{1}{p}},\qquad i\geq 1.$$

$$e_{i}(T):=\inf\Bigl\{\epsilon>0:\exists\,x_{1},\ldots,x_{2^{i-1}}\ \text{such that}\ TB_{E}\subset\bigcup_{j=1}^{2^{i-1}}(x_{j}+\epsilon B_{F})\Bigr\}.$$

$$e_{i}(\mathrm{id}:H\to L_{2}(P_{X}))\leq a\,i^{-\frac{1}{2p}},\qquad i\geq 1,$$

$$\mathbb{E}_{D_{X}\sim P_{X}^{n}}\,e_{i}(\mathrm{id}:H\to L_{2}(\mathrm{D}_{X}))\leq a\,i^{-\frac{1}{2p}},\qquad i\geq 1,$$

$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$

$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq(3K)^{\frac{1}{p}}\left(\frac{d+1}{ep}\right)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}}$$

$$\mathcal{A}(\lambda):=\inf_{f\in H_{\gamma}}\lambda\|f\|_{H_{\gamma}}^{2}+\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star}.$$

$$W_{p}^{k}(\Omega):=\bigl\{f\in L_{p}(\Omega):D^{(\alpha)}f\in L_{p}(\Omega)\ \text{exists for all}\ \alpha\in\mathbb{N}_{0}^{d}\ \text{with}\ |\alpha|\leq k\bigr\},$$

$$\|f\|_{W_{p}^{k}(\Omega)}:=\begin{cases}\Bigl(\sum_{|\alpha|\leq k}\|D^{(\alpha)}f\|_{L_{p}(\Omega)}^{p}\Bigr)^{\frac{1}{p}}, & \text{if } p\in[1,\infty),\\ \max_{|\alpha|\leq k}\|D^{(\alpha)}f\|_{L_{\infty}(\Omega)}, & \text{if } p=\infty,\end{cases}$$

$$w_{s,L_{p}(\Omega)}(f,t)=\sup_{\|h\|_{2}\leq t}\|\triangle_{h}^{s}(f,\cdot)\|_{L_{p}(\Omega)},\qquad t\geq 0,$$

$$\triangle_{h}^{s}(f,x,\Omega):=\begin{cases}\sum_{i=0}^{s}\binom{s}{i}(-1)^{s-i}f(x+ih), & \text{if } x,x+h,\ldots,x+sh\in\Omega,\\ 0, & \text{otherwise},\end{cases}$$

$$B_{p,q}^{\alpha}(\Omega):=\bigl\{f\in L_{p}(\Omega):|f|_{B_{p,q}^{\alpha}(\Omega)}<\infty\bigr\},$$


Taxonomy

Topics: Statistical Methods and Inference · Mathematical Approximation and Integration · Gaussian Processes and Bayesian Inference

Full text

Learning Rates for Kernel-Based Expectile Regression

Muhammad Farooq and Ingo Steinwart

Institute for Stochastics and Applications

Faculty 8: Mathematics and Physics

University of Stuttgart

D-70569 Stuttgart Germany

{muhammad.farooq, ingo.steinwart}@mathematik.uni-stuttgart.de

This research is supported by Higher Education Commission (HEC) Pakistan (PS/OS-I/Batch- 2012/Germany/2012/3449) and German Academic Exchange Service (DAAD) scholarship program/-ID50015451.


1 Introduction

Given i.i.d. samples $D:=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))$ drawn from some unknown probability distribution $\mathrm{P}$ on $X\times Y$, where $X$ is an arbitrary set and $Y\subset\mathbb{R}$, the goal of exploring the conditional distribution of $Y$ given $x\in X$ beyond its center can be pursued with both quantile and expectile regression. The well-known quantiles are obtained by minimizing the asymmetric least absolute deviation (ALAD) loss function proposed by [26], whereas expectiles are computed by minimizing the asymmetric least squares (ALS) loss function

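$$L_{\tau}(y,t)=\begin{cases}(1-\tau)(y-t)^{2}, & \text{if } y<t,\\ \tau(y-t)^{2}, & \text{if } y\geqslant t,\end{cases}\tag{1}$$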

for all $t\in\mathbb{R}$ and a fixed $\tau\in(0,1)$, see primarily [29] and also [19, 1] for further references. These expectiles have attracted considerable attention in recent years and have been applied successfully in many areas, for instance, in demography [31], in education [33] and extensively in finance [48, 23, 50, 25]. In fact, it has recently been shown (see, e.g. [6], [42]) that expectiles are the only risk measures that enjoy the properties of coherence and elicitability, see [21], and therefore they have been suggested as a potentially better alternative to both Value at Risk (VaR) and Expected Shortfall (ES), see e.g. [46, 53, 6]. For further applications of expectiles, we refer the interested reader to, e.g. [3, 34, 22].
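As a small illustration of the ALS loss, which equals $(1-\tau)(y-t)^{2}$ for $y<t$ and $\tau(y-t)^{2}$ otherwise, the following Python sketch evaluates the loss and computes the unconditional $\tau$-expectile of a sample by minimizing the empirical $L_{\tau}$-risk over constants. The function names `als_loss` and `sample_expectile` are illustrative and do not come from the paper, and the fixed-point iteration is one common way to solve the first-order condition, not necessarily the one used in the cited works.

```python
import numpy as np

def als_loss(y, t, tau):
    """Asymmetric least squares loss L_tau(y, t)."""
    r = np.asarray(y, dtype=float) - t
    # weight (1 - tau) where y < t, weight tau where y >= t
    w = np.where(r < 0, 1.0 - tau, tau)
    return w * r ** 2

def sample_expectile(y, tau, n_iter=100, tol=1e-12):
    """Minimize the empirical L_tau-risk over constants t (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    t = y.mean()
    for _ in range(n_iter):
        w = np.where(y < t, 1.0 - tau, tau)
        # the weighted mean solves the first-order condition for fixed weights
        t_new = np.sum(w * y) / np.sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t
```

For $\tau=1/2$ the weights are all equal, so `sample_expectile` returns the sample mean, in line with the fact that $2L_{1/2}$ is the least squares loss.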

Both quantiles and expectiles are special cases of so-called asymmetric $M$-estimators (see [8]) and there exists a one-to-one mapping between them (see, e.g. [19], [1] and [52]). In general, however, expectiles do not coincide with quantiles. Hence, the choice between expectiles and quantiles mainly depends on the application at hand, as is the case for the choice between the mean and the median. For example, if the goal is to estimate the (conditional) threshold below which a $\tau$-fraction of the (conditional) observations lies, then $\tau$-quantile regression is the right choice. On the other hand, if one is interested in estimating the (conditional) threshold for which the average of the below-threshold deviations (deviations of observations from the threshold) is $k$ times larger than that of the above-threshold deviations, then $\tau$-expectile regression is the preferable choice with $\tau=\frac{k}{1+k}$, since the first-order condition of the $L_{\tau}$-risk gives $\frac{\tau}{1-\tau}=k$, see [29, p. 823]. In other words, quantiles focus on the ordering of the observations while expectiles account for their magnitude, which makes expectiles sensitive to the extreme values of the distribution, and this sensitivity plays a key role in computing the ES in finance. Since estimating expectiles is computationally more efficient than estimating quantiles, one can also use expectiles as a promising surrogate for quantiles in situations where one is only interested in exploring the conditional distribution.

As already mentioned above, $\tau$ -expectiles can be computed with the help of asymmetric risks

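$$\mathcal{R}_{L_{\tau},P}(f):=\int_{X\times Y}L_{\tau}(y,f(x))\,dP(x,y)=\int_{X}\int_{Y}L_{\tau}(y,f(x))\,dP(y|x)\,dP_{X}(x),$$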

where $P$ is the data generating distribution on $X\times Y$ and $f:X\to\mathbb{R}$ is some predictor. To be more precise, there exists a $P_{X}$ -almost surely unique function $f_{L_{\tau},P}^{\star}$ satisfying

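$$\mathcal{R}_{L_{\tau},P}(f_{L_{\tau},P}^{\star})=\mathcal{R}_{L_{\tau},P}^{\star}:=\inf\{\mathcal{R}_{L_{\tau},P}(f)\;|\;f:X\to\mathbb{R}\ \text{measurable}\},$$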

and $f_{L_{\tau},P}^{\star}(x)$ equals the $\tau$-expectile of the conditional distribution $P(\cdot|x)$ for $P_{X}$-almost all $x\in X$.

Some semiparametric and nonparametric methods for estimating conditional $\tau$-expectiles with the help of the empirical $L_{\tau}$-risk have already been proposed in the literature, see e.g. [32, 52, 51] for further details. Recently, [20] proposed another nonparametric estimation method that belongs to the family of so-called kernel-based regularized empirical risk minimization methods and solves an optimization problem of the form

$$f_{D,\lambda}=\arg\min_{f\in H}\ \lambda\|f\|_{H}^{2}+\mathcal{R}_{L_{\tau},D}(f).\tag{3}$$

Here, $\lambda>0$ is a user specified regularization parameter, $H$ is a reproducing kernel Hilbert space (RKHS) over $X$ with reproducing kernel $k$ (see, e.g. [4] and [37, Chapter 4.2]) and $\mathcal{R}_{L_{\tau},D}(f)$ denotes the empirical risk of $f$ , that is

$$\mathcal{R}_{L_{\tau},D}(f)=\frac{1}{n}\sum_{i=1}^{n}L_{\tau}(y_{i},f(x_{i})).$$

Since the ALS loss $L_{\tau}$ is convex, so is the optimization problem (3), and by [37, Lemma 5.1, Theorem 5.2] there always exists a unique $f_{D,\lambda}$ that satisfies (3). Moreover, the solution $f_{D,\lambda}$ is of the form

$$f_{D,\lambda}:=\sum_{i=1}^{n}(\alpha_{i}^{*}-\beta_{i}^{*})\,k(x_{i},\cdot),$$

where $\alpha_{i}^{*}\geq 0,\beta_{i}^{*}\geq 0$ for all $i=1,\ldots,n$, see [20] for further details. Learning methods of the form (3) but with different loss functions have attracted many theoretical and algorithmic considerations, see for instance [49, 5, 9, 41, 17, 44] for least squares regression, [38, 17] for quantile regression and [24, 40] for classification with the hinge loss. In addition, [20] recently proposed an algorithm for solving (3), which is now a part of [43], and compared its performance to ER-Boost, see [51], which is another algorithm minimizing an empirical $L_{\tau}$-risk. The main goal of this article is to complement the empirical findings of [20] with a detailed statistical analysis.
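To make the regularized problem (3) concrete, here is a minimal NumPy sketch that minimizes $\lambda\|f\|_{H}^{2}+\frac{1}{n}\sum_{i}L_{\tau}(y_{i},f(x_{i}))$ over functions $f=\sum_{j}c_{j}k(x_{j},\cdot)$ with a Gaussian RBF kernel. It uses iteratively reweighted least squares, which is an assumption of this sketch and not the dual solver of [20]; the function names and the plain coefficient vector `c` (in place of $\alpha^{*}-\beta^{*}$) are likewise illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix with entries exp(-||a - b||^2 / gamma^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / gamma**2)

def fit_kernel_expectile(X, y, tau, lam, gamma, n_iter=50):
    """Sketch: solve the regularized ALS problem by reweighted least squares."""
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    w = np.full(n, 0.5)          # start from the symmetric (least squares) weights
    c = np.zeros(n)
    for _ in range(n_iter):
        # for fixed weights W, the stationarity condition is (W K + n*lam*I) c = W y
        c = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)
        # weight (1 - tau) where y_i < f(x_i), weight tau otherwise
        w_new = np.where(y < K @ c, 1.0 - tau, tau)
        if np.array_equal(w_new, w):
            break
        w = w_new
    return c

def predict(X_train, c, X_new, gamma):
    """Evaluate f = sum_j c_j k(x_j, .) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, gamma) @ c
```

For $\tau=1/2$ the weights never change, so the loop stops after one solve and the result coincides with kernel ridge regression with regularization $2n\lambda$.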

A typical way to assess the quality of an estimator $f_{D}$ is to measure its distance to the target function $f_{L_{\tau},P}^{\star}$, e.g. in terms of $\|f_{D}-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}$. For estimators obtained by some empirical risk minimization scheme, however, one can hardly ever estimate this $L_{2}$-norm directly. Instead, the standard tools of statistical learning theory give bounds on the excess risk $\mathcal{R}_{L_{\tau},P}(f_{D})-\mathcal{R}_{L_{\tau},P}^{\star}$. Therefore, the first goal of this paper is to establish a so-called calibration inequality that relates both quantities. To be more precise, we will show in Theorem 3 that

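$$\|f_{D}-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}\leq c_{\tau}\sqrt{\mathcal{R}_{L_{\tau},P}(f_{D})-\mathcal{R}_{L_{\tau},P}^{\star}}\tag{4}$$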

holds for all $f_{D}\in L_{2}(P_{X})$ and some constant $c_{\tau}$ only depending on $\tau$ . In particular, (4) provides rates for $\|f_{D}-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}$ as soon as we have established rates for $\mathcal{R}_{L_{\tau},P}(f_{D})-\mathcal{R}_{L_{\tau},P}^{\star}$ . Furthermore, it is common knowledge in statistical learning theory that bounds on $\mathcal{R}_{L_{\tau},P}(f_{D})-\mathcal{R}_{L_{\tau},P}^{\star}$ can be improved if so-called variance bounds are available. We will see in Lemma 4 that (4) leads to an optimal variance bound for $L_{\tau}$ whenever $Y$ is bounded. Note that both (4) and the variance bound are independent of the considered expectile estimation method. In fact, both results are key ingredients for the statistical analysis of any expectile estimation method based on some form of empirical risk minimization.

As already indicated above, however, the main goal of this paper is to provide a statistical analysis of the SVM-type estimator $f_{D,\lambda}$ given by (3). Since $2L_{1/2}$ equals the least squares loss, any statistical analysis of (3) also provides results for SVMs using the least squares loss. The latter have already been extensively investigated in the literature. For example, learning rates for generic kernels can be found in [12, 13, 9, 41, 28] and the references therein. Among these articles, only [12, 41, 28] obtain learning rates that are optimal in a minimax sense under some specific assumptions. For example, [12] assumes that the target function satisfies $f_{L_{1/2},P}^{\star}\in H$, while [41, 28] establish optimal learning rates for the case in which $H$ does not contain the target function. In addition, [17] has recently established (essentially) asymptotically optimal learning rates for least squares SVMs using Gaussian RBF kernels under the assumption that the target function $f_{L_{1/2},P}^{\star}$ is contained in some Sobolev or Besov space $B_{2,\infty}^{\alpha}$ with smoothness index $\alpha$. A key ingredient of that work is to control the capacity of the RKHS $H_{\gamma}(X)$ of the Gaussian RBF kernel $k_{\gamma}$ on the closed unit Euclidean ball $X\subset\mathbb{R}^{d}$ by an entropy number bound

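$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$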

see [37, Theorem 6.27], which holds for all $\gamma\in(0,1]$ and $p\in(0,1]$ . Unfortunately, the constant $c_{p,d}(X)$ derived from [37, Theorem 6.27] depends on $p$ in an unknown manner. As a consequence, [17] were only able to show learning rates of the form

$$n^{-\frac{2\alpha}{2\alpha+d}+\xi}$$

for all $\xi>0$ . To address this issue, we use [47, Lemma 4.5] to derive the following new entropy number bound

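$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq(3K)^{\frac{1}{p}}\left(\frac{d+1}{ep}\right)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$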

which holds for all $p\in(0,1]$ and $\gamma\in(0,1]$ and some constant $K$ only depending on $d$ . In other words, we establish an upper bound for $c_{p,d}(X)$ whose dependence on $p$ is explicitly known. Using this new bound, we are then able to find improved learning rates of the form

$$(\log n)^{d+1}\,n^{-\frac{2\alpha}{2\alpha+d}}.$$

Clearly these new rates replace the nuisance factor $n^{\xi}$ of [17] by some logarithmic term, and up to this logarithmic factor our new rates are minimax optimal, see [17] for details. In addition, our new rates also hold for $\tau\neq 1/2$ , that is for general expectiles.
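To compare the two rate expressions $n^{-\frac{2\alpha}{2\alpha+d}+\xi}$ and $(\log n)^{d+1}n^{-\frac{2\alpha}{2\alpha+d}}$ numerically, the following Python sketch evaluates their logarithms. It ignores all multiplicative constants, which are not specified here, so it only illustrates the asymptotic comparison; the function names are illustrative.

```python
import math

def log_rate_with_xi(n, alpha, d, xi):
    """Logarithm of n^(-2*alpha/(2*alpha+d) + xi), constants ignored."""
    return (-2.0 * alpha / (2.0 * alpha + d) + xi) * math.log(n)

def log_rate_with_log_factor(n, alpha, d):
    """Logarithm of (log n)^(d+1) * n^(-2*alpha/(2*alpha+d)), constants ignored."""
    return (d + 1) * math.log(math.log(n)) - 2.0 * alpha / (2.0 * alpha + d) * math.log(n)
```

Since $(\log n)^{d+1}/n^{\xi}\to 0$ for every fixed $\xi>0$, the expression with the logarithmic factor is eventually the smaller one, although for small $\xi$ the crossover may only occur at very large $n$.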

The rest of this paper is organized as follows. In Section 2, some properties of the ALS loss function are established including the self-calibration inequality and variance bound. Section 3 presents oracle inequalities and learning rates for (3) and Gaussian RBF kernels. The proofs of our results can be found in Section 4.

2 Properties of the ALS Loss Function: Self-Calibration and Variance Bounds

This section contains some properties of the ALS loss function, namely convexity, local Lipschitz continuity, a self-calibration inequality, a supremum bound and a variance bound. Throughout this section, we assume that $X$ is an arbitrary, non-empty set equipped with a $\sigma$-algebra, and $Y\subset\mathbb{R}$ denotes a closed non-empty set. In addition, we assume that $\mathrm{P}$ is a probability distribution on $X\times Y$, $P(\cdot|x)$ is a regular conditional probability distribution on $Y$ given $x\in X$ and $Q$ is some distribution on $Y$. Furthermore, $L_{\tau}:Y\times\mathbb{R}\to[0,\infty)$ is the ALS loss defined by (1) and $f:X\to\mathbb{R}$ is a measurable function. It is trivial to prove that $L_{\tau}$ is convex in $t$, and this convexity ensures that the optimization problem (3) is efficiently solvable. Moreover, by [37, Lemma 2.13] convexity of $L_{\tau}$ implies convexity of the corresponding risks. In the following, we present the idea of clipping to restrict the prediction $t$ to the domain $Y=[-M,M]$ where $M>0$, see e.g. [37, Definition 2.22].

Definition 1.

We say that a loss $L:Y\times\mathbb{R}\to[0,\infty)$ can be clipped at $M>0$ , if, for all $(y,t)\in Y\times\mathbb{R}$ , we have

$$L(y,\wideparen{t})\leq L(y,t),$$

where $\wideparen{t}$ denotes the clipped value of $t$ at $\pm M$ , that is

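$$\wideparen{t}:=\begin{cases}-M, & \text{if } t<-M,\\ t, & \text{if } t\in[-M,M],\\ M, & \text{if } t>M.\end{cases}$$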

Moreover, we say that $L$ can be clipped if it can be clipped at some $M>0$.

Recall that this clipping assumption has already been utilized while establishing learning rates for SVMs, see for instance [10, 39, 40] for the hinge loss and [11, 38] for the pinball loss. It is trivial to show by convexity of $L_{\tau}$ together with [37, Lemma 2.23] that $L_{\tau}$ can be clipped at $M$ and has at least one global minimizer in $[-M,M]$. This also implies that $\mathcal{R}_{L_{\tau},P}(\wideparen{f})\leq\mathcal{R}_{L_{\tau},P}(f)$ for every $f:X\to\mathbb{R}$. In other words, the clipping operation potentially reduces the risks. We therefore bound the risk $\mathcal{R}_{L_{\tau},P}(\wideparen{f}_{D,\lambda})$ of the clipped decision function rather than the risk $\mathcal{R}_{L_{\tau},P}(f_{D,\lambda})$, as we will see in detail in Section 3. From a practical point of view, this means that the training algorithm for (3) remains unchanged and the evaluation of the resulting decision function requires only a slight change. For further details on algorithmic advantages of clipping for SVMs using the hinge loss and the ALS loss, we refer the reader to [40] and [20], respectively. It is also observed in [37, 41, 17] that $\|\cdot\|_{\infty}$-bounds, see Section 3, can be made smaller by clipping the decision function for some loss functions.
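The clipping operation is simple to apply at prediction time. The Python sketch below clips predictions to $[-M,M]$ and evaluates the ALS loss; it assumes labels in $[-M,M]$, and the function names are illustrative rather than taken from [20] or [43].

```python
import numpy as np

def clip_prediction(t, M):
    """Clipped value of t at +-M: -M if t < -M, t if |t| <= M, M if t > M."""
    return np.clip(t, -M, M)

def als_loss(y, t, tau):
    """Asymmetric least squares loss L_tau(y, t)."""
    r = np.asarray(y, dtype=float) - t
    return np.where(r < 0, 1.0 - tau, tau) * r ** 2
```

For every $y\in[-M,M]$, clipping moves the prediction no further away from $y$ and does not change the side on which it lies, so `als_loss(y, clip_prediction(t, M), tau)` never exceeds `als_loss(y, t, tau)`.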

Let us further recall from [37, Definition 2.18] that a loss function is called locally Lipschitz continuous if for all $a\geq 0$ there exists a constant $c_{a}$ such that

$$\sup_{y\in Y}|L(y,t)-L(y,t^{\prime})|\leq c_{a}\,|t-t^{\prime}|,\qquad t,t^{\prime}\in[-a,a].$$

In the following we denote for a given $a\geq 0$ the smallest such constant $c_{a}$ by $|L|_{1,a}$ . The following lemma, which we will need for our proofs, shows that the ALS loss is locally Lipschitz continuous.

Lemma 2.

Let $Y\subset[-M,M]$ and $t\in Y$, then the loss function $L_{\tau}:Y\times\mathbb{R}\to[0,\infty)$ is locally Lipschitz continuous with Lipschitz constant

$$|L_{\tau}|_{1,M}=C_{\tau}\,4M,$$

where $C_{\tau}:=\max\{\tau,1-\tau\}$ .

For later use note that $L_{\tau}$ being locally Lipschitz continuous implies that $L_{\tau}$ is also a Nemitski loss in the sense of [37, Definition 18], and by [37, Lemma 2.13 and 2.19], this further implies that the corresponding risk $\mathcal{R}_{L_{\tau},P}(f)$ is convex and locally Lipschitz continuous.
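One way to see where the constant $4\,C_{\tau}M$ in Lemma 2 comes from is to bound the derivative of the loss in $t$. The following is a sketch under the assumption $y,t\in[-M,M]$; it is not necessarily the argument used in the proofs of Section 4.

```latex
\[
\Bigl|\frac{\partial}{\partial t}L_{\tau}(y,t)\Bigr|
 = 2\,|y-t|\cdot
   \begin{cases}
     1-\tau, & \text{if } y<t,\\
     \tau,   & \text{if } y\geq t,
   \end{cases}
 \;\leq\; 2\,C_{\tau}\bigl(|y|+|t|\bigr)
 \;\leq\; 4\,C_{\tau}\,M .
\]
```

Since $t\mapsto L_{\tau}(y,t)$ is continuously differentiable (both branches have derivative $0$ at $t=y$), the mean value theorem gives $|L_{\tau}(y,t)-L_{\tau}(y,t^{\prime})|\leq 4\,C_{\tau}M\,|t-t^{\prime}|$ for all $t,t^{\prime}\in[-M,M]$. The bound on the derivative is attained at $(y,t)=(M,-M)$ or $(y,t)=(-M,M)$, depending on whether $\tau$ or $1-\tau$ is larger.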

Empirical methods for estimating expectiles using the $L_{\tau}$ loss typically lead to a function $f_{D}$ for which $\mathcal{R}_{L_{\tau},P}(f_{D})$ is close to $\mathcal{R}_{L_{\tau},P}^{\star}$ with high probability. The convexity of $L_{\tau}$ then ensures that $f_{D}$ approximates $f_{L_{\tau},P}^{\star}$ in a weak sense, namely in probability $P_{X}$, see [35, Remark 3.18]. However, no guarantee on the speed of this convergence can be given, even if we know the convergence rate of $\mathcal{R}_{L_{\tau},P}(f_{D})\to\mathcal{R}_{L_{\tau},P}^{\star}$. The following theorem addresses this issue by establishing a so-called calibration inequality for the excess $L_{\tau}$-risk.

Theorem 3.

Let $L_{\tau}$ be the ALS loss function defined by (1) and $\mathrm{P}$ be the distribution on $\mathbb{R}$ . Moreover, assume that $f_{L_{\tau},P}^{\star}(x)<\infty$ is the conditional $\tau$ -expectile for fixed $\tau\in(0,1)$ . Then, for all $f:X\to\mathbb{R}$ , we have

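$$C_{\tau}^{-1/2}\,(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star})^{1/2}\leq\|f-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}\leq c_{\tau}^{-1/2}\,(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star})^{1/2},$$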

where $c_{\tau}:=\min\{\tau,1-\tau\}$ and $C_{\tau}$ is defined in Lemma 2.

Note that the calibration inequality, that is, the right-hand side of the inequality above, in particular ensures that $f_{D}\to f_{L_{\tau},P}^{\star}$ in $L_{2}(P_{X})$ whenever $\mathcal{R}_{L_{\tau},P}(f_{D})\to\mathcal{R}_{L_{\tau},P}^{\star}$. In addition, the convergence rates can be directly translated. The inequality on the left shows that, modulo constants, the calibration inequality is sharp. We will use this left inequality when bounding the approximation error for Gaussian RBF kernels in the proof of Theorem 6.
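The two-sided inequality of Theorem 3 can be checked numerically in the simplest setting, where $X$ is a single point, predictors are constants $t$, and $P$ is the empirical distribution of a sample. Squaring the inequality, it then reads $c_{\tau}(t-t^{\star})^{2}\leq\mathcal{R}(t)-\mathcal{R}(t^{\star})\leq C_{\tau}(t-t^{\star})^{2}$. The Python sketch below sets this up; the function names and the fixed-point iteration are assumptions of this illustration.

```python
import numpy as np

def empirical_risk(y, t, tau):
    """Empirical L_tau-risk of the constant predictor t."""
    r = y - t
    return np.mean(np.where(r < 0, 1.0 - tau, tau) * r ** 2)

def empirical_expectile(y, tau, n_iter=200):
    """Minimizer of the empirical L_tau-risk over constants (fixed-point iteration)."""
    t = y.mean()
    for _ in range(n_iter):
        w = np.where(y < t, 1.0 - tau, tau)
        t = np.sum(w * y) / np.sum(w)
    return t

def calibration_bounds(y, t, tau):
    """Return (lower bound, excess risk, upper bound) for the constant predictor t."""
    t_star = empirical_expectile(y, tau)
    excess = empirical_risk(y, t, tau) - empirical_risk(y, t_star, tau)
    gap = (t - t_star) ** 2
    return min(tau, 1.0 - tau) * gap, excess, max(tau, 1.0 - tau) * gap
```

Calling `calibration_bounds(y, t, tau)` for a sample `y` and several values of `t` returns triples whose middle entry should lie between the outer two, up to rounding error.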

At the end of this section, we present supremum and variance bounds of the $L_{\tau}$ -loss. Like the calibration inequality of Theorem 3 these two bounds are useful for analyzing the statistical properties of any $L_{\tau}$ -based empirical risk minimization scheme. In Section 3 we will illustrate this when establishing an oracle inequality for the SVM-type learning algorithm (3).

Lemma 4.

Let $X\subset\mathbb{R}^{d}$ be a non-empty set, $Y\subset[-M,M]$ be a closed subset where $M>0$, and $\mathrm{P}$ be a distribution on $X\times Y$. Additionally, we assume that $L_{\tau}:Y\times\mathbb{R}\to[0,\infty)$ is the ALS loss and $f_{L_{\tau},P}^{\star}(x)$ is the conditional $\tau$-expectile for fixed $\tau\in(0,1)$. Then for all $f:X\to[-M,M]$ we have

i) $\|L_{\tau}\circ f-L_{\tau}\circ f_{L_{\tau},P}^{\star}\|_{\infty}\leq 4\,C_{\tau}\,M^{2}\,.$

ii) $\mathbb{E}_{P}(L_{\tau}\circ f-L_{\tau}\circ f_{L_{\tau},P}^{\star})^{2}\leq 16\,C_{\tau}^{2}\,c_{\tau}^{-1}\,M^{2}(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star})\,.$
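The constants in Lemma 4 can be traced back to Lemma 2 and Theorem 3. The following is a sketch of one possible argument, assuming that $f$ and $f_{L_{\tau},P}^{\star}$ both take values in $[-M,M]$; the authors' proof in Section 4 may proceed differently.

```latex
% (i): for y, t in [-M, M] the loss is non-negative and bounded,
\[
0\;\leq\;L_{\tau}(y,t)\;\leq\;C_{\tau}\,(y-t)^{2}\;\leq\;4\,C_{\tau}\,M^{2},
\]
% so the difference of two such values is at most 4 C_tau M^2 in absolute value.

% (ii): the Lipschitz constant 4 C_tau M of Lemma 2 gives, pointwise,
\[
\bigl|L_{\tau}(y,f(x))-L_{\tau}(y,f_{L_{\tau},P}^{\star}(x))\bigr|
\;\leq\;4\,C_{\tau}\,M\,\bigl|f(x)-f_{L_{\tau},P}^{\star}(x)\bigr| ,
\]
% and squaring, integrating and applying the right-hand inequality of Theorem 3 yields
\[
\mathbb{E}_{P}\bigl(L_{\tau}\circ f-L_{\tau}\circ f_{L_{\tau},P}^{\star}\bigr)^{2}
\;\leq\;16\,C_{\tau}^{2}\,M^{2}\,\|f-f_{L_{\tau},P}^{\star}\|_{L_{2}(P_{X})}^{2}
\;\leq\;16\,C_{\tau}^{2}\,c_{\tau}^{-1}\,M^{2}\,
\bigl(\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star}\bigr).
\]
```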

3 Oracle Inequalities and Learning Rates

In this section, we first introduce some notions related to kernels. We assume that $k:X\times X\to\mathbb{R}$ is a measurable, symmetric and positive definite kernel with associated RKHS $H$. Additionally, we assume that $k$ is bounded, that is, $\|k\|_{\infty}:=\sup_{x\in X}\sqrt{k(x,x)}\leq 1$, which implies that $H$ consists of bounded functions with $\|f\|_{\infty}\leq\|k\|_{\infty}\|f\|_{H}$ for all $f\in H$. In practice, we often consider SVMs that are equipped with the well-known Gaussian RBF kernels for input domains $X\subset\mathbb{R}^{d}$, see [40, 20]. Recall that the latter are defined by

$$k_{\gamma}(x,x^{\prime}):=\exp\left(-\gamma^{-2}\|x-x^{\prime}\|_{2}^{2}\right),$$

where $\gamma$ is called the width parameter, which is usually determined in a data-dependent way, e.g. by cross validation. By [37, Corollary 4.58] the kernel $k_{\gamma}$ is universal on every compact set $X\subset\mathbb{R}^{d}$ and in particular strictly positive definite. In addition, the RKHS $H_{\gamma}$ of the kernel $k_{\gamma}$ is dense in $L_{p}(\mu)$ for all $p\in[1,\infty)$ and all distributions $\mu$ on $X$, see [37, Proposition 4.60].

One requirement to establish learning rates is to control the capacity of RKHS $H$ . One way to do this is to estimate eigenvalues of a linear operator induced by kernel $k$ . To be more precise, given a kernel $k$ and a distribution $\mu$ on $X$ , we define the integral operator $T_{k}:L_{2}(\mu)\to L_{2}(\mu)$ by

$$T_{k}f(\cdot):=\int_{X}k(x,\cdot)f(x)\,d\mu(x)$$

for $\mu$-almost all $x\in X$. In the following, we assume that $\mu=\mathrm{P}_{X}$. Recall from [37, Theorem 4.27] that $T_{k}$ is compact, positive, self-adjoint and nuclear, and thus has at most countably many non-zero (and non-negative) eigenvalues $\lambda_{i}(T_{k})$. Ordering these eigenvalues (with geometric multiplicities) and extending the corresponding sequence by zeros, if there are only finitely many non-zero eigenvalues, we obtain the extended sequence of eigenvalues $(\lambda_{i}(T_{k}))_{i\geq 1}$ that satisfies $\sum_{i=1}^{\infty}\lambda_{i}(T_{k})<\infty$ [37, Theorem 7.29]. This summability implies that for some constant $a>1$ and all $i\geq 1$, we have $\lambda_{i}(T_{k})\leq ai^{-1}$. Following [41], one may assume that the eigenvalues converge to zero even faster, that is, for some $p\in(0,1)$, we have

$$\lambda_{i}(T_{k})\leq a\,i^{-\frac{1}{p}},\qquad i\geq 1.\tag{9}$$

It turns out that the speed of convergence of $\lambda_{i}(T_{k})$ influences learning rates for SVMs. For instance, [7] used (9) to establish learning rates for SVMs using the hinge loss and [9, 28] for SVMs using the least squares loss.
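As a numerical illustration of eigenvalue decay, the Python sketch below computes the eigenvalues of the normalized Gram matrix $\frac{1}{n}(k_{\gamma}(x_{i},x_{j}))_{i,j}$ of a Gaussian RBF kernel. Using these as empirical proxies for $\lambda_{i}(T_{k})$ is a common heuristic and an assumption of this sketch, not a step of the analysis; the function name is illustrative.

```python
import numpy as np

def empirical_spectrum(X, gamma):
    """Eigenvalues of K/n for the Gaussian RBF kernel, in decreasing order."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(X**2, axis=1)[None, :] - 2.0 * X @ X.T
    K = np.exp(-np.maximum(sq, 0.0) / gamma**2)
    # K is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order
    return np.linalg.eigvalsh(K / len(X))[::-1]
```

Because $k_{\gamma}(x,x)=1$, the returned eigenvalues sum to $1$, which mirrors the summability of $(\lambda_{i}(T_{k}))_{i\geq 1}$; plotting them on a logarithmic scale typically shows a rapid decay.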

Another way to control the capacity of RKHS $H$ is based on the concept of covering numbers or the inverse of covering numbers, namely, entropy numbers. To recall the latter, see [37, Definition A.5.26], let $T:E\to F$ be a bounded, linear operator between the Banach spaces $E$ and $F$ , and $i\geq 1$ be an integer. Then the $i$ -th (dyadic) entropy number of $T$ is defined by

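$$e_{i}(T):=\inf\Bigl\{\epsilon>0:\exists\,x_{1},\ldots,x_{2^{i-1}}\ \text{such that}\ TB_{E}\subset\bigcup_{j=1}^{2^{i-1}}(x_{j}+\epsilon B_{F})\Bigr\}.$$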

In the Hilbert space case, the eigenvalues and entropy number decay are closely related. For example, [36] showed that (9) is equivalent (modulo a constant only depending on $p$ ) to

$$e_{i}(\mathrm{id}:H\to L_{2}(P_{X}))\leq a\,i^{-\frac{1}{2p}},\qquad i\geq 1.\tag{10}$$

It is further shown in [36] that (10) implies a bound on average entropy numbers, that is, for the empirical distribution associated with the data set $D_{X}:=(x_{1},\cdots,x_{n})\in X^{n}$, the average entropy number satisfies

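$$\mathbb{E}_{D_{X}\sim P_{X}^{n}}\,e_{i}(\mathrm{id}:H\to L_{2}(\mathrm{D}_{X}))\leq a\,i^{-\frac{1}{2p}},\qquad i\geq 1,$$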

which is used in [37, Theorem 7.24] to establish the general oracle inequality for SVMs. A bound of the form (10) was also established by [37, Theorem 6.27] for Gaussian RBF kernels and certain distributions $P_{X}$ having unbounded support. To be more precise, let $X\subset\mathbb{R}^{d}$ be the closed unit Euclidean ball. Then for all $\gamma\in(0,1]$ and $p\in(0,1)$, there exists a constant $c_{p,d}(X)$ such that

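$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq c_{p,d}(X)\,\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}},$$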

which has been used by [17] to establish learning rates for least squares SVMs. Note that the constant $c_{p,d}(X)$ depends on $p$ in an unknown manner. To address this issue, we use [47, Lemma 4.5] and derive an improved entropy number bound in the following theorem by establishing an upper bound for $c_{p,d}(X)$ whose dependence on $p$ is explicitly known. We will further see in Corollary 8 that this improved bound leads to better learning rates than those obtained by [17].

Theorem 5.

Let $X\subset\mathbb{R}^{d}$ be a closed Euclidean ball. Then there exists a constant $K>0$, such that, for all $p\in(0,1)$, $\gamma\in(0,1]$ and $i\geq 1$, we have

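$$e_{i}(\mathrm{id}:H_{\gamma}(X)\to\ell_{\infty}(X))\leq(3K)^{\frac{1}{p}}\left(\frac{d+1}{ep}\right)^{\frac{d+1}{p}}\gamma^{-\frac{d}{p}}\,i^{-\frac{1}{p}}$$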

Another requirement for establishing learning rates is to bound the approximation error function considering RKHS $H_{\gamma}$ for Gaussian RBF kernel $k_{\gamma}$ . If the distribution $P$ is such that $\mathcal{R}_{L_{\tau},P}^{\star}<\infty$ , then the approximation error function $\mathcal{A}:[0,\infty)\to[0,\infty)$ is defined by

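$$\mathcal{A}(\lambda):=\inf_{f\in H_{\gamma}}\lambda\|f\|_{H_{\gamma}}^{2}+\mathcal{R}_{L_{\tau},P}(f)-\mathcal{R}_{L_{\tau},P}^{\star}.$$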

For $\lambda>0$ , the approximation error function $\mathcal{A}(\lambda)$ quantifies how well an infinite sample $L_{2}$ -SVM with RKHS $H_{\gamma}$ , that is, $\lambda\|f\|_{H_{\gamma}}^{2}+\mathcal{R}_{L_{\tau},P}(f)$ approximates the optimal risk $\mathcal{R}_{L_{\tau},P}^{\star}$ . By [37, Lemma 5.15], one can show that $\lim_{\lambda\to 0}\mathcal{A}(\lambda)=0$ if $H_{\gamma}$ is dense in $L_{2}(P_{X})$ . In general, however, the speed of convergence can not be faster than $O(\lambda)$ and this rate is achieved, if and only if, there exists an $f\in H_{\gamma}$ such that $\mathcal{R}_{L_{\tau},P}(f)=\mathcal{R}_{L_{\tau},P}^{\star}$ , see [37, Lemma 5.18].

In order to bound $\mathcal{A}(\lambda)$, we first need to know one important feature of the target function $f_{L_{\tau},P}^{\star}$, namely its regularity, which, roughly speaking, measures the smoothness of the target function. Different function space norms, e.g. Hölder norms, Besov norms or Triebel-Lizorkin norms, can be used to capture this regularity. In this work, following [17, 27], we assume that the target function $f_{L_{\tau},P}^{\star}$ is in a Sobolev or a Besov space. Recall from [45, Definition 5.1] and [2, Definition 3.1 and 3.2] that for any integer $k\geq 0$, $1\leq p\leq\infty$ and a subset $\Omega\subset\mathbb{R}^{d}$ with non-empty interior, the Sobolev space $W_{p}^{k}(\Omega)$ of order $k$ is defined by

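$$W_{p}^{k}(\Omega):=\bigl\{f\in L_{p}(\Omega):D^{(\alpha)}f\in L_{p}(\Omega)\ \text{exists for all}\ \alpha\in\mathbb{N}_{0}^{d}\ \text{with}\ |\alpha|\leq k\bigr\},$$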

with the norm

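$$\|f\|_{W_{p}^{k}(\Omega)}:=\begin{cases}\Bigl(\sum_{|\alpha|\leq k}\|D^{(\alpha)}f\|_{L_{p}(\Omega)}^{p}\Bigr)^{\frac{1}{p}}, & \text{if } p\in[1,\infty),\\ \max_{|\alpha|\leq k}\|D^{(\alpha)}f\|_{L_{\infty}(\Omega)}, & \text{if } p=\infty,\end{cases}$$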

where $D^{(\alpha)}$ is the $\alpha$-th weak partial derivative for the multi-index $\alpha=(\alpha_{1},\ldots,\alpha_{d})\in\mathbb{N}_{0}^{d}$ of modulus $\lvert\alpha\rvert=\lvert\alpha_{1}\rvert+\cdots+\lvert\alpha_{d}\rvert$. In other words, the Sobolev space is the space of functions with sufficiently many derivatives, equipped with a norm that measures both the size and the regularity of the contained functions. Note that $W_{p}^{k}(\Omega)$ is a Banach space, see [45, Lemma 5.2]. Moreover, by [2, Theorem 3.6], $W_{p}^{k}(\Omega)$ is separable if $p\in[1,\infty)$, and is uniformly convex and reflexive if $p\in(1,\infty)$. Furthermore, for $p=2$, $W_{2}^{k}(\Omega)$ is a separable Hilbert space that we denote by $H_{k}(\Omega)$. Despite these advantages, Sobolev spaces cannot be applied directly when the smoothness index is non-integral or when $p<1$; smoothness spaces for these extended parameters are, however, also needed in nonlinear approximation. This shortcoming of Sobolev spaces is addressed by Besov spaces, which bring together all functions whose modulus of smoothness shows a common behavior. Let us first recall from [16, Section 2] and [15, Section 2] that for a subset $\Omega\subset\mathbb{R}^{d}$ with non-empty interior, a function $f:\Omega\to\mathbb{R}$ with $f\in L_{p}(\Omega)$ for some $p\in(0,\infty]$, and $s\in\mathbb{N}$, the modulus of smoothness of order $s$ of $f$ is defined by

[TABLE]

where the $s$ -th difference $\triangle_{h}^{s}(f,\cdot)$ given by

[TABLE]

for $h\in\mathbb{R}^{d}$ , is used to measure the smoothness. Note that $w_{s,L_{p}(\Omega)}(f,t)\to 0$ as $t\to 0$ , and the faster this convergence to $0$ , the smoother $f$ is. For more details on properties of the modulus of smoothness, we refer the reader to [30, Chapter 4.2]. Now for $0<p,q\leq\infty$ , $\alpha>0$ , $s:=\lfloor\alpha\rfloor+1$ , the Besov space $B_{p,q}^{\alpha}(\Omega)$ based on the modulus of smoothness for the domain $\Omega\subset\mathbb{R}^{d}$ , see for instance [14, Section 4.5], [30, Chapter 4.3] and [16, Section 2], is defined by

[TABLE]

where the semi-norm $\lvert\cdot\rvert_{B_{p,q}^{\alpha}(\Omega)}$ is given by

[TABLE]

and for $q=\infty$ , the semi-norm $\lvert\cdot\rvert_{B_{p,q}^{\alpha}(\Omega)}$ is defined by

[TABLE]

In other words, Besov spaces are collections of functions $f$ with common smoothness. For a more general definition of Besov-like spaces, we refer to [27, Section 4.1]. Note that $\|f\|_{B_{p,q}^{\alpha}(\Omega)}:=\|f\|_{L_{p}(\Omega)}+\lvert f\rvert_{B_{p,q}^{\alpha}(\Omega)}$ is the norm of $B_{p,q}^{\alpha}(\Omega)$ , see e.g. [16, Section 2] and [15, Section 2]. Furthermore, for $p>1$ different values of $s>\alpha$ give equivalent norms of $B_{p,q}^{\alpha}(\Omega)$ , which remains true for $p<1$ , see [16, Section 2]. It is well known, see e.g. [30, Section 4.1], that $W_{p}^{s}(\Omega)\subset B_{p,\infty}^{s}(\Omega)$ for all $1\leq p\leq\infty$ , $p\neq 2$ , whereas for $p=q=2$ the Besov space coincides with the Sobolev space.
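To make the modulus of smoothness concrete, the following Python sketch evaluates it numerically in one dimension. It assumes the standard forward-difference form $\triangle_{h}^{s}(f,x)=\sum_{j=0}^{s}\binom{s}{j}(-1)^{s-j}f(x+jh)$ , set to zero whenever one of the points $x+jh$ leaves $\Omega$ , and it approximates both the supremum over $\lvert h\rvert\leq t$ and the $L_{p}$ norm on a grid. The function names, the choice $\Omega=[0,1]$ and the grid sizes are illustrative assumptions and not part of the paper.

```python
import numpy as np
from math import comb

def sth_difference(f, x, h, s, lo=0.0, hi=1.0):
    """s-th forward difference of f at the points x with step h on Omega = [lo, hi]."""
    x = np.asarray(x, dtype=float)
    # On an interval it suffices to check the two end points x and x + s*h.
    inside = (x >= lo) & (x <= hi) & (x + s * h >= lo) & (x + s * h <= hi)
    total = np.zeros_like(x)
    for j in range(s + 1):
        total += comb(s, j) * (-1) ** (s - j) * f(x + j * h)
    # The difference is defined as 0 when the stencil leaves Omega.
    return np.where(inside, total, 0.0)

def modulus_of_smoothness(f, t, s, p=2.0, lo=0.0, hi=1.0, n_x=2001, n_h=41):
    """Grid approximation of sup_{|h| <= t} ||Delta_h^s f||_{L_p([lo, hi])}."""
    x = np.linspace(lo, hi, n_x)
    best = 0.0
    for h in np.linspace(-t, t, n_h):
        d = np.abs(sth_difference(f, x, h, s, lo, hi))
        # Riemann-type approximation of the L_p norm on [lo, hi].
        norm = (np.mean(d ** p) * (hi - lo)) ** (1.0 / p)
        best = max(best, norm)
    return best
```

For example, a polynomial of degree less than $s$ has vanishing $s$ -th differences, so its approximate modulus is zero up to rounding, whereas a function with a kink produces a modulus that decays only slowly as $t\to 0$ .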

In the next step, we find a function $f_{0}\in H_{\gamma}$ such that both the regularization term $\lambda\|f_{0}\|_{H_{\gamma}}^{2}$ and the excess risk $\mathcal{R}_{L_{\tau},P}(f_{0})-\mathcal{R}_{L_{\tau},P}^{\star}$ are small. For this, we define the function $K_{\gamma}:\mathbb{R}^{d}\to\mathbb{R}$ , see [17], by

[TABLE]

for all $r\in\mathbb{N}$ , $\gamma>0$ and $x\in\mathbb{R}^{d}$ . Additionally, we assume that there exists a function $f_{L_{\tau},P}^{\star}:\mathbb{R}^{d}\to\mathbb{R}$ that satisfies $f_{L_{\tau},P}^{\star}\in L_{2}(\mathbb{R}^{d})\cap L_{\infty}(\mathbb{R}^{d})$ and $\mathcal{R}_{L_{\tau},P}(f_{L_{\tau},P}^{\star})=\mathcal{R}_{L_{\tau},P}^{\star}$ . Then $f_{0}$ is defined by

[TABLE]

With these preparations, we now establish an upper bound for the approximation error function $\mathcal{A}(\lambda)$ .

Theorem 6.

Let $L_{\tau}$ be the ALS loss defined by (1), $\mathrm{P}$ be the probability distribution on $\mathbb{R}^{d}\times Y$ , and $P_{X}$ be the marginal distribution of $P$ onto $\mathbb{R}^{d}$ such that $X:=\mathrm{supp}\,\mathrm{P}_{X}$ and $P_{X}(\partial X)=0$ . Moreover, assume that the conditional $\tau$ -expectile $f_{L_{\tau},P}^{\star}$ satisfies $f_{L_{\tau},P}^{\star}\in L_{2}(\mathbb{R}^{d})\cap L_{\infty}(\mathbb{R}^{d})$ as well as $f_{L_{\tau},P}^{\star}\in B_{2,\infty}^{\alpha}(P_{X})$ for some $\alpha\geq 1$ . In addition, assume that $k_{\gamma}$ is the Gaussian RBF kernel over $X$ with associated RKHS $H_{\gamma}$ . Then for all $\gamma\in(0,1]$ and $\lambda>0$ , we have

[TABLE]

where $C_{\tau,s}>0$ is a constant depending on $s$ and $\tau$ , and $C_{1}>0$ is a constant.

Clearly, the upper bound on the approximation error function in Theorem 6 depends on the regularization parameter $\lambda$ , the kernel width $\gamma$ , and the smoothness parameter $\alpha$ of the target function $f_{L_{\tau},P}^{\star}$ . Note that in order to shrink the right-hand side we need to let $\gamma\to 0$ . However, this would let the first term go to infinity unless we simultaneously let $\lambda\to 0$ sufficiently fast. Now using [37, Theorem 7.24] together with Lemma 4, Theorem 6 and the entropy number bound (12), we establish an oracle inequality for SVMs using $L_{\tau}$ in the following theorem.

Theorem 7.

Consider the assumptions of Theorem 6 and additionally assume that $Y:=[-M,M]$ for $M\geq 1$ . Then, for all $n\geq 1,\varrho\geq 1,\gamma\in(0,1)$ and $\lambda\in(0,e^{-2}]$ , the SVM using the RKHS $H_{\gamma}$ and the ALS loss function $L_{\tau}$ satisfies

[TABLE]

with probability $P^{n}$ not less than $1-3e^{-\varrho}$ . Here $C>0$ is some constant independent of $p,\lambda,\gamma,n$ and $\varrho$ .

It is well known that there exists a relationship between Sobolev spaces and the scale of Besov spaces, that is, $B_{p,u}^{\alpha}(\mathbb{R}^{d})\hookrightarrow W_{p}^{\alpha}(\mathbb{R}^{d})\hookrightarrow B_{p,v}^{\alpha}(\mathbb{R}^{d})$ , whenever $1\leq u\leq\min\{p,2\}$ and $\max\{p,2\}\leq v\leq\infty$ , see for instance [18, p.25 and p.44]. In particular, for $p=u=v=2$ , we have $W_{2}^{\alpha}(\mathbb{R}^{d})=B_{2,2}^{\alpha}(\mathbb{R}^{d})$ with equivalent norms. In addition, by [17, p.7] we have $B_{p,q}^{\alpha}(\mathbb{R}^{d})\subset B_{p,q}^{\alpha}(P_{X})$ . Thus, Theorem 7 also holds for decision functions $f_{L_{\tau},P}^{\star}:\mathbb{R}^{d}\to\mathbb{R}$ with $f_{L_{\tau},P}^{\star}\in L_{2}(\mathbb{R}^{d})\cap L_{\infty}(\mathbb{R}^{d})$ and $f_{L_{\tau},P}^{\star}\in W_{2}^{\alpha}(\mathbb{R}^{d})$ .

By choosing suitable values for $\lambda$ and $\gamma$ that depend on the sample size $n$ , the smoothness parameter $\alpha$ , and the dimension $d$ , we obtain learning rates for the learning problem (3) in the following corollary.

Corollary 8.

Under the assumptions of Theorem 7 and with

$$\lambda_{n}=c_{1}n^{-1}\qquad\text{and}\qquad\gamma_{n}=c_{2}n^{-\frac{1}{2\alpha+d}}\,,$$

where $c_{1}>0$ and $c_{2}>0$ are user-specified constants, we have, for all $n\geq 1$ and $\varrho\geq 1$ ,

[TABLE]

with probability $P^{n}$ not less than $1-3e^{-\varrho}$ .
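The parameter choice of Corollary 8 is easy to evaluate. The sketch below uses the sequences $\lambda_{n}=c_{1}n^{-1}$ and $\gamma_{n}=c_{2}n^{-\frac{1}{2\alpha+d}}$ from the proof of Corollary 8 and, as a rough guide only, the rate envelope $n^{-\frac{2\alpha}{2\alpha+d}}(\log n)^{d+1}$ without its leading constant. The function names and the default values of $c_{1}$ and $c_{2}$ are illustrative assumptions.

```python
import math

def corollary8_parameters(n, alpha, d, c1=1.0, c2=1.0):
    """Return (lambda_n, gamma_n) for sample size n, smoothness alpha, dimension d."""
    lam = c1 / n
    gamma = c2 * n ** (-1.0 / (2.0 * alpha + d))
    return lam, gamma

def rate_envelope(n, alpha, d):
    """n^{-2 alpha / (2 alpha + d)} * (log n)^{d + 1}, i.e. the rate up to constants."""
    return n ** (-2.0 * alpha / (2.0 * alpha + d)) * math.log(n) ** (d + 1)

if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        lam, gamma = corollary8_parameters(n, alpha=2.0, d=1)
        print(n, lam, gamma, rate_envelope(n, alpha=2.0, d=1))
```

Because the logarithmic factor grows with $d$ , the envelope only starts to decrease for fairly large $n$ in higher dimensions, which is worth keeping in mind when reading the rate for moderate sample sizes.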

Note that the learning rates in Corollary 8 depend on the choice of $\lambda_{n}$ and $\gamma_{n}$ , where the kernel width $\gamma_{n}$ requires knowing $\alpha$ which, in practice, is not available. However, [37, Chapter 7.4], [41], [17] and [38] showed that one can achieve the same learning rates adaptively, i.e. without knowing $\alpha$ . Let us recall [37, Definition 6.28], which describes a method to select $\lambda$ and $\gamma$ that is, in some sense, a simplification of the cross-validation method.

Definition 9.

Let $H_{\gamma}$ be an RKHS over $X$ and $\Lambda:=(\Lambda_{n})$ and $\Gamma:=(\Gamma_{n})$ be sequences of finite subsets $\Lambda_{n},\Gamma_{n}\subset(0,1]$ . Given a data set $D:=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))\in(X\times\mathbb{R})^{n}$ , we define

[TABLE]

where $m=\lfloor\frac{n}{2}\rfloor+1$ and $n\geq 4$ . Then use $D_{1}$ as a training set to compute the SVM decision function

[TABLE]

and use $D_{2}$ to determine $(\lambda,\gamma)$ by choosing $(\lambda_{D_{2}},\gamma_{D_{2}})\in\Lambda_{n}\times\Gamma_{n}$ such that

[TABLE]

Every learning method that produces the resulting decision functions $\wideparen{f}_{D_{1},\lambda_{D_{2}},\gamma_{D_{2}}}$ is called a training-validation SVM with respect to $(\Lambda,\Gamma)$ .
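The training-validation procedure of Definition 9 can be sketched in a few lines of Python. The sketch rests on several assumptions that are not taken from the paper: $D_{1}$ consists of the first $m$ samples and $D_{2}$ of the remaining ones, the Gaussian kernel is written as $\exp(-\|x-x^{\prime}\|_{2}^{2}/\gamma^{2})$ , the regularized empirical ALS risk is minimized by a simple iteratively reweighted least squares loop, and the validation risk is computed for the decision function clipped at $M$ . All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    # Assumed convention: k_gamma(x, x') = exp(-||x - x'||^2 / gamma^2).
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / gamma ** 2)

def als_loss(y, t, tau):
    # Asymmetric least squares loss: weight tau if y >= t, else 1 - tau.
    return np.where(y >= t, tau, 1.0 - tau) * (y - t) ** 2

def fit_expectile_svm(X, y, lam, gamma, tau, max_iter=100):
    # Minimizes lam * ||f||_H^2 + (1/n) * sum_i L_tau(y_i, f(x_i)) over
    # f = sum_j a_j k(x_j, .) by iteratively reweighted least squares.
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    w = np.full(n, 0.5)
    for _ in range(max_iter):
        # For fixed weights w the minimizer solves (W K + n lam I) a = W y.
        a = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * y)
        w_new = np.where(y >= K @ a, tau, 1.0 - tau)
        if np.array_equal(w_new, w):  # weights are stable: a is the minimizer
            break
        w = w_new
    return a

def tv_svm(X, y, lambdas, gammas, tau, M):
    # Split D into D_1 (training) and D_2 (validation) with m = floor(n/2) + 1.
    n = len(y)
    m = n // 2 + 1
    X1, y1, X2, y2 = X[:m], y[:m], X[m:], y[m:]
    best = None
    for lam in lambdas:
        for gamma in gammas:
            a = fit_expectile_svm(X1, y1, lam, gamma, tau)
            # Clip the decision function at M before computing the validation risk.
            pred = np.clip(gaussian_kernel(X2, X1, gamma) @ a, -M, M)
            risk = float(als_loss(y2, pred, tau).mean())
            if best is None or risk < best["risk"]:
                best = {"risk": risk, "lam": lam, "gamma": gamma, "coef": a}
    return best
```

The returned dictionary contains the selected pair $(\lambda_{D_{2}},\gamma_{D_{2}})$ together with the coefficients of the corresponding decision function trained on $D_{1}$ . The dense linear solve costs $O(m^{3})$ per candidate pair, so this sketch is only meant for small data sets.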

In the next theorem, we use this training-validation SVM (TV-SVM) approach for suitable candidate sets $\Lambda_{n}:=(\lambda_{1},\ldots,\lambda_{r})$ and $\Gamma_{n}:=(\gamma_{1},\ldots,\gamma_{s})$ with $\lambda_{r}=\gamma_{s}=1$ , and establish learning rates similar to (16).

Theorem 10.

With the assumptions of Theorem 7, let $\Lambda:=(\Lambda_{n})$ and $\Gamma:=(\Gamma_{n})$ be sequences of finite subsets $\Lambda_{n},\Gamma_{n}\subset(0,1]$ such that $\Lambda_{n}$ is an $n^{-1}$ -net of $(0,1]$ and $\Gamma_{n}$ is an $n^{-\frac{1}{2\alpha+d}}$ -net of $(0,1]$ with polynomially growing cardinalities $|\Lambda_{n}|$ and $|\Gamma_{n}|$ in $n$ . Then for all $\varrho\geq 1$ , the TV-SVM produces a decision function $f_{D_{1},\lambda_{D_{2}},\gamma_{D_{2}}}$ that satisfies

[TABLE]

where $C>0$ is a constant independent of $n$ and $\varrho$ .
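Candidate sets of the kind required in Theorem 10 can be built from equispaced grids. The following sketch constructs a finite $\delta$ -net of $(0,1]$ that contains the end point $1$ , so that $\lambda_{r}=\gamma_{s}=1$ , and uses the net widths $n^{-1}$ and $n^{-\frac{1}{2\alpha+d}}$ as stated in the theorem. The equispaced construction and the function names are illustrative assumptions; any other net with polynomially growing cardinality would serve equally well.

```python
import numpy as np

def delta_net(delta):
    """Finite delta-net of (0, 1] that contains the right end point 1."""
    if delta <= 0.0:
        raise ValueError("delta must be positive")
    # Points delta, 3*delta, 5*delta, ... cover (0, 1) with radius delta.
    pts = np.arange(delta, 1.0, 2.0 * delta)
    return np.unique(np.append(pts, 1.0))

def tv_candidate_sets(n, alpha, d):
    """Candidate sets (Lambda_n, Gamma_n) with the net widths of Theorem 10."""
    lambdas = delta_net(1.0 / n)
    gammas = delta_net(n ** (-1.0 / (2.0 * alpha + d)))
    return lambdas, gammas
```

With this construction $|\Lambda_{n}|$ grows like $n/2$ and $|\Gamma_{n}|$ like $n^{\frac{1}{2\alpha+d}}/2$ , so both cardinalities are polynomial in $n$ .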

So far we have only considered the case of bounded noise with known bounds, that is, $Y=[-M,M]$ where $M>0$ is known. In practice, $M$ is usually unknown and in this situation, one can still achieve the same learning rates by simply increasing $M$ slowly. However, more interesting is the case of unbounded noise. In the following we treat this case for distributions for which there exist constants $c\geq 1$ and $l>0$ such that

[TABLE]

for all $\varrho>1$ . In other words, the tails of the response variable $Y$ decay sufficiently fast. It is shown in [17] by examples that such an assumption is realistic. For instance, if $P(\cdot|x)\sim N(\mu(x),1)$ , the assumption (17) is satisfied for $l=\frac{1}{2}$ , see [17, Example 3.7], and for the case where $P(\cdot|x)$ has a density whose tails decay like $e^{-\lvert t\rvert}$ , the assumption (17) holds for $l=1$ , see [17, Example 3.8].
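The Gaussian example can be checked numerically. The sketch below assumes that condition (17) has the form $P(\lvert Y\rvert\leq c\varrho^{l})\geq 1-e^{-\varrho}$ , which is an assumption about the displayed condition, and it simplifies the example to $\mu(x)=0$ , so that $Y\sim N(0,1)$ . The constant $c=2$ and the function name are hypothetical choices for illustration.

```python
import math

def gaussian_tail_ok(rho, c=2.0, l=0.5):
    """Check P(|Z| <= c * rho**l) >= 1 - exp(-rho) for Z ~ N(0, 1)."""
    threshold = c * rho ** l
    # For a standard normal Z we have P(|Z| > u) = erfc(u / sqrt(2)).
    tail = math.erfc(threshold / math.sqrt(2.0))
    return tail <= math.exp(-rho)

if __name__ == "__main__":
    for rho in (1.0, 2.0, 5.0, 10.0, 50.0):
        print(rho, gaussian_tail_ok(rho))
```

With $l=\frac{1}{2}$ the threshold grows like $\sqrt{\varrho}$ , which matches the Gaussian tail $e^{-u^{2}/2}$ ; a constant $c$ that is too small, for example $c=1$ , violates the inequality for moderate $\varrho$ .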

With this additional assumption, we present learning rates for the case of unbounded noise in the following theorem.

Theorem 11.

Let $Y\subset\mathbb{R}$ and $\mathrm{P}$ be a probability distribution on $\mathbb{R}^{d}\times Y$ such that $X:=\mathrm{supp}\,\mathrm{P}_{X}\subset B_{l_{2}^{d}}$ . Moreover, assume that the $\tau$ -expectile $f_{L_{\tau},P}^{\star}$ satisfies $f_{L_{\tau},P}^{\star}(x)\in[-1,1]$ for $\mathrm{P}_{X}$ -almost all $x\in X$ , and both $f_{L_{\tau},P}^{\star}\in L_{2}(\mathbb{R}^{d})\cap L_{\infty}(\mathbb{R}^{d})$ and $f_{L_{\tau},P}^{\star}\in B_{2,\infty}^{\alpha}(P_{X})$ for some $\alpha\geq 1$ . In addition, assume that (17) holds for all $\varrho\geq 1$ . We define

[TABLE]

where $c_{1}>0$ and $c_{2}>0$ are user-specified constants. Moreover, for some fixed $\hat{\varrho}\geq 1$ and $n\geq 3$ we define $\varrho:=\hat{\varrho}+\ln n$ and $M_{n}:=2c\varrho^{l}$ . Furthermore, we consider the SVM that clips the decision function $f_{D,\lambda_{n},\gamma_{n}}$ at $M_{n}$ after training. Then there exists a $C>0$ independent of $n$ , $p$ and $\hat{\varrho}$ such that

[TABLE]

holds with probability $P^{n}$ not less than $1-2e^{-\hat{\varrho}}$ .
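The clipping level used in Theorem 11 grows only logarithmically in $n$ . As a small sketch, the following Python functions compute $\varrho=\hat{\varrho}+\ln n$ and $M_{n}=2c\varrho^{l}$ and clip given predictions at $\pm M_{n}$ ; the function names are illustrative, and the constants $c$ and $l$ have to be supplied by the user according to the tail assumption (17).

```python
import math
import numpy as np

def clipping_level(n, rho_hat, c, l):
    """M_n = 2 * c * rho**l with rho = rho_hat + ln(n)."""
    rho = rho_hat + math.log(n)
    return 2.0 * c * rho ** l

def clip_predictions(pred, M_n):
    """Clip decision function values to the interval [-M_n, M_n]."""
    return np.clip(np.asarray(pred, dtype=float), -M_n, M_n)

if __name__ == "__main__":
    M_n = clipping_level(n=1000, rho_hat=1.0, c=1.0, l=0.5)
    print(M_n, clip_predictions([-10.0, 0.5, 10.0], M_n))
```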

Note that the assumption (17) on the tail of the distribution does not influence the learning rates achieved in Corollary 8. Furthermore, we can also achieve the same rates adaptively using the TV-SVM approach considered in Theorem 10, provided that we have an upper bound on the unknown parameter $l$ , which depends on the distribution $P$ , see [17] where this dependency is explained with some examples.

Let us now compare our results with the oracle inequalities and learning rates established by [17] for least squares SVMs. This comparison is justifiable because a) the least squares loss is a special case of the $L_{\tau}$ -loss for $\tau=0.5$ , b) the target function $f_{L_{\tau},P}^{\star}$ is assumed to be in a Sobolev or Besov space, similar to [17], and c) the supremum and the variance bounds for $L_{\tau}$ with $\tau=0.5$ are the same as the ones used by [17]. Furthermore, recall that [17] used the entropy number bounds (11) to control the capacity of the RKHS $H_{\gamma}$ , which contain a constant $c_{p,d}(X)$ depending on $p$ in an unknown manner. As a result, they obtained a leading constant $C$ in their oracle inequality, see [17, Theorem 3.1], for which no upper bound can be determined explicitly. We address this problem by establishing an improved entropy number bound (12), which not only provides an upper bound for $c_{p,d}(X)$ but also helps to determine the value of the constant $C$ in the oracle inequality (15) explicitly. As a consequence we can improve their learning rates of the form $n^{-\frac{2\alpha}{2\alpha+d}+\xi}\,$ , where $\xi>0$ , to

$$n^{-\frac{2\alpha}{2\alpha+d}}(\log n)^{d+1}\,.$$

In other words, the nuisance parameter $n^{\xi}$ from [17] is replaced by the logarithmic term $(\log n)^{d+1}$ . Moreover, our learning rates, up to this logarithmic term, are minimax optimal, see e.g. the discussion in [17]. Finally note that unlike [17] we have not only established learning rates for the least squares case $\tau=0.5$ but actually for all $\tau\in(0,1)$ .

4 Proofs

4.1 Proofs of Section 2

Proof of Lemma 2. We define $\psi:\mathbb{R}\to\mathbb{R}$ by

[TABLE]

Clearly, $\psi$ is convex and thus [37, Lemma A.6.5] shows that $\psi$ is locally Lipschitz continuous. Moreover, we have

[TABLE]

where $C_{\tau}:=\max\{\tau,1-\tau\}$ . A simple consideration shows that this estimate is also sharp. ∎
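The local Lipschitz property can also be checked numerically. The sketch below assumes that the estimate in the proof amounts to the bound $4C_{\tau}M$ for labels and predictions in $[-M,M]$ , which is what differentiating the loss suggests: the derivative in $t$ has absolute value at most $2C_{\tau}\lvert y-t\rvert\leq 4C_{\tau}M$ . The grid-based search and the function names are illustrative assumptions.

```python
import numpy as np

def als_loss(y, t, tau):
    # Asymmetric least squares loss: weight tau if y >= t, else 1 - tau.
    return np.where(y >= t, tau, 1.0 - tau) * (y - t) ** 2

def max_lipschitz_ratio(tau, M, n_grid=61):
    """Largest difference quotient |L(y,t1) - L(y,t2)| / |t1 - t2| on a grid in [-M, M]."""
    g = np.linspace(-M, M, n_grid)
    y, t1, t2 = np.meshgrid(g, g, g, indexing="ij")
    mask = t1 != t2
    num = np.abs(als_loss(y, t1, tau) - als_loss(y, t2, tau))
    return float((num[mask] / np.abs(t1 - t2)[mask]).max())

if __name__ == "__main__":
    tau, M = 0.8, 1.5
    print(max_lipschitz_ratio(tau, M), 4.0 * max(tau, 1.0 - tau) * M)
```

On a fine grid the largest difference quotient approaches $4C_{\tau}M$ from below, in line with the remark that the estimate is sharp.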

In order to prove Theorem 3, recall that the risk $\mathcal{R}_{L_{\tau},P}(f)$ in (2) uses the regular conditional probability $P(y|x)$ , which enables us to compute $\mathcal{R}_{L_{\tau},P}(f)$ by treating the inner and the outer integrals separately. Following [37, Definition 3.3, Definition 3.4], we therefore use inner $L_{\tau}$ -risks as a key ingredient for establishing self-calibration inequalities.

Definition 12.

Let $L_{\tau}:Y\times\mathbb{R}\to[0,\infty)$ be the ALS loss function defined by (1) and $Q$ be a distribution on $Y=[-M,M]$ . Then the inner $L_{\tau}$ -risks of $Q$ are defined by

[TABLE]

and the minimal inner $L_{\tau}$ -risk is

[TABLE]

In the latter definition, the inner risks $\mathcal{C}_{L_{\tau},Q}(\cdot)$ for suitable classes of distributions $Q$ on $Y$ are considered as a template for $\mathcal{C}_{L_{\tau},P(\cdot|x)}(\cdot)$ . From this, we immediately obtain the risk of a function $f$ , i.e. $\mathcal{R}_{L_{\tau},P}(f)=\int_{X}\mathcal{C}_{L_{\tau},P(\cdot|x)}(f(x))\,dP_{X}(x)$ . Moreover, by [37, Lemma 3.4], the optimal risk $\mathcal{R}_{L_{\tau},P}^{\star}$ can be obtained by minimizing the inner $L_{\tau}$ -risks, i.e. $\mathcal{R}_{L_{\tau},P}^{\star}=\int_{X}\mathcal{C}_{L_{\tau},P(\cdot|x)}^{\star}\,dP_{X}(x)$ . Consequently, the excess $L_{\tau}$ -risk, when $\mathcal{R}_{L_{\tau},P}^{\star}<\infty$ , is obtained by

[TABLE]

Besides some technical advantages, this approach makes the analysis rather independent of the specific distribution $\mathrm{P}$ . In the following theorem, we use this approach and establish the lower and the upper bound of excess inner $L_{\tau}$ -risks.

Theorem 13.

Let $L_{\tau}$ be the ALS loss function defined by (1) and $Q$ be a distribution on $\mathbb{R}$ with $\mathcal{C}_{L_{\tau},Q}^{\star}<\infty$ . For a fixed $\tau\in(0,1)$ and for all $t\in\mathbb{R}$ , we have

[TABLE]

where $c_{\tau}:=\min\{\tau,1-\tau\}$ and $C_{\tau}$ is defined in Lemma 2.

Proof of Theorem 13. Let us fix $\tau\in(0,1)$ . Then for a distribution $Q$ on $\mathbb{R}$ satisfying $\mathcal{C}_{L_{\tau},Q}^{\star}<\infty$ , the $\tau$ -expectile $t^{\star}$ , according to [29], is the only solution of

[TABLE]

Let us now compute the excess inner risks of $L_{\tau}$ with respect to $Q$ . To this end, we fix a $t\geq t^{\star}$ . Then we have

[TABLE]

and

[TABLE]

By Definition 12 and using (22), we obtain

[TABLE]

and this leads to the following excess inner $L_{\tau}$ -risk

[TABLE]

Let us define $c_{\tau}:=\min\{\tau,1-\tau\}$ , then (23) leads to the following lower bound of excess inner $L_{\tau}$ -risk when $t\geq t^{\star}$ :

[TABLE]

Likewise, the excess inner $L_{\tau}$ -risk when $t<t^{\star}$ is

[TABLE]

which also leads to the lower bound (24). Now, for the proof of the upper bound of the excess inner $L_{\tau}$ -risks, we define $C_{\tau}:=\max\{\tau,1-\tau\}$ . Then (23) leads to the following upper bound of the excess inner $L_{\tau}$ -risks when $t\geq t^{\star}$ :

[TABLE]

Analogously, for the case of $t<t^{\star}$ , (25) also leads to the upper bound (26) for excess inner $L_{\tau}$ -risks. ∎
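The two-sided bound can be illustrated on an empirical distribution. The sketch below assumes that the bounds of Theorem 13 read $c_{\tau}(t-t^{\star})^{2}\leq\mathcal{C}_{L_{\tau},Q}(t)-\mathcal{C}_{L_{\tau},Q}^{\star}\leq C_{\tau}(t-t^{\star})^{2}$ , takes $Q$ to be the empirical measure of a sample, and computes the $\tau$ -expectile by a fixed-point iteration on the first-order condition. The function names and the sample distribution are illustrative assumptions.

```python
import numpy as np

def inner_risk(y, t, tau):
    """Inner L_tau-risk of the empirical distribution of the sample y at t."""
    return float(np.mean(np.where(y >= t, tau, 1.0 - tau) * (y - t) ** 2))

def expectile(y, tau, max_iter=1000):
    """tau-expectile of the empirical distribution via reweighted means."""
    t = float(np.mean(y))
    for _ in range(max_iter):
        # First-order condition: t is the weighted mean with asymmetric weights.
        w = np.where(y >= t, tau, 1.0 - tau)
        t_new = float(np.sum(w * y) / np.sum(w))
        if t_new == t:
            break
        t = t_new
    return t

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y = rng.exponential(size=500)
    tau = 0.9
    t_star = expectile(y, tau)
    c_tau, C_tau = min(tau, 1 - tau), max(tau, 1 - tau)
    for t in (t_star - 1.0, t_star + 0.5, t_star + 2.0):
        excess = inner_risk(y, t, tau) - inner_risk(y, t_star, tau)
        print(c_tau * (t - t_star) ** 2, excess, C_tau * (t - t_star) ** 2)
```

For $\tau=0.5$ both constants equal $\frac{1}{2}$ and the inequalities collapse to the familiar identity for the least squares loss.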

Proof of Theorem 3. For a fixed $x\in X$ , we write $t:=f(x)$ and $t^{\star}:=f_{L_{\tau},P}^{\star}(x)$ . By Theorem 13, for $Q:=P(\cdot|x)$ , we then immediately obtain

[TABLE]

Integrating with respect to $P_{X}$ leads to the assertion. ∎

Proof of Lemma 4. i) Since $L_{\tau}$ can be clipped at $M$ and the conditional $\tau$ -expectile satisfies $f_{L_{\tau},P}^{\star}(x)\in[-M,M]$ almost surely, we have

[TABLE]

for all $f:X\to[-M,M]$ and all $(x,y)\in X\times Y$ .

ii) Using the local Lipschitz continuity of the loss $L_{\tau}$ and Theorem 3, we obtain

[TABLE]

∎

4.2 Proofs of Section 3

Proof of Theorem 5. By [47, Lemma 4.5], the $\|\cdot\|_{\infty}$ -log covering numbers of the unit ball $B_{\gamma}(X)$ of the Gaussian RKHS $H_{\gamma}(X)$ for all $\gamma\in(0,1)$ and $\varepsilon\in(0,\frac{1}{2})$ satisfy

[TABLE]

where $K>0$ is a constant depending only on $d$ . From this, we conclude that

[TABLE]

Let $h(\varepsilon):=\varepsilon^{p}\left(\log\frac{1}{\varepsilon}\right)^{d+1}$ . In order to obtain the maximal value of $h(\varepsilon)$ , we differentiate it with respect to $\varepsilon$ ,

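$$\frac{dh(\varepsilon)}{d\varepsilon}=p\,\varepsilon^{p-1}\Big(\log\frac{1}{\varepsilon}\Big)^{d+1}-(d+1)\,\varepsilon^{p-1}\Big(\log\frac{1}{\varepsilon}\Big)^{d}\,,$$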

and set $\frac{dh(\varepsilon)}{d\varepsilon}=0$ which gives

$$\varepsilon^{*}=e^{-\frac{d+1}{p}}\,.$$

By plugging $\varepsilon^{*}$ into $h(\varepsilon)$ , we obtain

$$h(\varepsilon^{*})=\Big(\frac{d+1}{ep}\Big)^{d+1}\,,$$

and consequently, $\|\cdot\|_{\infty}$ -log covering numbers (27) are

[TABLE]

where $a:=K\left(\frac{d+1}{ep}\right)^{d+1}\gamma^{-d}$ . Now, by the inverse implication of [37, Lemma 6.21], see also [37, Exercise 6.8], the bound on the entropy numbers of the Gaussian RBF kernel is

[TABLE]

for all $i\geq 1$ , $\gamma\in(0,1)$ . ∎
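The optimization step in this proof is easy to verify numerically. Setting the derivative of $h(\varepsilon)=\varepsilon^{p}\left(\log\frac{1}{\varepsilon}\right)^{d+1}$ to zero gives $\log\frac{1}{\varepsilon^{*}}=\frac{d+1}{p}$ and hence $h(\varepsilon^{*})=\left(\frac{d+1}{ep}\right)^{d+1}$ , which is the factor appearing in the constant $a$ . The following Python sketch compares this closed form with a brute-force maximization over a logarithmic grid; the function names and the grid are illustrative choices.

```python
import numpy as np

def h(eps, p, d):
    """h(eps) = eps^p * (log(1/eps))^(d+1) for eps in (0, 1)."""
    return eps ** p * np.log(1.0 / eps) ** (d + 1)

def h_max_closed_form(p, d):
    """Maximizer eps* = exp(-(d+1)/p) and maximum ((d+1)/(e*p))^(d+1)."""
    eps_star = np.exp(-(d + 1) / p)
    return eps_star, ((d + 1) / (np.e * p)) ** (d + 1)

if __name__ == "__main__":
    p, d = 0.5, 2
    eps_star, h_star = h_max_closed_form(p, d)
    grid = np.exp(np.linspace(-30.0, -0.01, 200001))
    print(eps_star, h_star, h(grid, p, d).max())
```

The closed form also shows how the constant deteriorates as $p\to 0$ : the maximum grows like $p^{-(d+1)}$ , which is why the dependence on $p$ has to be tracked explicitly in the oracle inequality.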

Proof of Theorem 6. The assumption $f_{L_{\tau},P}^{\star}\in L_{2}(\mathbb{R}^{d})$ and [17, Theorem 2.3] immediately yield that $f_{0}:=K*f_{L_{\tau},P}^{\star}\in H_{\gamma}$ , i.e. $f_{0}$ is contained in the RKHS $H_{\gamma}$ . Furthermore, [17, Theorem 2.3] leads to the following upper bound of the regularization term

[TABLE]

In the next step, we bound the excess risk. By [17, Theorem 2.2], the upper bound for $L_{2}(P_{X})$ -distance between $f_{0}$ and $f_{L_{\tau},P}^{\star}$ is

[TABLE]

where $C_{s,2}:=\sum_{i=0}^{\lceil 2s\rceil}\binom{\lceil 2s\rceil}{i}(2d)^{\frac{i}{2}}\prod_{j=1}^{i}(j-\frac{1}{2})^{\frac{1}{2}}$ , see [17, p.27], is a constant depending only on $s$ , and $g\in L_{2}(\mathbb{R}^{d})$ is the Lebesgue density. Now using Theorem 13 together with (28), we obtain

[TABLE]

where $C_{\tau,s}:=c^{2}\,C_{\tau}\,C_{s,2}\,\|g\|_{L_{2}(\mathbb{R}^{d})}$ . With these results, we finally obtain

[TABLE]

where $C_{1}:=(\sqrt{\pi})^{-d}(2^{r}-1)^{2}\|f_{L_{\tau},P}^{\star}\|_{L_{2}(\mathbb{R}^{d})}^{2}$ . ∎

In order to prove the main oracle inequality given in Theorem 7, we need the following lemma.

Lemma 14.

The function $h:(0,\frac{1}{2}]\to\mathbb{R}$ defined by

[TABLE]

is convex. Moreover, we have $\sup_{p\in(0,\frac{1}{2}]}h(p)=1$ .

Proof.

By considering the linear transformation $t:=2p$ , it suffices to show that the function $g:(0,1]\to\mathbb{R}$ defined by

[TABLE]

is convex. To this end, we first compute the first and second derivatives of $g(t)$ with respect to $t$ , that is:

[TABLE]

and

[TABLE]

Since $t\in(0,1]$ , it is not hard to see that all terms in $g^{\prime\prime}(t)$ are strictly positive. Thus $g^{\prime\prime}(t)>0$ and hence $g(t)$ is convex. Furthermore, by convexity of $g(t)$ , it is easy to find that

[TABLE]

∎

Proof of Theorem 7. The assumption $f_{L_{\tau},P}^{\star}\in L_{\infty}(\mathbb{R}^{d})$ and [17, Theorem 2.3] yield that

[TABLE]

holds for all $x\in X$ . This implies that, for all $(x,y)\in X\times Y$ , we have

[TABLE]

and hence we conclude that $B_{0}\geq 4M^{2}$ . Now, by plugging the result of Theorem 6 together with $a=(3K)^{\frac{1}{2p}}\Big{(}\frac{d+1}{ep}\Big{)}^{\frac{d+1}{2p}}$ from Theorem 5 and $V=16c_{\tau}^{-1}\,M^{2}$ from Lemma 4, into [37, Theorem 7.23], we obtain

[TABLE]

where $C_{1}$ and $C_{\tau,s}$ are from Theorem 6, $K(p)$ is a constant from [37, Theorem 7.23] that depends on $p$ , $C_{2}:=3456\,M^{2}\,C_{\tau}^{2}\,c_{\tau}^{-1}+60(M+2^{s}\|f_{L_{\tau},P}^{\star}\|_{L_{\infty}(\mathbb{R}^{d})})^{2}$ , and $C_{d}:=3K\Big{(}\frac{d+1}{e}\Big{)}^{d+1}$ is a constant only depending on $d$ . Let us choose $p:=\frac{1}{\log\lambda^{-1}}$ . Since $\lambda\leq e^{-2}$ , we have $p\leq\frac{1}{2}$ and $\lambda^{p}=e^{-1}$ , and thus (4.2) becomes

[TABLE]

We now consider the constant $K(p)$ in more detail. To this end, by using the Lipschitz constant $\lvert L_{\tau}\rvert_{1,M}=4M$ from Lemma 2 and the supremum bound $B=4M^{2}$ from Lemma 4, the value of $K(p)$ is, see [37, Theorem 7.23]:

[TABLE]

where the constants $C_{1}(p)$ and $C_{2}(p)$ are derived in the proof of [37, Theorem 7.16], that is

[TABLE]

and by [37, Lemma 7.15], we have

[TABLE]

Here we are interested in bounding $K(p)$ for $p\in(0,\frac{1}{2}]$ . For this, we first need to bound the constants $C_{1}(p)$ and $C_{2}(p)$ . We start with $C_{p}$ and obtain the following bound for $p\in(0,\frac{1}{2}]$ :

[TABLE]

where we used $\Big{(}\frac{1-p}{p}\Big{)}^{p}=\Big{(}\frac{1}{p}-1\Big{)}^{p}\leq e$ for all $p\in(0,\frac{1}{2}]$ , and Lemma 14. Now the bound for $C_{1}(p)$ is the following:

[TABLE]

Analogously, the bound for the constant $C_{2}(p)$ is:

[TABLE]

By plugging $C_{1}(p)$ and $C_{2}(p)$ into (4.2), we thus obtain

[TABLE]

and by plugging this result into (31), we obtain

[TABLE]

where $C$ is a constant independent of $p,\lambda,\gamma,n$ and $\varrho$ . ∎

Proof of Corollary 8. For all $n\geq 1$ , Theorem 7 yields

[TABLE]

with probability $P^{n}$ not less than $1-3e^{-\varrho}$ and a constant $c>0$ . Using the sequences $\lambda_{n}=c_{1}n^{-1}$ and $\gamma_{n}=c_{2}n^{-\frac{1}{2\alpha+d}}$ , we obtain

[TABLE]

where the positive constant $\tilde{C}:=C(c_{1}c_{2}^{-d}+c_{2}^{2\alpha}+c_{2}^{-d}+1)$ is independent of $p$ . ∎

Before we can prove Theorem 10, we need the following technical lemma.

Lemma 15.

Let $c\geq 3$ be a constant, $n\geq 3$ , and $\Lambda_{n}\subset(0,1]$ be a finite set such that there exists a $\lambda_{i}\in\Lambda_{n}$ with $\frac{1}{c}n^{-1}\leq\lambda_{i}\leq cn^{-1}$ . Moreover assume that $\delta_{n}\geq 0$ and $\Gamma_{n}\subset(0,1]$ is a finite $\delta_{n}$ -net of $(0,1]$ . Then for $d>0$ and $\alpha>0$ we have

[TABLE]

where $c$ is a constant independent of $n,\delta_{n},\Lambda_{n},\Gamma_{n}$ .

Proof.

Let us assume that $\Lambda_{n}=\{\lambda_{1},\ldots,\lambda_{r}\}$ and $\Gamma_{n}=\{\gamma_{1},\ldots,\gamma_{s}\}$ , and $\lambda_{i-1}<\lambda_{i}$ for all $i=2,\ldots,r$ and $\gamma_{j-1}<\gamma_{j}$ for all $j=2,\ldots,s$ . We thus obtain

[TABLE]

where $\tilde{c}:=c+(2\log c)^{d+1}$ . It is not hard to see that the function $\gamma\mapsto\gamma^{-d}n^{-1}+\gamma^{2\alpha}$ is minimized at $\gamma_{n}^{*}:=c_{1}n^{-\frac{1}{2\alpha+d}}$ , where $c_{1}>0$ is a constant depending only on $\alpha$ and $d$ . Furthermore, with $\gamma_{0}:=0$ , we see that $\gamma_{j}-\gamma_{j-1}\leq 2\delta_{n}$ for all $j=1,\ldots,s$ . In addition, there exists an index $j\in\{1,\ldots,s\}$ such that $\gamma_{j-1}\leq\gamma_{n}^{*}\leq\gamma_{j}$ . Consequently, we have $\gamma_{n}^{*}\leq\gamma_{j}\leq\gamma_{n}^{*}+2\delta_{n}$ . Using this result in (4.2), we obtain

[TABLE]

where $c:=\tilde{c}_{\alpha}(c_{1}^{-d}+c_{1}^{2\alpha})$ is a constant. ∎
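The minimizer used in this proof can be written down explicitly. Differentiating $\gamma\mapsto\gamma^{-d}n^{-1}+\gamma^{2\alpha}$ and setting the derivative to zero gives $\gamma^{2\alpha+d}=\frac{d}{2\alpha n}$ , that is, $c_{1}=\big(\frac{d}{2\alpha}\big)^{\frac{1}{2\alpha+d}}$ , and the minimal value equals $(c_{1}^{-d}+c_{1}^{2\alpha})\,n^{-\frac{2\alpha}{2\alpha+d}}$ . The following Python sketch checks this against a grid search; the function names and the grid are illustrative choices.

```python
import numpy as np

def objective(gamma, n, alpha, d):
    """gamma^{-d} / n + gamma^{2 alpha}."""
    return gamma ** (-d) / n + gamma ** (2.0 * alpha)

def gamma_star(n, alpha, d):
    """Closed-form minimizer c_1 * n^{-1/(2 alpha + d)} with c_1 = (d/(2 alpha))^{1/(2 alpha + d)}."""
    c1 = (d / (2.0 * alpha)) ** (1.0 / (2.0 * alpha + d))
    return c1 * n ** (-1.0 / (2.0 * alpha + d))

if __name__ == "__main__":
    n, alpha, d = 1000, 1.5, 2
    grid = np.logspace(-3, 0, 20001)
    numeric = grid[np.argmin(objective(grid, n, alpha, d))]
    print(gamma_star(n, alpha, d), numeric)
```

The closed form makes the balance between the two terms visible: both $\gamma^{-d}n^{-1}$ and $\gamma^{2\alpha}$ are of order $n^{-\frac{2\alpha}{2\alpha+d}}$ at the minimizer.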

Proof of Theorem 10. The proof of this theorem is a literal repetition of the proof of [17, Theorem 3.6]; however, we present it here for the sake of completeness. Let us define $m:=\lfloor\frac{n}{2}\rfloor+1\geq\frac{n}{2}$ , then for all $(\lambda,\gamma)\in\Lambda_{n}\times\Gamma_{n}$ , Theorem 7 yields

[TABLE]

with probability $P^{m}$ not less than $1-3|\Lambda_{n}\times\Gamma_{n}|\,e^{-\varrho}$ . Now, using $n-m\geq\frac{n}{2}-1\geq\frac{n}{4}$ and defining $\varrho_{n}:=\varrho+\ln(1+|\Lambda_{n}\times\Gamma_{n}|)$ , [37, Theorem 7.2] and Lemma 15 yield

[TABLE]

with probability $P^{n}$ not less than $1-3(1+\lvert\Lambda_{n}\times\Gamma_{n}\rvert)e^{-\varrho}$ . ∎

Proof of Theorem 11. By (17), we obtain

[TABLE]

This implies that

[TABLE]

This leads us to conclude with probability $P^{n}$ not less than $1-e^{-\hat{\varrho}}$ that the SVM for the ALS loss whose decision function is clipped at $M_{n}$ after training is actually a clipped regularized empirical risk minimization (CR-ERM) in the sense of [37, Definition 7.18]. Consequently, [37, Theorem 7.20] holds for $\hat{Y}:=\{-M_{n},M_{n}\}$ modulo a set of $P^{n}$ -probability not exceeding $e^{-\hat{\varrho}}$ . From Theorem 7, we then obtain

[TABLE]

with probability $P^{n}$ not less than $1-e^{-\bar{\varrho}}-e^{-\hat{\varrho}}$ . As in the proof of Corollary 8 and by using the inequality $(a+b)^{c}\leq(2ab)^{c}$ , for $a,b\geq 1$ and $c>0$ , we finally obtain

[TABLE]

for all $n\geq 3$ with probability $P^{n}$ not less than $1-e^{-\bar{\varrho}}-e^{-\hat{\varrho}}$ . Choosing $\bar{\varrho}=\hat{\varrho}$ leads to the assertion. ∎

Bibliography


[1] B. Abdous and B. Remillard. Relating quantiles and expectiles under weighted-symmetry. Ann. Inst. Statist. Math., 47:371–384, 1995. http://dx.doi.org/10.1007/bf00773468.
[2] R. A. Adams and J. J. F. Fournier. Sobolev Spaces. Academic Press, New York, 2nd edition, 2003. https://doi.org/10.1016/s0079-8169(03)x8001-0.
[3] Y. Aragon, S. Casanova, R. Chambers, and E. Leconte. Conditional ordering using nonparametric expectiles. J. Off. Stat., 21:617–633, 2005. http://www.jos.nu/Articles/abstract.asp?article=214617.
[4] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337–404, 1950.
[5] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity, 23:52–72, 2007. https://doi.org/10.1016/j.jco.2006.07.001.
[6] F. Bellini, B. Klar, A. Müller, and R. E. Gianin. Generalized quantiles as risk measures. Insurance Math. Econom., 54:41–48, 2014. http://dx.doi.org/10.1016/j.insmatheco.2013.10.015.
[7] G. Blanchard, O. Bousquet, and P. Massart. Statistical performance of support vector machines. Ann. Statist., pages 489–531, 2008. https://doi.org/10.1214/009053607000000839.
[8] J. Breckling and R. Chambers. M-quantiles. Biometrika, 75:761–771, 1988. http://dx.doi.org/10.2307/2336317.
