Learning Rates of Regression with q-norm Loss and Threshold

Ting Hu; Yuan Yao

arXiv:1701.01956·math.ST·January 10, 2017

TL;DR

This paper investigates robust regression methods using q-norm loss functions within reproducing kernel Hilbert spaces, providing theoretical error bounds and learning rates under noise conditions.

Contribution

It introduces variance-expectation bounds for q-norm loss regression and derives explicit learning rates based on kernel approximation assumptions.

Findings

1. Established variance-expectation bounds under noise conditions
2. Derived explicit learning rates for q-norm loss regression
3. Provided theoretical error bounds in RKHS setting

Abstract

This paper studies some robust regression problems associated with the $q$-norm loss ($q \geq 1$) and the $\epsilon$-insensitive $q$-norm loss in the reproducing kernel Hilbert space. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key technique to measure the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space.

Equations

$$\psi^{\epsilon}(u)=\begin{cases}|u|-\epsilon, & \text{if } |u|>\epsilon,\\ 0, & \text{if } |u|\leq\epsilon.\end{cases}$$

$${\cal E}(f)=\int_{Z}\psi_{q}(y-f(x))\,d\rho.$$

$$\psi^{\epsilon}_{q}(u)=\begin{cases}(|u|-\epsilon)^{q}, & \text{if } |u|>\epsilon,\\ 0, & \text{if } |u|\leq\epsilon.\end{cases}$$

$$f_{{\bf z}}^{\epsilon}=\arg\min_{f\in{\cal H}_{K}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\psi^{\epsilon}_{q}(f(x_{t})-y_{t})+\lambda\|f\|^{2}_{K}\Big\}.$$

$${\cal D}(\lambda)=\min_{f\in{\cal H}_{K}}\big\{{\cal E}(f)-{\cal E}(f_{q})+\lambda\|f\|_{K}^{2}\big\},\quad\lambda>0.$$

$$f_{\lambda}=\arg\min_{f\in{\cal H}_{K}}\big\{{\cal E}(f)-{\cal E}(f_{q})+\lambda\|f\|_{K}^{2}\big\}.$$

$${\cal D}(\lambda)\leq{\cal D}_{0}\lambda^{\beta},\quad\forall\lambda>0$$

$$\pi(f(x))=\begin{cases}1, & \text{if } f(x)\geq 1,\\ f(x), & \text{if } -1<f(x)<1,\\ -1, & \text{if } f(x)\leq-1.\end{cases}$$

$$\frac{d\rho_{x}}{dy}(y)=\begin{cases}A|y-f_{q}(x)|^{\varphi}, & \text{if } |y-f_{q}(x)|\leq\frac{1}{4},\\ 0, & \text{otherwise,}\end{cases}$$

$$\|\pi(f_{\bf z}^{\epsilon})-f_{q}\|_{L_{\rho_{X}}^{q+\varphi+1}}\leq C^{\prime}\log\frac{3}{\delta}\,T^{-\frac{1}{2(q+\varphi)}},$$

$$\rho_{x}(\{y:f_{q}(x)\leq y\leq f_{q}(x)+s\})\geq b(x)s^{w}$$

$$\rho_{x}(\{y:f_{q}(x)-s\leq y\leq f_{q}(x)\})\geq b(x)s^{w}.$$

$$\begin{aligned}\rho_{x}(\{y:f_{q}(x)\leq y\leq f_{q}(x)+s\})&=\frac{1}{\sqrt{2\pi}\sigma}\int_{f_{q}(x)}^{f_{q}(x)+s}\exp\Big\{-\frac{(y-u_{x})^{2}}{2\sigma^{2}}\Big\}\,dy\\ &=\frac{1}{\sqrt{2\pi}\sigma}\int_{0}^{s}\exp\Big\{-\frac{y^{2}}{2\sigma^{2}}\Big\}\,dy\geq\frac{1}{\sqrt{2\pi}\sigma}\int_{0}^{s}\exp\Big\{-\frac{s^{2}}{2\sigma^{2}}\Big\}\,dy\geq\frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}s.\end{aligned}$$

$$\log{\cal N}(B_{1},\varepsilon)\leq C_{k}\Big(\frac{1}{\varepsilon}\Big)^{k},\quad\forall\varepsilon>0.$$

$$\theta=\min\Big\{\frac{2}{q+w},\frac{p}{p+1}\Big\}\in(0,1],\qquad r=\frac{p(q+w)}{p+1}>0.$$

$$\|\pi(f^{\epsilon}_{\bf z})-f_{q}\|_{L^{r}_{\rho_{X}}}\leq C^{*}\Big(\log\frac{3}{\xi}\Big)^{2}\sqrt{\log\frac{3}{\delta}}\,T^{-\Lambda}$$

$$\Lambda=\frac{1}{q+w}\min\Big\{\eta,\alpha\beta,1-\frac{q(1-\beta)\alpha}{2},\frac{1}{2-\theta},\frac{1}{2+k-\theta}-\frac{k}{1+k}\vartheta\Big\}$$

$$\vartheta<\frac{1+k}{k(2+k-\theta)}.$$

$$\|f-f_{q}\|_{L^{r}_{\rho_{X}}}\leq C_{r}\big({\cal E}(f)-{\cal E}(f_{q})\big)^{\frac{1}{q+w}}$$

$$C_{q,x}(t)=\int_{Y}\psi_{q}(y-t)\,d\rho_{x}(y)=\int_{y>t}(y-t)^{q}\,d\rho_{x}(y)+\int_{y<t}(t-y)^{q}\,d\rho_{x}(y),\quad x\in X.$$

$$C_{q,x}^{\prime}(t_{x}^{*})=q\int_{y<t_{x}^{*}}(t_{x}^{*}-y)^{q-1}\,d\rho_{x}(y)-q\int_{y>t_{x}^{*}}(y-t_{x}^{*})^{q-1}\,d\rho_{x}(y)=0,$$

$$\int_{y<t_{x}^{*}}(t_{x}^{*}-y)^{q-1}\,d\rho_{x}(y)=\int_{y>t_{x}^{*}}(y-t_{x}^{*})^{q-1}\,d\rho_{x}(y)$$

$$\begin{aligned}C^{\prime}_{q,x}(s)&=q\Big(\int_{y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\Big(\int_{y<0}(s-y)^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &\geq q\Big(\int_{y<0}(-y)^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big).\end{aligned}$$

$$\begin{aligned}C^{\prime}_{q,x}(s)&\geq q\Big(\int_{y>0}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &\geq q\Big(\int_{y>0}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}y^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\Big(\int_{0<y\leq s}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\int_{0\leq y\leq s}\Big(y^{q-1}+(s-y)^{q-1}\Big)\,d\rho_{x}(y)\geq 2^{1-q}qs^{q-1}\rho_{x}(\{y:0\leq y\leq s\}).\end{aligned}$$

$$C_{q,x}(t)-C_{q,x}(0)\geq 2^{1-q}q\int_{0}^{t}s^{q-1}\rho_{x}(\{y:0\leq y\leq s\})\,ds.$$

$$C_{q,x}(t)-C_{q,x}(0)\geq 2^{1-q}q\int_{0}^{t}s^{q-1}b(x)s^{w}\,ds=\frac{2^{1-q}q}{q+w}b(x)t^{q+w}\geq\frac{2^{1-q}q}{q+w}b(x)a(x)^{w}t^{q+w}.$$

C_{q, x} (t) - C_{q, x} (0)

Taxonomy

Topics: Statistical Methods and Inference · Sparse and Compressive Sensing Techniques · Mathematical Approximation and Integration

Full text

Learning Rates of Regression with $q$-norm Loss and Threshold

Ting Hu

School of Mathematics and Statistics, Wuhan University

Luojia Hill, Wuhan 430072, China

Yuan Yao

School of Mathematical Sciences, Peking University

Beijing 100871, China

Abstract

This paper studies some robust regression problems associated with the $q$-norm loss ($q\geq 1$) and the $\epsilon$-insensitive $q$-norm loss in the reproducing kernel Hilbert space. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key technique to measure the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space.

Key Words and Phrases. Insensitive $q$-norm loss, quantile regression, reproducing kernel Hilbert space, sparsity.

Mathematics Subject Classification. 68Q32, 41A25

1 Introduction

In this paper we consider regression with the $q$ -norm loss $\psi_{q}$ with $q\geq 1$ and an $\epsilon$ -insensitive $q$ -norm loss $\psi_{q}^{\epsilon}$ (to be defined) with a threshold $\epsilon>0$ . Here $\psi_{q}$ is the univariate function defined by $\psi_{q}(u)=|u|^{q}$ . For a learning algorithm generated by a regularization scheme in reproducing kernel Hilbert spaces, learning rates and approximation error will be presented when $\epsilon$ is chosen appropriately for balancing learning rates and sparsity.

For $q=1$ , the regression problem is the classical statistical method of least absolute deviations which is more robust than the least squares method and is resistant to outliers in data [4]. Its associated loss $\psi(u)=|u|,u\in{\rm I\!R},$ is widely used in practical applications for robustness. In fact, for all $q<2,$ the loss $\psi_{q}$ is less sensitive to outliers and is thus more robust than the square loss. Vapnik [13] proposed an $\epsilon$ -insensitive loss $\psi^{\epsilon}(u):{\rm I\!R}\to{\rm I\!R}_{+}$ to get sparsity in support vector regressions, which is defined by

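$$\psi^{\epsilon}(u)=\begin{cases}|u|-\epsilon, & \text{if } |u|>\epsilon,\\ 0, & \text{if } |u|\leq\epsilon.\end{cases}$$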

When fixing $\epsilon>0,$ error analysis was conducted in [12]. Xiang, Hu and Zhou [17, 18] showed how to accelerate learning rates and preserve sparsity by adapting $\epsilon$. In [5], the convergence ability with a flexible $\epsilon$ was discussed for an online algorithm. For quantile regression with $\epsilon=0$ and a pinball loss having different slopes on the two sides of the origin in ${\rm I\!R}$ [6], Steinwart and Christmann [10, 9] established comparison theorems and derived learning rates under some noise conditions.

In this paper, we apply the $q$-norm loss $\psi_{q}$ with $q>1$ to improve the convexity of the insensitive loss $\psi$. Our results show how the insensitive parameter $\epsilon$, which produces the sparsity, can be chosen adaptively as a function of the sample size, $\epsilon=\epsilon(T)\rightarrow 0$ as $T\rightarrow\infty$, and how this choice affects the error rates of the learning algorithm (to be defined by (1.4)). These results include some earlier studies as special cases.

In the sequel, assume that the input space $X$ is a compact metric space and the output space $Y={\rm I\!R}$ . Let $\rho$ be a Borel probability measure on $Z:=X\times Y$ , $\rho_{x}(\cdot)$ be the conditional distribution of $\rho$ at each $x\in X$ and $\rho_{X}$ be the marginal distribution on $X$ . For a measurable function $f:X\rightarrow Y,$ the generalization error ${\cal E}(f)$ associated with the $q$ -norm loss $\psi_{q}$ , is defined by

$${\cal E}(f)=\int_{Z}\psi_{q}(y-f(x))\,d\rho.$$

Denote $f_{q}:X\rightarrow Y$ as the minimizer of the generalization error ${\cal E}(f)$ over all measurable functions. Its properties and the corresponding learning problem in the empirical risk minimization framework were discussed in [20]. When $q=1,$ the target function $f_{q}$ is a function containing the medians of the conditional distribution for all $x\in X$ . For symmetric distributions, the median is also the regression function, which is the conditional mean for given $X.$ We aim at learning the minimizer $f_{q}$ from a sample ${\bf z}=\{(x_{i},y_{i})\}_{i=1}^{T}\in Z^{T},$ which is assumed to be independently drawn according to $\rho$ . Inspired by the $\epsilon$ -insensitive loss [13], we introduce an $\epsilon$ -insensitive $q$ -norm loss $\psi^{\epsilon}_{q}$ which is defined by

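$$\psi^{\epsilon}_{q}(u)=\begin{cases}(|u|-\epsilon)^{q}, & \text{if } |u|>\epsilon,\\ 0, & \text{if } |u|\leq\epsilon.\end{cases}$$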
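The two losses can be written down directly. The following is a minimal sketch in Python with NumPy; the function names `psi_q` and `psi_q_eps` are illustrative choices and do not come from the paper.

```python
import numpy as np

def psi_q(u, q=2.0):
    """q-norm loss |u|^q for q >= 1."""
    return np.abs(u) ** q

def psi_q_eps(u, q=2.0, eps=0.1):
    """eps-insensitive q-norm loss: (|u| - eps)^q if |u| > eps, else 0."""
    return np.maximum(np.abs(u) - eps, 0.0) ** q

# Residuals inside the band [-eps, eps] incur no loss, which is the source of sparsity.
u = np.array([-0.5, -0.05, 0.0, 0.05, 0.5])
print(psi_q_eps(u, q=1.5, eps=0.1))
```

With `eps=0.0` the second function coincides with the first, and for any `eps >= 0` it never exceeds it.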

Our learning task will be carried out by a regularization scheme in reproducing kernel Hilbert spaces. With a continuous, symmetric and positive semidefinite function $K:X\times X\to{\rm I\!R}$ (called a Mercer kernel), the reproducing kernel Hilbert space (RKHS) ${\cal H}_{K}$ is defined as the completion of the span of $\{K_{x}=K(x,\cdot):x\in X\}$ with the inner product $\langle\cdot,\cdot\rangle_{K}$ satisfying $\langle K_{x},K_{u}\rangle_{K}=K(x,u).$ The regularization algorithm in the paper takes the form

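$$f_{{\bf z}}^{\epsilon}=\arg\min_{f\in{\cal H}_{K}}\Big\{\frac{1}{T}\sum_{t=1}^{T}\psi^{\epsilon}_{q}(f(x_{t})-y_{t})+\lambda\|f\|^{2}_{K}\Big\}. \tag{1.4}$$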

Here $\lambda>0$ is a regularization parameter. Our learning rates are stated in terms of approximation or regularization error, noise conditions, and the capacity of the RKHS. Our main goal is to study how the learned function $f_{{\bf z}}^{\epsilon}$ in (1.4) converges to the target function $f_{q}.$ There is a large literature [1, 16, 7] in learning theory for studying the approximation error or regularization error ${\cal D}(\lambda)$ of the triple $(K,\rho,q)$ defined by

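$${\cal D}(\lambda)=\min_{f\in{\cal H}_{K}}\big\{{\cal E}(f)-{\cal E}(f_{q})+\lambda\|f\|_{K}^{2}\big\},\quad\lambda>0.$$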

The regularization function is defined as

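$$f_{\lambda}=\arg\min_{f\in{\cal H}_{K}}\big\{{\cal E}(f)-{\cal E}(f_{q})+\lambda\|f\|_{K}^{2}\big\}. \tag{1.5}$$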
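For intuition, the empirical scheme (1.4) can be approximated numerically. The sketch below is not the authors' procedure: it assumes a Gaussian kernel, restricts $f$ to the span of $\{K_{x_t}\}$ (the usual representer-theorem form, so that $\|f\|_K^2=c^{\top}Kc$), and runs gradient descent with a backtracking line search, which is valid for $q>1$ where the loss is differentiable. All function names, the kernel width and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    # Gaussian (Mercer) kernel matrix between 1-D sample arrays a and b.
    d = a[:, None] - b[None, :]
    return np.exp(-d**2 / (2 * sigma**2))

def objective(c, K, y, q, eps, lam):
    # Empirical eps-insensitive q-norm risk plus lam * ||f||_K^2 for f = sum_t c_t K_{x_t}.
    r = K @ c - y
    return np.mean(np.maximum(np.abs(r) - eps, 0.0) ** q) + lam * (c @ K @ c)

def gradient(c, K, y, q, eps, lam):
    # d/du psi_q^eps(u) = q * (|u| - eps)_+^{q-1} * sign(u), valid for q > 1.
    r = K @ c - y
    dpsi = q * np.maximum(np.abs(r) - eps, 0.0) ** (q - 1) * np.sign(r)
    return K @ dpsi / len(y) + 2 * lam * (K @ c)

def fit(x, y, q=1.5, eps=0.05, lam=1e-3, steps=200):
    K = gaussian_kernel(x, x)
    c = np.zeros(len(y))
    for _ in range(steps):
        g = gradient(c, K, y, q, eps, lam)
        f0 = objective(c, K, y, q, eps, lam)
        step = 1.0
        # Backtracking (Armijo) line search: accept only steps that decrease the objective.
        while step > 1e-12:
            cand = c - step * g
            if objective(cand, K, y, q, eps, lam) <= f0 - 0.5 * step * (g @ g):
                c = cand
                break
            step *= 0.5
    return c, K

# Synthetic data with outputs roughly inside [-1/2, 1/2] (not enforced).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = 0.25 * np.sin(np.pi * x) + 0.05 * rng.standard_normal(40)
c, K = fit(x, y)
f_hat = np.clip(K @ c, -1.0, 1.0)  # predictions at the sample points, clipped to [-1, 1]
```

The coefficient vector `c` plays the role of $f_{\bf z}^{\epsilon}$ in this finite-dimensional parametrization; a convex solver could replace the hand-written descent loop.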

In the sequel, let $L_{\rho_{X}}^{p}$ with $p>0$ be the space of $p$-integrable functions with respect to $\rho_{X}$ and $\|\cdot\|_{L_{\rho_{X}}^{p}}$ be the norm in $L_{\rho_{X}}^{p}$. A usual assumption on the regularization error ${\cal D}(\lambda)$, which imposes certain smoothness on ${\cal H}_{K}$, is

$${\cal D}(\lambda)\leq{\cal D}_{0}\lambda^{\beta},\quad\forall\lambda>0 \tag{1.6}$$

with some $0<\beta\leq 1$ and ${\cal D}_{0}>0$ .

Remark 1.

Assumption (1.6) always holds with $\beta=0$ . When the target function $f_{q}\in{\cal H}_{K}$ and ${\cal H}_{K}$ is dense in $C(X)$ which consists of bounded continuous functions on $X$ , the approximation error ${\cal D}(\lambda)\rightarrow 0$ as $\lambda\rightarrow 0.$ Thus, the decay (1.6) is natural and can be illustrated in terms of interpolation spaces [7]. Define the integral operator $L_{K}:L^{2}_{\rho_{X}}\rightarrow L^{2}_{\rho_{X}}$ by $L_{K}(f)(x)=\int_{X}K(x,y)f(y)d\rho_{X},x\in X,f\in L^{2}_{\rho_{X}}$ and suppose that the minimizer $f_{q}$ is in the range of $L_{K}^{\nu}$ with $0<\nu\leq\frac{1}{2}$ . When $q=1,$ the approximation error ${\cal D}(\lambda)$ can be $O(\lambda^{\frac{\nu}{1-\nu}})$ for quantile regression [18]. When $q=2$ , ${\cal D}(\lambda)=O(\lambda^{2\nu})$ for the least square. For other $q>1$ , the associated loss $\psi_{q}$ is Lipschitz in a bounded domain and the corresponding ${\cal D}(\lambda)$ can be characterized by the ${\cal K}$ -functional [1], which can have the same polynomial decay as (1.6).

We assume that the conditional distribution $\rho_{x}(\cdot)$ is supported on $[-M,M]$, $M>0$, at each $x$ and is non-degenerate, i.e. any non-empty open set of $Y$ has strictly positive measure, which ensures that the target function $f_{q}$ is unique. Without loss of generality, let the support of $\rho_{x}(\cdot)$ be $[-\frac{1}{2},\frac{1}{2}]$ at each $x\in X$; our analysis below is applicable for any $M>0.$ The uniqueness will be proved in the next section. It is natural to project values of the learned function $f_{{\bf z}}^{\epsilon}$ onto some interval by the projection operator [1, 15].

Definition 1.

The projection operator $\pi$ on the space of measurable functions $f:X\rightarrow{\rm I\!R}$ onto the interval $[-1,1]$ is defined by

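$$\pi(f(x))=\begin{cases}1, & \text{if } f(x)\geq 1,\\ f(x), & \text{if } -1<f(x)<1,\\ -1, & \text{if } f(x)\leq-1.\end{cases}$$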
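On a vector of predicted values, the projection of Definition 1 is simply a clip to $[-1,1]$. A minimal sketch, with the function name `project` chosen here for illustration:

```python
import numpy as np

def project(values):
    """Projection pi of Definition 1: clip values to the interval [-1, 1]."""
    return np.clip(values, -1.0, 1.0)

print(project(np.array([-2.0, -0.3, 0.0, 0.7, 1.5])))
```

Values already inside $[-1,1]$ are left unchanged, so the projection can only move a prediction closer to a target that takes values in $[-\frac{1}{2},\frac{1}{2}]$.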

To demonstrate our main result in the general case, we shall give the following learning rate in the special case when $K$ is $C^{\infty}$ .

Theorem 1.

Let $X\subset{\rm I\!R}^{n}$ and $K\in C^{\infty}(X\times X)$ . Assume that $f_{q}\in{\cal H}_{K}$ with $q>1$ , $\|f_{q}\|_{\infty}\leq\frac{1}{4}$ and the conditional distributions $\{\rho_{x}(\cdot)\}_{x\in X}$ have density functions given by

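$$\frac{d\rho_{x}}{dy}(y)=\begin{cases}A|y-f_{q}(x)|^{\varphi}, & \text{if } |y-f_{q}(x)|\leq\frac{1}{4},\\ 0, & \text{otherwise,}\end{cases}$$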

where $A=2^{2\varphi+1}(\varphi+1),\ \varphi>0$ . Take $\lambda=\epsilon=T^{-\frac{q+\varphi+1}{2(q+\varphi)}}$ , then for any $0<\delta<1,$ with confidence $1-\delta,$ we have

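$$\|\pi(f_{\bf z}^{\epsilon})-f_{q}\|_{L_{\rho_{X}}^{q+\varphi+1}}\leq C^{\prime}\log\frac{3}{\delta}\,T^{-\frac{1}{2(q+\varphi)}},$$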

where $C^{\prime}$ is a constant independent of $T$ or $\delta.$
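A short check of the constant in Theorem 1, assuming the density has the form $A|y-f_{q}(x)|^{\varphi}$ on $|y-f_{q}(x)|\leq\frac{1}{4}$ with $A=2^{2\varphi+1}(\varphi+1)$: the density integrates to one, and the mass of a one-sided interval of length $s$ at $f_{q}(x)$ grows like $s^{\varphi+1}$. This computation is a worked verification and is not taken from the paper.

```latex
\[
\int_{|y-f_q(x)|\le 1/4} A\,|y-f_q(x)|^{\varphi}\,dy
  = 2A\int_0^{1/4} u^{\varphi}\,du
  = \frac{2A}{\varphi+1}\,4^{-(\varphi+1)}
  = 2\cdot 2^{2\varphi+1}\cdot 2^{-2\varphi-2} = 1,
\]
\[
\rho_x\big(\{y : f_q(x)\le y\le f_q(x)+s\}\big)
  = \frac{A}{\varphi+1}\,s^{\varphi+1}
  = 2^{2\varphi+1}\,s^{\varphi+1},
  \qquad 0<s\le\tfrac14 .
\]
```

Since $\|f_{q}\|_{\infty}\leq\frac{1}{4}$, this density is supported in $[-\frac{1}{2},\frac{1}{2}]$, and the exponent $\varphi+1$ matches the index $q+\varphi+1$ of the norm in the bound.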

To state our main result in the general case, we need a noise condition on the measure $\rho$ introduced in [9, 10].

Definition 2.

Let $0<p\leq\infty$ and $w>0$ . We say that $\rho$ has a $p$ -average type $w$ if there exist two functions $b$ and $a$ from $X$ to ${\rm I\!R}$ such that $\{ba^{w}\}^{-1}\in L^{p}_{\rho_{X}}$ and for any $x\in X$ and $s\in(0,a(x)]$ , there holds

$$\rho_{x}(\{y:f_{q}(x)\leq y\leq f_{q}(x)+s\})\geq b(x)s^{w}$$

and

$$\rho_{x}(\{y:f_{q}(x)-s\leq y\leq f_{q}(x)\})\geq b(x)s^{w}.$$

This assumption is satisfied by many common conditional distributions, such as Gaussian, Student's $t$ and uniform distributions. In the following, we give an example to illustrate Definition 2 in detail. More examples can be found in [9, 10].

Example 1.

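We assume that the conditional distributions $\{\rho_{x}(\cdot)\}_{x\in X}$ are Gaussian distributions with a common variance $\sigma^{2}$, $\sigma>0$, i.e. $\frac{d\rho_{x}}{dy}(y)=\frac{1}{\sqrt{2\pi}\sigma}\exp\{-\frac{(y-u_{x})^{2}}{2\sigma^{2}}\}$, where $\{u_{x}\}_{x\in X}$ are the expectations of the Gaussian distributions $\{\rho_{x}(\cdot)\}_{x\in X}$. It is not difficult to check that the minimizer $f_{q}(x)$ takes the value $u_{x}$ at each $x\in X$. Then for any $s\in(0,\sigma]$, there holds

$$\begin{aligned}\rho_{x}(\{y:f_{q}(x)\leq y\leq f_{q}(x)+s\})&=\frac{1}{\sqrt{2\pi}\sigma}\int_{f_{q}(x)}^{f_{q}(x)+s}\exp\Big\{-\frac{(y-u_{x})^{2}}{2\sigma^{2}}\Big\}\,dy\\ &=\frac{1}{\sqrt{2\pi}\sigma}\int_{0}^{s}\exp\Big\{-\frac{y^{2}}{2\sigma^{2}}\Big\}\,dy\geq\frac{1}{\sqrt{2\pi}\sigma}\int_{0}^{s}\exp\Big\{-\frac{s^{2}}{2\sigma^{2}}\Big\}\,dy\geq\frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}s.\end{aligned}$$

Similarly, we also have $\rho_{x}(\{y:f_{q}(x)-s\leq y\leq f_{q}(x)\})\geq\frac{e^{-\frac{1}{2}}}{\sqrt{2\pi}\sigma}s$. Thus, the measure $\rho$ has an $\infty$-average type $1$.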
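The lower bound in Example 1 can be checked numerically. The sketch below uses the closed form of the Gaussian interval mass via the error function; the value `sigma = 0.3` and the grid of `s` values are arbitrary choices for illustration.

```python
import math

def gaussian_interval_mass(s, sigma):
    # rho_x({y : u_x <= y <= u_x + s}) for N(u_x, sigma^2), equal to erf(s / (sigma*sqrt(2))) / 2.
    return 0.5 * math.erf(s / (sigma * math.sqrt(2.0)))

def lower_bound(s, sigma):
    # Right-hand side of the example: e^{-1/2} / (sqrt(2*pi) * sigma) * s.
    return math.exp(-0.5) / (math.sqrt(2.0 * math.pi) * sigma) * s

sigma = 0.3
# Check the inequality on a grid of s in (0, sigma].
checks = [
    gaussian_interval_mass(k * sigma / 100, sigma) >= lower_bound(k * sigma / 100, sigma)
    for k in range(1, 101)
]
print(all(checks))
```

The bound corresponds to $b(x)=\frac{e^{-1/2}}{\sqrt{2\pi}\sigma}$, $a(x)=\sigma$ and $w=1$ in Definition 2.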

Our error analysis is related to the capacity of the hypothesis space ${\cal H}_{K}$ which is measured by covering numbers.

Definition 3.

For a subset $S$ of $C(X)$ and $\varepsilon>0$ , the covering number ${\cal N}(S,\varepsilon)$ is the minimal integer $l\in{\rm I\!N}$ such that there exist $l$ disks with radius $\varepsilon$ covering $S$ .

The covering numbers of balls $B_{R}=\{f\in{\cal H}_{K}:\|f\|_{K}\leq R\}$ with $R>0$ of the RKHS have been well understood in the learning theory [22, 23]. In this paper, we assume for some $k>0$ and $C_{k}>0$ that

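$$\log{\cal N}(B_{1},\varepsilon)\leq C_{k}\Big(\frac{1}{\varepsilon}\Big)^{k},\quad\forall\varepsilon>0. \tag{1.9}$$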

Remark 2.

When $X$ is a bounded subset of ${\rm I\!R}^{n}$ and the RKHS ${\cal H}_{K}$ is a Sobolev space $H^{m}(X)$ with index $m$ , it is shown [22] that the condition (1.9) holds true with $k=\frac{2n}{m}$ . If the kernel $K$ lies in the smooth space $C^{\infty}(X\times X),$ then (1.9) is satisfied for an arbitrarily small $k>0.$ Another common way to measure the capacity of ${\cal H}_{K}$ is the empirical covering number [21], which is out of scope of our discussion in this paper.

Denote

$$\theta=\min\Big\{\frac{2}{q+w},\frac{p}{p+1}\Big\}\in(0,1],\qquad r=\frac{p(q+w)}{p+1}>0. \tag{1.10}$$

The following learning rates in the general case will be proved in Section 4. We point out that the proof of Theorem 2 is only applicable to the case $q>1$. However, the case $q=1$ is a special case of quantile regression, and the same learning rates as those of Theorem 2 can be found in [17, 18].

Theorem 2.

Suppose that $\rho$ has a $p$-average type $w$ for some $0<p\leq\infty$ and $w>0$. Assume that the regularization error condition (1.6) is satisfied for some $0<\beta\leq 1$ and (1.9) holds with $k>0$. Take $\lambda=T^{-\alpha},\epsilon=T^{-\eta}$ with $0<\alpha\leq 1$, $0<\eta\leq\infty.$ Let $\xi>0.$ Then for any $0<\delta<1,$ with confidence $1-\delta,$ there holds

$$\|\pi(f^{\epsilon}_{\bf z})-f_{q}\|_{L^{r}_{\rho_{X}}}\leq C^{*}\Big(\log\frac{3}{\xi}\Big)^{2}\sqrt{\log\frac{3}{\delta}}\,T^{-\Lambda} \tag{1.11}$$

where $C^{*}$ is a constant independent of $T$ or $\delta$ ,

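$$\Lambda=\frac{1}{q+w}\min\Big\{\eta,\alpha\beta,1-\frac{q(1-\beta)\alpha}{2},\frac{1}{2-\theta},\frac{1}{2+k-\theta}-\frac{k}{1+k}\vartheta\Big\}$$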

with $\vartheta=\max\left\{\frac{\alpha-\eta}{2},\frac{\alpha(1-\beta)}{2},\frac{\alpha}{2}+\frac{q(1-\beta)\alpha}{4}-\frac{1}{2},\frac{\alpha}{2}-\frac{1}{2(2-\theta)},\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}+\xi\right\}\geq 0$ provided that

$$\vartheta<\frac{1+k}{k(2+k-\theta)}.$$

Corollary 1.

Let $X\subset{\rm I\!R}^{n}$ , $K\in C^{\infty}(X\times X)$ . Assume (1.6) and (1.8). Take $\lambda=T^{-1}$ , $\epsilon=T^{-\eta}$ with $0<\eta\leq\infty$ . If $1<q\leq 2$ , then the index $\Lambda$ for the learning rate (1.11) is $\frac{1}{q+w}\min\{\eta,\beta,\frac{1}{2-\theta}\}$ .
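The index in Corollary 1 is easy to evaluate for given parameters. The following sketch computes $\theta$ and $r$ from (1.10) and the index $\frac{1}{q+w}\min\{\eta,\beta,\frac{1}{2-\theta}\}$; the function names are illustrative, and $p=\infty$ is handled by taking $\frac{p}{p+1}=1$.

```python
import math

def theta_and_r(q, w, p=math.inf):
    # theta = min{2/(q+w), p/(p+1)} and r = p(q+w)/(p+1); p = inf gives p/(p+1) = 1.
    ratio = 1.0 if math.isinf(p) else p / (p + 1.0)
    return min(2.0 / (q + w), ratio), ratio * (q + w)

def corollary_rate_index(q, w, beta, eta, p=math.inf):
    # Index (1/(q+w)) * min{eta, beta, 1/(2 - theta)}, stated for 1 < q <= 2 and lambda = T^{-1}.
    if not 1.0 < q <= 2.0:
        raise ValueError("Corollary 1 is stated for 1 < q <= 2")
    theta, _ = theta_and_r(q, w, p)
    return min(eta, beta, 1.0 / (2.0 - theta)) / (q + w)

# Least squares case q = 2 with p = inf, beta = 1, eta = inf and w = 0.5.
print(corollary_rate_index(2.0, 0.5, 1.0, math.inf))
```

For $q=2$, $p=\infty$, $\beta=1$ and $\eta=\infty$ the function returns $\frac{1}{2(1+w)}$, which agrees with the least squares rate quoted in Remark 3.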

Remark 3.

When $\eta=\infty$, the corresponding threshold $\epsilon$ is $0$ and the problem for $q=2$ is the least squares problem, which is widely discussed in [15, 16]. If $\rho$ has an $\infty$-average type $w$ with $w>0$ and $f_{q}\in{\cal H}_{K}$, the learning rate is $\|\pi(f^{\epsilon}_{\bf z})-f_{q}\|_{L^{2+w}_{\rho_{X}}}=O(T^{-\frac{1}{2(1+w)}})$ for least squares. It follows that the error $\|\pi(f^{\epsilon}_{\bf z})-f_{q}\|^{2}_{L^{2}_{\rho_{X}}}=O(T^{-\frac{1}{1+w}})$ by $\|\cdot\|_{L^{2}_{\rho_{X}}}\leq\|\cdot\|_{L^{2+w}_{\rho_{X}}}$. Thus, the rate can be close to the optimal rate $O(T^{-1})$ in the $L^{2}_{\rho_{X}}$ space if $w$ is small enough.

When $1<q<2,$ the learning error is $O(T^{-\frac{1}{q+w}\min\{\beta,\frac{q+w}{2(q+w-1)},\frac{p+1}{p+2}\}})$ with the choice $\eta\geq\beta$, depending only on the approximation ability (1.6) of ${\cal H}_{K}$ and the noise condition (1.8). In particular, when $q$ tends to 1, the problem becomes quantile regression [17, 18], and the best rate in this paper is $O(T^{-\frac{1}{1+w}})$ if $\rho$ has an $\infty$-average type $w$ with $0<w\leq 1$ and $f_{q}\in{\cal H}_{K}$.

2 Comparison and Perturbation Theorem

Approximation or learning ability of a regularized algorithm for regression problems can usually be studied by estimating the excess generalization error ${\cal E}(f)-{\cal E}(f_{q})$ for the learned function $f_{{\bf z}}^{\epsilon}$ from the algorithm (1.4). However the following comparison theorem would yield bounds for the error $\|f-f_{q}\|_{L^{r}_{\rho_{X}}}$ in the space $L^{r}_{\rho_{X}}$ when the noise condition is satisfied.

Theorem 3.

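If $\rho$ has a $p$-average type $w$, then for any measurable function $f:X\rightarrow[-1,1]$ we have the inequality

$$\|f-f_{q}\|_{L^{r}_{\rho_{X}}}\leq C_{r}\big({\cal E}(f)-{\cal E}(f_{q})\big)^{\frac{1}{q+w}} \tag{2.1}$$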

where the constant $C_{r}=2^{\frac{q-1}{q+w}}q^{-\frac{1}{q+w}}(q+w)^{\frac{1}{q+w}}\|(ba^{w})^{-1}\|_{L^{p}_{\rho_{X}}}^{\frac{1}{q+w}}.$

Proof.

For a measurable function $f:X\rightarrow[-1,1]$ , the generalization error ${\cal E}(f)$ is rewritten as ${\cal E}(f)=\int_{X}C_{q,x}(f(x))d\rho_{X}$ where

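$$C_{q,x}(t)=\int_{Y}\psi_{q}(y-t)\,d\rho_{x}(y)=\int_{y>t}(y-t)^{q}\,d\rho_{x}(y)+\int_{y<t}(t-y)^{q}\,d\rho_{x}(y),\quad x\in X.$$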

Denote $t^{*}_{x}=\arg\min_{t\in{\rm I\!R}}C_{q,x}(t).$ It is obvious that the minimizer $f_{q}(x)$ of ${\cal E}(f)$ takes the value $t^{*}_{x}$ for each $x\in X$. Since the conditional distribution $\rho_{x}(\cdot)$ is supported on $[-\frac{1}{2},\frac{1}{2}]$, the minimizer $t^{*}_{x}$ can be taken in $[-\frac{1}{2},\frac{1}{2}]$. Consider the case $q>1.$ Since the loss function $\psi_{q}$ is differentiable and $|\frac{d\psi_{q}(y-t)}{dt}|\leq q|y-t|^{q-1}\leq q$ for all $y,t\in[-\frac{1}{2},\frac{1}{2}]$, by the Lebesgue dominated convergence theorem we can exchange the order of integration and differentiation, so that $C^{\prime}_{q,x}(t)=\frac{d}{dt}\int_{Y}\psi_{q}(y-t)d\rho_{x}(y)=\int_{Y}\frac{d\psi_{q}(y-t)}{dt}d\rho_{x}(y).$ Together with the fact that $C^{\prime}_{q,x}(t^{*}_{x})=0$ for all $x\in X,$ this gives

$$C_{q,x}^{\prime}(t_{x}^{*})=q\int_{y<t_{x}^{*}}(t_{x}^{*}-y)^{q-1}\,d\rho_{x}(y)-q\int_{y>t_{x}^{*}}(y-t_{x}^{*})^{q-1}\,d\rho_{x}(y)=0,$$

which means that

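$$\int_{y<t_{x}^{*}}(t_{x}^{*}-y)^{q-1}\,d\rho_{x}(y)=\int_{y>t_{x}^{*}}(y-t_{x}^{*})^{q-1}\,d\rho_{x}(y) \tag{2.2}$$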

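Let $t^{*}_{x}=0$ for simplicity. Then we have $C_{q,x}(t)-C_{q,x}(0)=\int_{0}^{t}C^{\prime}_{q,x}(s)ds$ for all $t>0$. Note that for $s>0$,

$$\begin{aligned}C^{\prime}_{q,x}(s)&=q\Big(\int_{y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\Big(\int_{y<0}(s-y)^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &\geq q\Big(\int_{y<0}(-y)^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big).\end{aligned}$$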

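Applying (2.2) with $t^{*}_{x}=0$ to the first term above, we obtain

$$\begin{aligned}C^{\prime}_{q,x}(s)&\geq q\Big(\int_{y>0}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}(y-s)^{q-1}\,d\rho_{x}(y)\Big)\\ &\geq q\Big(\int_{y>0}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)-\int_{y>s}y^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\Big(\int_{0<y\leq s}y^{q-1}\,d\rho_{x}(y)+\int_{0\leq y<s}(s-y)^{q-1}\,d\rho_{x}(y)\Big)\\ &=q\int_{0\leq y\leq s}\Big(y^{q-1}+(s-y)^{q-1}\Big)\,d\rho_{x}(y)\geq 2^{1-q}qs^{q-1}\rho_{x}(\{y:0\leq y\leq s\}).\end{aligned}$$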
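The lower bound $C^{\prime}_{q,x}(s)\geq 2^{1-q}qs^{q-1}\rho_{x}(\{y:0\leq y\leq s\})$ rests on an elementary pointwise inequality for the integrand $y^{q-1}+(s-y)^{q-1}$ on $0\leq y\leq s$. One way to see it, given here as a worked step that is not spelled out in the text, uses only that $u\mapsto u^{q-1}$ is nondecreasing on $[0,\infty)$ for $q\geq 1$:

```latex
\[
\max\{y,\,s-y\}\ \ge\ \frac{s}{2}
\quad\Longrightarrow\quad
y^{q-1}+(s-y)^{q-1}\ \ge\ \big(\max\{y,\,s-y\}\big)^{q-1}
\ \ge\ \Big(\frac{s}{2}\Big)^{q-1} = 2^{1-q}s^{q-1},
\qquad 0\le y\le s .
\]
```

Integrating this bound over $\{y:0\leq y\leq s\}$ with respect to $\rho_{x}$ and multiplying by $q$ gives the stated estimate.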

Thus,

$$C_{q,x}(t)-C_{q,x}(0)\geq 2^{1-q}q\int_{0}^{t}s^{q-1}\rho_{x}(\{y:0\leq y\leq s\})\,ds.$$

Let us consider the first case $t\in[0,a(x)].$ Noting the noise condition (1.8) and $a(x)\leq 1$ , we obtain that

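$$C_{q,x}(t)-C_{q,x}(0)\geq 2^{1-q}q\int_{0}^{t}s^{q-1}b(x)s^{w}\,ds=\frac{2^{1-q}q}{q+w}b(x)t^{q+w}\geq\frac{2^{1-q}q}{q+w}b(x)a(x)^{w}t^{q+w}.$$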

For the second case $t\in[a(x),1],$ we have

[TABLE]

In general, we can see that for any $0<t\leq 1$ ,

[TABLE]

By similarity, if $-1\leq t<0$ , we also have

[TABLE]

Applying the two above inequalities (2.3) and (2.4) with $t=f(x)$ and $t^{*}_{x}=f_{q}(x),$ we have that

[TABLE]

By $\frac{p}{p+1}$ power and integration,

[TABLE]

Combining this with Hölder's inequality $\|\cdot\|_{L^{1}_{\rho_{X}}}\leq\|\cdot\|_{L^{p^{*}}_{\rho_{X}}}\|\cdot\|_{L^{q^{*}}_{\rho_{X}}},\frac{1}{p^{*}}+\frac{1}{q^{*}}=1$, we obtain for $p^{*}=p+1$ and $q^{*}=\frac{p+1}{p}$ that

[TABLE]

Then the desired conclusion (2.1) holds. For $q=1,$ (2.1) also holds and the proof can be found in [18]. ∎

Theorem 3 yields a variance-expectation bound, which will be applied in the next section.

Lemma 1.

Under the same conditions as Theorem 3, for any measurable function $f:X\rightarrow[-1,1]$ , we have the inequality

[TABLE]

where the power index $\theta$ is defined as (1.10) and $C_{\theta}=C_{r}^{2}+2^{2-r}(1+\|f_{q}\|_{\infty}^{2-r})C_{r}^{r}$ .

Proof.

By the continuity of $\psi_{q}(u)$ and $|y|\leq\frac{1}{2}$ , we see that

[TABLE]

It implies that

[TABLE]

If $r>2,$ then

[TABLE]

Else,

[TABLE]

Combining the above two cases, we can get the conclusion (2.5). ∎

The threshold $\epsilon$ changes with the sample size $\epsilon=\epsilon(T)$ and plays a crucial role in the design of algorithm (1.4). By Taylor expansion, we have the following relation

[TABLE]

When the threshold $\epsilon\rightarrow 0,$ the $\epsilon$ -insensitive $q$ -norm loss $\psi_{q}^{\epsilon}$ converges to the $q$ -norm function $\psi_{q}$ almost surely. In the following, we shall study the approximation of the target function $f_{q}$ by $f_{q}^{\epsilon}$ which is the minimizer of the $\epsilon$ -generalization error ${\cal E}^{\epsilon}(f)=\int_{Z}\psi^{\epsilon}_{q}(f(x)-y)d\rho$ for $\epsilon>0$ . Denote

[TABLE]

and $t^{\epsilon}_{x}$ is the minimizer of $C_{q,x}^{\epsilon}(t)$ . By the same proof procedure as (2.2) in Theorem 3, we also get

[TABLE]

and $f_{q}^{\epsilon}$ takes the value of $t^{\epsilon}_{x}$ at each $x\in X$ . Then the perturbation properties hold. We use some ideas from [3] in the proof.

Proposition 1.

For $\epsilon>0,$ then

[TABLE]

For any measurable function $f$ on $X,$ we have

[TABLE]

Proof.

Suppose that there exists an $x\in X$ satisfying $f_{q}^{\epsilon}(x)-f_{q}(x)>\epsilon$. Consider the case $q>1.$ Together with the fact (2.2) and $t_{x}^{*}=f_{q}(x)$, we note that

[TABLE]

It is obvious that $f_{q}^{\epsilon}(x)+\epsilon>f_{q}(x)$ by the hypothesis that $f_{q}^{\epsilon}(x)-f_{q}(x)>\epsilon$ for any $\epsilon>0$ . By (2) with $t_{x}^{\epsilon}=f_{q}^{\epsilon}(x)$ , we also get

[TABLE]

Combining (2) with (2), we know that

[TABLE]

The above equalities hold if and only if $\rho_{x}(\{y:y>f_{q}(x)\})=0$ and $\rho_{x}(\{y:y<f_{q}^{\epsilon}(x)-\epsilon\})=0$ at the same time. Immediately, we see that $\rho_{x}(\{y:y\leq f_{q}(x)\})=1-\rho_{x}(\{y:y>f_{q}(x)\})=1$ . By the hypothesis $f_{q}^{\epsilon}(x)-f_{q}(x)>\epsilon$ , it follows that

[TABLE]

This is a contradiction. Similarly, the case $f_{q}^{\epsilon}(x)-f_{q}(x)<-\epsilon$ cannot occur for any $x\in X$. Then the desired conclusion (2.9) holds. By the relation (2.6) and $|y|\leq\frac{1}{2}$, we can see that

[TABLE]

Then the desired conclusion (2.10) holds. ∎

Recall that the conditional distribution $\rho_{x}(\cdot)$ is non-degenerate for each $x\in X$; the uniqueness of the minimizer $f_{q}^{\epsilon}$ is stated as follows. For simplicity, when $\epsilon=0$ we write $f_{q}^{\epsilon}$ for the target function $f_{q}$ and ${\cal E}^{\epsilon}(f)$ for the generalization error ${\cal E}(f)$ with the $q$-norm loss $\psi_{q}$ in the next proposition.

Proposition 2.

For $0\leq\epsilon\leq\frac{1}{2},$ the function $f_{q}^{\epsilon}$ is the unique minimizer of the $\epsilon$ -generalization error ${\cal E}^{\epsilon}(f)$ .

Proof.

Suppose that $f_{q}^{\epsilon}$ is not the unique minimizer. For some $x\in X,$ there exists $t_{1}(x)<t_{2}(x)$ such that they are both the minimizers of $C_{q,x}^{\epsilon}(t)$ by (2.7) and satisfy the equality (2) with $t^{\epsilon}_{x}=t_{1}(x)$ or $t^{\epsilon}_{x}=t_{2}(x)$ . Applying (2) with $t^{\epsilon}_{x}=t_{1}(x)$ and $t_{1}(x)<t_{2}(x)$ , it follows that

[TABLE]

Applying (2) with $t^{\epsilon}_{x}=t_{2}(x)$ again, we see that the first term of the above inequality $\int_{y<t_{2}(x)-\epsilon}(t_{2}(x)-\epsilon-y)^{q-1}d\rho_{x}(y)$ is equal to the last term $\int_{y>t_{2}(x)+\epsilon}(y-t_{2}(x)-\epsilon)^{q-1}d\rho_{x}(y)$ . This implies

[TABLE]

The above equalities hold if and only if $\rho_{x}(\{y:y<t_{2}(x)-\epsilon\})=0$ and $\rho_{x}(\{y:y>t_{1}(x)+\epsilon\})=0$ simultaneously. Since $\rho_{x}(\cdot)$ is non-degenerate and supported on $[-\frac{1}{2},\frac{1}{2}]$, the values of $t_{1}(x)$ and $t_{2}(x)$ must satisfy $t_{2}(x)-\epsilon\leq-\frac{1}{2}$ and $t_{1}(x)+\epsilon\geq\frac{1}{2}$. By the hypothesis $t_{1}(x)<t_{2}(x)$, we get $\epsilon>\frac{1}{2}$. This contradicts $0\leq\epsilon\leq\frac{1}{2}$. The proof is completed. ∎

3 Error Decomposition and Sample Error

Now we can conduct an error decomposition.

Lemma 2.

Define $f_{\lambda}$ by (1.5). Let $0\leq\epsilon\leq\frac{1}{2},$ then

[TABLE]

where

[TABLE]

Proof.

By the same procedure in [11, 14, 15, 16], ${\cal E}(\pi(f_{\bf z}^{\epsilon}))-{\cal E}(f_{q})+\lambda\|f_{\bf z}^{\epsilon}\|_{K}^{2}$ can be expressed as

[TABLE]

The relation (2.6) yields

[TABLE]

and

[TABLE]

The restriction $0\leq\epsilon\leq\frac{1}{2}$ implies ${\cal E}^{\epsilon}_{\bf z}(\pi(f_{\bf z}^{\epsilon}))\leq{\cal E}^{\epsilon}_{\bf z}(f_{\bf z}^{\epsilon})$ . By (3) and (3.5), then we have

[TABLE]

Since $[{\cal E}_{\bf z}^{\epsilon}(f_{\bf z}^{\epsilon})+\lambda\|f_{\bf z}^{\epsilon}\|_{K}^{2}]-[{\cal E}_{\bf z}^{\epsilon}(f_{\lambda})+\lambda\|f_{\lambda}\|_{K}^{2}]\leq 0,$ we have

[TABLE]

Then the desired conclusion holds. ∎

In the above error decomposition, the first two terms $S_{1}$ and $S_{2}$ are called sample error. For the second term $S_{2},$ we get the following estimation.

Corollary 2.

Assume that (2.5) holds. Then there exists a subset $Z_{1,\delta}$ of $Z^{T}$ with measure at least $1-\frac{2\delta}{3}$ such that for any ${\bf z}\in Z_{1,\delta}$,

[TABLE]

Proof.

We can decompose $S_{2}$ into two parts $S_{2}=S_{2,1}+S_{2,2}$, where

[TABLE]

For $S_{2,1},$ we apply the one-side Bernstein inequality [2] to the random variable $\xi(z)=\psi_{q}(f_{\lambda}(x)-y)-\psi_{q}(\pi(f_{\lambda})(x)-y)$ . For the continuity of the loss $\psi_{q}(u),$ it satisfies $0\leq\xi\leq q(\|f_{\lambda}\|_{\infty}+\|\pi(f_{\lambda})\|_{\infty}+|y|)^{q-1}|\pi(f_{\lambda})(x)-f_{\lambda}(x)|\leq q(2\|f_{\lambda}\|_{\infty}^{q-1}+1)(1+\|f_{\lambda}\|_{\infty})\leq q(1+5\|f_{\lambda}\|^{q}_{\infty}).$ Noting that $|\xi-\mathbb{E}(\xi)|\leq q(1+5\|f_{\lambda}\|^{q}_{\infty})$ and $\mathbb{E}(\xi-\mathbb{E}(\xi))^{2}\leq q(1+5\|f_{\lambda}\|^{q}_{\infty})\mathbb{E}(\xi),$ then there exists a subset $Z^{\prime}_{1,\delta}$ of $Z^{T}$ with measure at least $1-\frac{\delta}{3}$ such that for any ${\bf z}\in Z^{\prime}_{1,\delta}$ ,

[TABLE]

For $S_{2,2},$ we take the random variable $\xi(z)=\psi_{q}(\pi(f_{\lambda})(x)-y)-\psi_{q}(f_{q}(x)-y)$, which is bounded by $2^{q}$, and estimate the variance by Lemma 1 with $f=\pi(f_{\lambda})$. Applying the one-side Bernstein inequality again, we find that there exists a subset $Z^{\prime\prime}_{1,\delta}$ of $Z^{T}$ with measure at least $1-\frac{\delta}{3}$ such that for any ${\bf z}\in Z^{\prime\prime}_{1,\delta}$,

[TABLE]

Combining the bounds (3.7) and (3.8), we get the desired conclusion (3.6). ∎

Denote $\kappa=\sup_{x\in X}\sqrt{K(x,x)}.$ For $R\geq 1$ , let $B_{R}=\{{\bf z}\in Z^{T}:\|f\|_{K}\leq R\}$ .

Corollary 3.

Assume that (1.9) and (2.5) hold. For any $f\in B_{R}$ , there exists a subset $Z_{2,\delta}$ of $Z^{T}$ with measure at least $1-\frac{\delta}{3}$ such that for all ${\bf z}\in Z_{2,\delta},$

[TABLE]

where

[TABLE]

Proof.

Consider the function set

[TABLE]

Each function $g(z)=\psi_{q}(\pi(f)(x)-y)-\psi_{q}(f_{q}(x)-y)$ in this set satisfies $\mathbb{E}g\geq 0,$ $|g(z)|\leq 2^{q}$ and $\mathbb{E}g^{2}\leq C_{\theta}\big{(}\mathbb{E}g\big{)}^{\theta}$ by (2.5). The continuity of the loss implies $|\psi_{q}(\pi(f)(x)-y)-\psi_{q}(f_{q}(x)-y)|\leq q(2+\|f_{q}\|_{\infty})^{q-1}|\pi(f)(x)-f_{q}(x)|$ . Then

[TABLE]

We apply the ratio probability inequality with the covering number in [16],

[TABLE]

We take $\varepsilon^{*}(R,T,\delta/3)$ to be the positive solution to the equation

[TABLE]

It can be expressed as

[TABLE]

The positive solution $\varepsilon^{*}(R,T,\delta/3)$ to this equation can be bounded as

[TABLE]

Then there exists a subset $Z_{2,\delta}$ of $Z^{T}$ with measure at least $1-\frac{\delta}{3}$ such that for all ${\bf z}\in Z_{2,\delta},$

[TABLE]

For any ${\bf z}\in B(R)\bigcap Z_{2,\delta},$ we have

[TABLE]

Putting the above bounds into (3), we get the desired conclusion (3). ∎
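The continuity estimate for the loss used in the proofs of Corollary 2 and Corollary 3 is an instance of the mean value inequality $\big||a|^{q}-|b|^{q}\big|\leq q\max\{|a|,|b|\}^{q-1}|a-b|$ for $q\geq 1$. The Python sketch below checks this inequality on a grid. It assumes that $\psi_{q}(u)=|u|^{q}$, that $|y|\leq 1$, and that $\pi$ truncates values to $[-1,1]$, which is consistent with the bound $2^{q}$ used above; the function names are illustrative.

```python
def psi_q(u, q):
    """q-norm loss |u|^q (assumed form of psi_q, with q >= 1)."""
    return abs(u) ** q

def project(v):
    """Truncation to [-1, 1], the assumed action of the projection operator pi."""
    return max(-1.0, min(1.0, v))

def lipschitz_gap(a, b, q):
    """Right side minus left side of | |a|^q - |b|^q | <= q max(|a|,|b|)^(q-1) |a-b|."""
    lhs = abs(psi_q(a, q) - psi_q(b, q))
    rhs = q * max(abs(a), abs(b)) ** (q - 1) * abs(a - b)
    return rhs - lhs

if __name__ == "__main__":
    grid = [i / 4.0 for i in range(-12, 13)]
    # Smallest gap over the grid; it should never be (meaningfully) negative.
    worst = min(lipschitz_gap(a, b, q) for q in (1.0, 1.5, 2.0, 3.0)
                for a in grid for b in grid)
    print(worst)
```

With $a=\pi(f)(x)-y$ and $b=f_{q}(x)-y$ one has $\max\{|a|,|b|\}\leq 2+\|f_{q}\|_{\infty}$, which gives the factor $q(2+\|f_{q}\|_{\infty})^{q-1}$ appearing in the proof.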

4 Estimating Total Error by Iteration

This section is devoted to estimating the total error $\|\pi(f^{\epsilon}_{\bf z})-f_{q}\|_{L^{r}_{\rho_{X}}}.$ To apply Corollary 2 and Corollary 3 in the error analysis, we start from the rough bound

[TABLE]

by taking $f=0$ in (1.4). This bound will be improved by the iteration technique used in [14]. For $R>0,$ denote

[TABLE]

Lemma 3.

Take $\lambda=T^{-\alpha},\epsilon=T^{-\eta}$ with $0<\alpha\leq 1$ , $0<\eta\leq\infty.$ Let $0<\xi<1.$ If $\rho$ satisfies the noise condition (1.8) and (1.6), (1.9) hold, then for any $0<\delta<1$ there exists a subset $V_{R}$ of $Z^{T}$ with measure at most $\delta$ such that for all ${\bf z}\in Z^{T}\setminus V_{R}$ there holds

[TABLE]

where $\vartheta=\max\big{\{}\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}+\xi,\frac{\alpha-\eta}{2},\frac{\alpha(1-\beta)}{2},\frac{\alpha}{2}+\frac{q(1-\beta)\alpha}{4}-\frac{1}{2},\frac{\alpha}{2}-\frac{1}{2(2-\theta)}\big{\}}$ .
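To see which of the five terms determines $\vartheta$ for a given parameter choice, the maximum can be evaluated directly. The Python sketch below transcribes the formula for $\vartheta$ stated in Lemma 3; the function name and the sample parameter values are illustrative, and $\eta=\infty$ is represented by `math.inf`.

```python
import math

def vartheta(alpha, eta, beta, k, theta, q, xi):
    """Exponent vartheta of Lemma 3 as the maximum of its five terms."""
    first = ((alpha * (2 + k - theta) - 1) * (1 + k)
             / ((2 + k - theta) * (2 + k)) + xi)
    others = [
        (alpha - eta) / 2,                           # threshold term; -inf when eta = inf
        alpha * (1 - beta) / 2,                      # approximation term
        alpha / 2 + q * (1 - beta) * alpha / 4 - 0.5,
        alpha / 2 - 1 / (2 * (2 - theta)),
    ]
    return max([first] + others)

if __name__ == "__main__":
    # Hypothetical setting in the spirit of Theorem 1: beta = 1, k = 0, eta = inf.
    print(vartheta(alpha=0.5, eta=math.inf, beta=1.0, k=0, theta=0.5, q=2.0, xi=0.01))
```

For $k=0$ the first term reduces to $\frac{\alpha}{2}-\frac{1}{2(2-\theta)}+\xi$, so it differs from the last term only by $\xi$; for $\beta=1$ the term $\frac{\alpha(1-\beta)}{2}$ vanishes, so $\vartheta\geq 0$ in that case.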

Proof.

Applying Corollary 2 and Corollary 3 with Lemma 2, we know that for any ${\bf z}\in{\cal W}(R)\bigcap Z_{1,\delta}\bigcap Z_{2,\delta},R>1,$

[TABLE]

where $A_{1}$ and $A_{2}$ are given by

[TABLE]

Let $V_{R}$ be a set whose measure is at most $\delta.$ Putting $\lambda=T^{-\alpha},\epsilon=T^{-\eta}$ with $0<\alpha\leq 1$ , $0<\eta\leq\infty$ and (1.6) into the above bound, we have for any $R>1$

[TABLE]

where the constants $a_{T}$ and $b_{T}$ are given by

[TABLE]

with $\zeta=\max\left\{\frac{\alpha-\eta}{2},\frac{\alpha(1-\beta)}{2},\frac{\alpha}{2}+\frac{q(1-\beta)\alpha}{4}-\frac{1}{2},\frac{\alpha}{2}-\frac{1}{2(2-\theta)}\right\}.$ It follows that

[TABLE]

Let us apply the above relation iteratively to a sequence $\{R^{(j)}\}_{j=0}^{J}$ defined by $R^{(0)}=\lambda^{-\frac{1}{2}}$ and $R^{(j)}=a_{T}\big{(}R^{(j-1)}\big{)}^{\frac{k}{2+2k}}+b_{T},$ where $J\in{\rm I\!N}$ will be determined later. Then ${\cal W}(R^{(j-1)})\subseteq{\cal W}(R^{(j)})\cup V_{R^{(j-1)}}.$ Since ${\cal W}(R^{(0)})=Z^{T},$ we have

[TABLE]

As the measure of $V_{R^{(j)}}$ is at most $\delta,$ we know that the measure of $\cup_{j=0}^{J-1}V_{R^{(j)}}$ is at most $J\delta.$ Hence ${\cal W}(R^{(J)})$ has measure at least $1-J\delta.$

Denote $\Delta=\frac{k}{2+2k}\leq\frac{1}{2}$ . The definition of the sequence $\{R^{(j)}\}_{j=0}^{J}$ implies that

[TABLE]

The first term

[TABLE]

Take $J$ to be the smallest integer greater than or equal to $\log\frac{1}{\xi}/\log 2$ . Then the upper bound is estimated by $A_{2}T^{\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}+\xi}$ . The second term

[TABLE]

where $b_{1}=\sqrt{q2^{q}}+2\sqrt{{\cal D}_{0}}+\sqrt{12q{\cal D}_{0}^{q/2}\log\frac{3}{\delta}}+\sqrt{A_{1}\log\frac{3}{\delta}}.$

If $\zeta>\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)},$ it is bounded by $A_{2}b_{1}JT^{\zeta}.$ If $\zeta\leq\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)},$ it is bounded by $A_{2}b_{1}JT^{\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}}.$

Thus we have

[TABLE]

where $\vartheta=\max\{\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}+\xi,\zeta\}$ . With confidence $1-J\delta,$ there holds

[TABLE]

Noting that $J\leq 2\log\frac{3}{\xi},$ we get (4.1) by replacing $\delta$ by $\delta/J$ . ∎
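The iteration in the proof can be illustrated numerically. The Python sketch below runs the recursion $R^{(j)}=a_{T}\big(R^{(j-1)}\big)^{\Delta}+b_{T}$ from $R^{(0)}=\lambda^{-1/2}$ and compares it with the unrolled bound obtained from the subadditivity $(u+v)^{\Delta}\leq u^{\Delta}+v^{\Delta}$ for $0\leq\Delta\leq 1$. The unrolled form is an assumption about the shape of the display omitted above, and the values of `a`, `b`, `lam`, `k` and `xi` are hypothetical.

```python
import math

def smallest_J(xi):
    """Smallest integer >= log(1/xi) / log 2, so that Delta^J <= 2^(-J) <= xi."""
    return math.ceil(math.log(1.0 / xi) / math.log(2.0))

def iterate_radius(a, b, R0, Delta, J):
    """Run R <- a * R^Delta + b for J steps starting from R0."""
    R = R0
    for _ in range(J):
        R = a * R ** Delta + b
    return R

def unrolled_bound(a, b, R0, Delta, J):
    """Upper bound a^(1+...+Delta^(J-1)) R0^(Delta^J) + sum_j a^(1+...+Delta^(j-1)) b^(Delta^j) + b."""
    def geo(j):
        # Partial geometric sum 1 + Delta + ... + Delta^(j-1).
        return sum(Delta ** i for i in range(j))
    total = a ** geo(J) * R0 ** (Delta ** J) + b
    for j in range(1, J):
        total += a ** geo(j) * b ** (Delta ** j)
    return total

if __name__ == "__main__":
    a, b, lam, k, xi = 3.0, 2.0, 1e-4, 2, 0.1   # hypothetical values
    Delta = k / (2 + 2 * k)                      # Delta = k/(2+2k) <= 1/2
    J = smallest_J(xi)
    R0 = lam ** -0.5
    print(J, iterate_radius(a, b, R0, Delta, J), unrolled_bound(a, b, R0, Delta, J))
```

Since $\Delta\leq\frac{1}{2}$, the exponent $\Delta^{J}$ on $R^{(0)}=\lambda^{-1/2}$ is at most $\xi$ after $J$ steps, which is why the initial rough bound contributes only the extra $\xi$ in the exponent $\vartheta$.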

Now we can prove Theorem 2.

Proof of Theorem 2. By Lemma 3, there exists a subset $V_{R^{\prime}}\subset Z^{T}$ with measure at most $\delta$ such that $Z^{T}\setminus V_{R^{\prime}}\subseteq{\cal W}(R).$ Let $R$ be the right side of (4.1). Applying Corollary 2 and Corollary 3 with this $R$ , we find another subset $V_{R}\subset Z^{T}$ with measure at most $\delta$ such that

[TABLE]

where $A_{3}=A_{2}(4A_{2})^{\frac{k}{k+1}}(1+\sqrt{q2^{q}}+2\sqrt{{\cal D}_{0}}+\sqrt{12q{\cal D}_{0}^{q/2}}+\sqrt{A_{1}})$ . By (2.1), we obtain that

[TABLE]

where

[TABLE]

and $\Lambda$ is given by (2). The restriction (1.12) ensures that $\Lambda>0.$ Replacing $\delta$ with $\delta/2,$ we complete the proof of Theorem 2.

We are now in a position to prove Theorem 1.

Proof of Theorem 1. We shall prove Theorem 1 by Theorem 2. First, we check the noise condition (1.8). Let the functions be $a(x)=\frac{1}{4}$ and $b(x)=2^{2\varphi+1},$ $\forall x\in X$ . For $s\in[0,a(x)]=[0,\frac{1}{4}],$ we have

[TABLE]

Similarly, $\forall s\in[0,\frac{1}{4}],$

[TABLE]

Hence $\rho$ has $\infty$ -average type $\varphi+1$ .

Since $f_{q}\in{\cal H}_{K}$ and $K\in C^{\infty}(X\times X),$ (1.6) and (1.9) hold with $\beta=1$ and $k=0.$ Thus, $\theta=\frac{2}{q+\varphi+1}$ and $r=q+\varphi+1$ . Note that the choice of $\lambda$ and $\epsilon$ satisfies (1.12) and $\Lambda>0$ . This completes the proof of Theorem 1.

Proof of Corollary 1. It is an easy consequence of Theorem 2.

Bibliography


[1] D. R. Chen, Q. Wu, Y. M. Ying and D. X. Zhou, Support vector machine soft margin classifiers: error analysis, Journal of Machine Learning Research 2 (2004), 1143–1175.
[2] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1997.
[3] T. Hu, J. Fan, Q. Wu and D. X. Zhou, Regularization schemes for minimum error entropy principle, Analysis and Applications 13 (2015), 437, DOI: 10.1142/S0219530514500110.
[4] P. J. Huber, Robust Statistics, Wiley, 1981.
[5] T. Hu, D. H. Xiang and D. X. Zhou, Online learning for quantile regression and support vector regression, Journal of Statistical Planning and Inference 142 (2012), 3107–3122.
[6] R. Koenker and G. Bassett, Regression quantiles, Econometrica 46 (1978), 33–50.
[7] S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Anal. Appl. 1 (2003), 17–41.
[8] I. Steinwart, How to compare different loss functions and their risks, Constr. Approx. 26 (2007), 225–287.
