Learning Rates of Regression with q-norm Loss and Threshold
Ting Hu, Yuan Yao

Ting Hu
School of Mathematics and Statistics, Wuhan University
Luojia Hill, Wuhan 430072, China, [email protected]
Yuan Yao
School of Mathematical Sciences, Peking University
Beijing 100871, China, [email protected]
Abstract
This paper studies some robust regression problems associated with the $q$-norm loss ($q \ge 1$) and the $\epsilon$-insensitive $q$-norm loss in the reproducing kernel Hilbert space. We establish a variance-expectation bound under an a priori noise condition on the conditional distribution, which is the key technique for measuring the error bound. Explicit learning rates are given under approximation ability assumptions on the reproducing kernel Hilbert space.
Key Words and Phrases. Insensitive $q$-norm loss, quantile regression, reproducing kernel Hilbert space, sparsity.
Mathematical Subject Classification. 68Q32, 41A25
1 Introduction
In this paper we consider regression with the $q$-norm loss $\psi_q$ with $q \ge 1$ and an $\epsilon$-insensitive $q$-norm loss $\psi_q^{\epsilon}$ (to be defined) with a threshold $\epsilon > 0$. Here $\psi_q$ is the univariate function defined by $\psi_q(u) = |u|^q$. For a learning algorithm generated by a regularization scheme in reproducing kernel Hilbert spaces, learning rates and approximation error will be presented when $\epsilon$ is chosen appropriately for balancing learning rates and sparsity.
For $q = 1$, the regression problem is the classical statistical method of least absolute deviations, which is more robust than the least squares method and is resistant to outliers in data [4]. Its associated loss is widely used in practical applications for robustness. In fact, for all $1 \le q < 2$ the loss $\psi_q$ is less sensitive to outliers and is thus more robust than the square loss. Vapnik [13] proposed an $\epsilon$-insensitive loss $\psi^{\epsilon}$ to get sparsity in support vector regression, which is defined by
$$\psi^{\epsilon}(u) = \max\{|u| - \epsilon,\ 0\}, \qquad u \in \mathbb{R}. \tag{1.1}$$
When $\epsilon > 0$ is fixed, an error analysis was conducted in [12]. Xiang, Hu and Zhou [17, 18] showed how to accelerate learning rates and preserve sparsity by adapting $\epsilon$. The convergence with flexible $\epsilon$ in an online algorithm was discussed in [5]. For quantile regression with a pinball loss having different slopes on the two sides of the origin [6], Steinwart and Christmann [10, 9] established comparison theorems and derived learning rates under some noise conditions.
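As a small illustration of the pinball loss mentioned above, the following Python sketch evaluates it and checks on a toy sample that the constant minimizing the empirical pinball risk is a sample quantile. The quantile level `tau`, the function names, and the sample values are illustrative assumptions and do not come from the paper.

```python
def pinball_loss(u, tau):
    """Pinball loss: slope tau for u >= 0 and slope (1 - tau) for u < 0."""
    return tau * u if u >= 0 else (tau - 1.0) * u


def empirical_risk(t, ys, tau):
    """Average pinball loss of the constant prediction t on the sample ys."""
    return sum(pinball_loss(y - t, tau) for y in ys) / len(ys)


# Hypothetical toy sample (already sorted for readability).
ys = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 2.2, 3.1, 4.0]

# Search over the sample points themselves: a minimizer of the empirical
# pinball risk over constants can always be found among them.
best_median = min(ys, key=lambda t: empirical_risk(t, ys, 0.5))
best_upper = min(ys, key=lambda t: empirical_risk(t, ys, 0.8))

print(best_median, best_upper)
```

With `tau = 0.5` the two slopes coincide and the loss is half the absolute value, so the minimizer is the sample median.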
In this paper, we apply the $q$-norm loss $\psi_q$ with $q > 1$ to improve the convexity of the insensitive loss $\psi^{\epsilon}$. Our results show how the insensitive parameter $\epsilon$, which produces the sparsity, can be chosen adaptively as a function of the sample size so as to affect the error rates of the learning algorithm (to be defined by (1.4)). Such results include some earlier studies as special cases.
In the sequel, assume that the input space $X$ is a compact metric space and the output space $Y = \mathbb{R}$. Let $\rho$ be a Borel probability measure on $Z := X \times Y$, $\rho_x$ be the conditional distribution of $\rho$ at each $x \in X$ and $\rho_X$ be the marginal distribution on $X$. For a measurable function $f : X \to Y$, the generalization error $\mathcal{E}^{q}(f)$ associated with the $q$-norm loss $\psi_q$ is defined by
$$\mathcal{E}^{q}(f) = \int_{Z} \psi_q\big(y - f(x)\big)\, d\rho. \tag{1.2}$$
Denote by $f_q$ the minimizer of the generalization error $\mathcal{E}^{q}$ over all measurable functions. Its properties and the corresponding learning problem in the empirical risk minimization framework were discussed in [20]. When $q = 1$, the target function $f_q$ is a function containing the medians of the conditional distributions $\rho_x$ for all $x \in X$. For symmetric distributions, the median is also the regression function, which is the conditional mean of $y$ for given $x \in X$. We aim at learning the minimizer $f_q$ from a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{T} \in Z^{T}$ which is assumed to be independently drawn according to $\rho$. Inspired by the $\epsilon$-insensitive loss $\psi^{\epsilon}$ [13], we introduce an $\epsilon$-insensitive $q$-norm loss $\psi_q^{\epsilon}$, which is defined by
$$\psi_q^{\epsilon}(u) = \big(\max\{|u| - \epsilon,\ 0\}\big)^{q}, \qquad u \in \mathbb{R}. \tag{1.3}$$
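The losses discussed so far can be compared numerically. The sketch below assumes the forms $\psi_q(u) = |u|^q$, $\psi^{\epsilon}(u) = \max\{|u| - \epsilon, 0\}$ and $\psi_q^{\epsilon}(u) = (\max\{|u| - \epsilon, 0\})^q$; the function names are hypothetical and chosen only for this illustration.

```python
def q_norm_loss(u, q):
    """q-norm loss |u|^q."""
    return abs(u) ** q


def insensitive_loss(u, eps):
    """Vapnik's eps-insensitive loss: zero inside the tube |u| <= eps."""
    return max(abs(u) - eps, 0.0)


def insensitive_q_norm_loss(u, q, eps):
    """eps-insensitive q-norm loss: the eps-insensitive loss raised to the power q."""
    return max(abs(u) - eps, 0.0) ** q


# Residuals inside the tube incur no loss, which is the source of sparsity.
for u in (0.05, 0.5, 2.0):
    print(u, q_norm_loss(u, 1.5), insensitive_q_norm_loss(u, 1.5, 0.1))
```

Setting `eps = 0` recovers the $q$-norm loss, and setting `q = 1` recovers the $\epsilon$-insensitive loss.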
Our learning task will be carried out by a regularization scheme in reproducing kernel Hilbert spaces. With a continuous, symmetric and positive semidefinite function $K : X \times X \to \mathbb{R}$ (called a Mercer kernel), the reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ is defined as the completion of the span of $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_u \rangle_K = K(x, u)$. The regularization algorithm in the paper takes the form
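$$f_{\mathbf{z}}^{\epsilon} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{T} \sum_{i=1}^{T} \psi_q^{\epsilon}\big(y_i - f(x_i)\big) + \lambda \|f\|_K^2 \Big\}. \tag{1.4}$$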
Here $\lambda > 0$ is a regularization parameter. Our learning rates are stated in terms of the approximation or regularization error, noise conditions, and the capacity of the RKHS. Our main goal is to study how the learned function $f_{\mathbf{z}}^{\epsilon}$ in (1.4) converges to the target function $f_q$. There is a large literature [1, 16, 7] in learning theory studying the approximation error or regularization error of the triple $(K, \rho, q)$ defined by
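$$\mathcal{D}(\lambda) = \min_{f \in \mathcal{H}_K} \Big\{ \mathcal{E}^{q}(f) - \mathcal{E}^{q}(f_q) + \lambda \|f\|_K^2 \Big\}, \qquad \lambda > 0.$$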
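The regularization function $f_\lambda$ is defined as
$$f_{\lambda} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \mathcal{E}^{q}(f) - \mathcal{E}^{q}(f_q) + \lambda \|f\|_K^2 \Big\}. \tag{1.5}$$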
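To make the regularization scheme (1.4) concrete, the following Python sketch minimizes an empirical $\epsilon$-insensitive $q$-norm risk plus $\lambda \|f\|_K^2$ over functions of the form $f = \sum_j \alpha_j K(x_j, \cdot)$, which suffices by the representer theorem. It is only a sketch under stated assumptions: the Gaussian kernel, the kernel width, the synthetic data, the parameter values, and the gradient descent with step halving are all illustrative choices, not the method analyzed in the paper.

```python
import math
import random


def gaussian_kernel(s, t, width=0.5):
    """Gaussian Mercer kernel on the real line (illustrative choice)."""
    return math.exp(-((s - t) ** 2) / (2.0 * width ** 2))


def predictions(alpha, gram):
    """Values of f = sum_j alpha_j K(x_j, .) at the sample points."""
    n = len(alpha)
    return [sum(gram[i][j] * alpha[j] for j in range(n)) for i in range(n)]


def objective(alpha, gram, ys, q, eps, lam):
    """Empirical eps-insensitive q-norm risk plus lam * ||f||_K^2."""
    preds = predictions(alpha, gram)
    risk = sum(max(abs(y - p) - eps, 0.0) ** q for y, p in zip(ys, preds)) / len(ys)
    norm_sq = sum(a * p for a, p in zip(alpha, preds))  # alpha^T G alpha
    return risk + lam * norm_sq


def gradient(alpha, gram, ys, q, eps, lam):
    """Gradient of the objective with respect to alpha (valid for q > 1)."""
    n = len(ys)
    preds = predictions(alpha, gram)
    d = []  # derivative of each loss term with respect to its prediction
    for y, p in zip(ys, preds):
        slack = max(abs(y - p) - eps, 0.0)
        d.append(-q * slack ** (q - 1) * math.copysign(1.0, y - p) if slack > 0 else 0.0)
    return [sum(gram[i][j] * (d[i] / n + 2.0 * lam * alpha[i]) for i in range(n))
            for j in range(n)]


def fit(xs, ys, q=1.5, eps=0.05, lam=0.01, steps=200):
    """Gradient descent; a step is halved until the objective decreases."""
    gram = [[gaussian_kernel(a, b) for b in xs] for a in xs]
    alpha = [0.0] * len(xs)
    history = [objective(alpha, gram, ys, q, eps, lam)]
    for _ in range(steps):
        g = gradient(alpha, gram, ys, q, eps, lam)
        step = 1.0
        while step > 1e-12:
            trial = [a - step * gi for a, gi in zip(alpha, g)]
            value = objective(trial, gram, ys, q, eps, lam)
            if value < history[-1]:
                alpha, step = trial, 0.0  # accept and leave the inner loop
                history.append(value)
            else:
                step *= 0.5
        if len(history) == 1:
            break  # no decreasing step found at the starting point
    return alpha, history


random.seed(0)
xs = [i / 19 for i in range(20)]
ys = [math.sin(2 * math.pi * x) + 0.1 * random.gauss(0.0, 1.0) for x in xs]
alpha, history = fit(xs, ys)
print(history[0], history[-1])
```

Because a step is accepted only when it lowers the objective, the recorded values in `history` are strictly decreasing; no claim is made about the rate at which they approach the minimum.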
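In the sequel, let $L^{p}_{\rho_X}$ be the space of $p$-integrable functions with respect to $\rho_X$ and $\|\cdot\|_{L^{p}_{\rho_X}}$ be the norm in $L^{p}_{\rho_X}$. A usual assumption on the regularization error $\mathcal{D}(\lambda)$, which imposes certain smoothness on $\mathcal{H}_K$, is
$$\mathcal{D}(\lambda) \le \mathcal{D}_0 \lambda^{\beta}, \qquad \forall \lambda > 0, \tag{1.6}$$
with some $0 \le \beta \le 1$ and $\mathcal{D}_0 > 0$.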
Remark 1.
Assumption (1.6) always holds with . When the target function and is dense in which consists of bounded continuous functions on , the approximation error as Thus, the decay (1.6) is natural and can be illustrated in terms of interpolation spaces [7]. Define the integral operator by and suppose that the minimizer is in the range of with . When the approximation error can be for quantile regression [18]. When , for the least square. For other , the associated loss is Lipschitz in a bounded domain and the corresponding can be characterized by the -functional [1], which can have the same polynomial decay as (1.6).
We assume that the conditional distribution is supported on at each and is non-degenerate, i.e. any non-empty open set of has strictly positive measure, which ensures that the target function is unique. Without loss of generality, let the support of be at each and our analysis below is applicable for any We will prove that in the next section. It is natural to project values of the learned function onto some interval by the projection operator [1, 15].
Definition 1.
The projection operator on the space of measurable functions onto the interval is defined by
[TABLE]
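The projection in Definition 1 amounts to clipping function values to an interval. A minimal Python sketch, assuming a symmetric interval $[-\text{bound}, \text{bound}]$ where `bound` is a hypothetical parameter introduced for illustration:

```python
def project(value, bound=1.0):
    """Clip a real value to the interval [-bound, bound]."""
    return max(-bound, min(bound, value))


def project_function(f, bound=1.0):
    """Return the function x -> project(f(x)), i.e. the projected function."""
    return lambda x: project(f(x), bound)


# Example: a learned function whose values leave the interval.
learned = lambda x: 3.0 * x - 1.0
clipped = project_function(learned)
print([clipped(x) for x in (0.0, 0.25, 0.5, 1.0)])
```

When the target function takes values in the interval, clipping can only move a prediction closer to the target, which is why the projected function is the natural object for error analysis.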
To demonstrate our main result in the general case, we shall give the following learning rate in the special case when is .
Theorem 1.
Let and . Assume that with , and the conditional distributions have density functions given by
[TABLE]
where . Take , then for any with confidence we have
[TABLE]
where is a constant independent of or
To state our main result in the general case, we need a noise condition on the measure introduced in [9, 10].
Definition 2.
Let and . We say that has a -average type if there exist two functions and from to such that and for any and , there holds
[TABLE]
and
[TABLE]
This assumption is satisfied by many common conditional distributions, such as Gaussian, Student's t, and uniform distributions. In the following, we give an example to illustrate Definition 2 in detail. More examples can be found in [9, 10].
Example 1.
We assume that the conditional distributions are Gaussian distributions with a uniform variance , i.e. , where are expectations of the Gaussian distributions . It is not difficult to check that the minimizer can take the value of at each then for any , there holds
[TABLE]
By similarity, we also have that . Thus, the measure has a -average type .
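The claim about the minimizer in this Gaussian example can be checked numerically. The sketch below approximates the conditional $q$-risk $t \mapsto \int |y - t|^q \, p(y)\, dy$ for a Gaussian density $p$ by a Riemann sum and searches a grid of candidate values; by symmetry of the density the minimum is expected at the mean. The mean, standard deviation, exponent, and grid sizes are illustrative assumptions.

```python
import math


def q_risk(t, mean, sigma, q, half_width=8.0, n=2001):
    """Riemann-sum approximation of the integral of |y - t|^q against a Gaussian density."""
    lo = mean - half_width * sigma
    h = 2.0 * half_width * sigma / (n - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        y = lo + i * h
        density = math.exp(-((y - mean) ** 2) / (2.0 * sigma ** 2)) / norm
        total += abs(y - t) ** q * density
    return total * h


mean, sigma, q = 0.3, 0.5, 1.5  # hypothetical values for one fixed input x
candidates = [mean + 0.01 * k for k in range(-50, 51)]
best = min(candidates, key=lambda t: q_risk(t, mean, sigma, q))
print(best)
```

The same check works for any $q \ge 1$ and any symmetric density; for asymmetric conditional distributions the minimizer generally depends on $q$.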
Our error analysis is related to the capacity of the hypothesis space which is measured by covering numbers.
Definition 3.
For a subset of and , the covering number is the minimal integer such that there exist disks with radius covering .
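For a finite family of functions, an upper bound on the covering number in Definition 3 can be computed greedily. The sketch below represents each function by its values on a grid and uses the maximum distance on that grid as a stand-in for the uniform norm; the family $f_c(x) = c\,x$, the grid, and the radius are illustrative assumptions.

```python
def sup_distance(f, g):
    """Maximum absolute difference between two functions sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(f, g))


def greedy_cover(functions, radius):
    """Pick centers greedily so that every function lies within `radius` of a center."""
    centers = []
    for f in functions:
        # Add f as a new center only if no existing center already covers it.
        if all(sup_distance(f, c) > radius for c in centers):
            centers.append(f)
    return centers


grid = [i / 20 for i in range(21)]
functions = [tuple((c / 50) * x for x in grid) for c in range(51)]  # f_c(x) = (c/50) x
centers = greedy_cover(functions, radius=0.1)
print(len(centers))
```

The number of greedy centers only bounds the covering number from above, since the minimal cover may use fewer disks or centers outside the family.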
The covering numbers of balls with of the RKHS have been well understood in the learning theory [22, 23]. In this paper, we assume for some and that
[TABLE]
Remark 2.
When is a bounded subset of and the RKHS is a Sobolev space with index , it is shown [22] that the condition (1.9) holds true with . If the kernel lies in the smooth space then (1.9) is satisfied for an arbitrarily small Another common way to measure the capacity of is the empirical covering number [21], which is out of scope of our discussion in this paper.
Denote
[TABLE]
The following learning rates in the general case will be proved in Section 4. We point out that the proof of Theorem 2 is only applicable to the case $q > 1$. However, when $q = 1$, the problem is a special case of quantile regression and the same learning rates as those of Theorem 2 can be found in [17, 18].
Theorem 2.
Suppose that has a -average type for some and . Assume that the regularization error condition (1.6) is satisfied for some and (1.9) holds with . Take with , Let Then for any with confidence there holds
[TABLE]
where is a constant independent of or ,
[TABLE]
with provided that
[TABLE]
Corollary 1.
Let , . Assume (1.6) and (1.8). Take , with . If , then the index for the learning rate (1.11) is .
Remark 3.
When , the corresponding threshold is [math] and it is a least square problem for , which is widely discussed in [15, 16]. If has a -average type with and , the learning rate for the least square. It follows that the error by . Thus, it can be near the optimal rate in space if is small enough.
When the learning error will be with choice , depending only on the ’s approximation ability (1.6) and noise condition (1.8). Specially, when goes to 1, it is the quantile regression [17, 18] and the best rate is in this paper if has a -average type with and .
2 Comparison and Perturbation Theorem
Approximation or learning ability of a regularized algorithm for regression problems can usually be studied by estimating the excess generalization error for the learned function from the algorithm (1.4). However the following comparison theorem would yield bounds for the error in the space when the noise condition is satisfied.
Theorem 3.
If has a p-average type , then for any measurable function we have the inequality
[TABLE]
where the constant
Proof.
For a measurable function , the generalization error is rewritten as where
[TABLE]
Denote It is obvious that the minimizer of takes the value of for each . Noting that the conditional distribution is supported on , the minimizer can be on . Consider the case Since the loss function is differentiable and for all , by a corollary of the Lebesgue dominated convergence theorem, we can exchange the order of integration and differentiation of as This together with the fact we have
[TABLE]
which means that
[TABLE]
Let for simplicity, then we have . Noting that for ,
[TABLE]
Combining the first term above with (2.2), we get
[TABLE]
Thus,
[TABLE]
Let us consider the first case Noting the noise condition (1.8) and , we obtain that
[TABLE]
For the second case we have
[TABLE]
In general, we can see that for any ,
[TABLE]
By similarity, if , we also have
[TABLE]
Applying the two above inequalities (2.3) and (2.4) with and we have that
[TABLE]
Taking powers and integrating,
[TABLE]
Combining this with the Hölder inequality, we obtain that for and
[TABLE]
Then the desired conclusion (2.1) holds. For (2.1) also holds and the proof can be found in [18]. ∎
It yields a variance-expectation bound which will be applied in the next section.
Lemma 1.
Under the same conditions as Theorem 3, for any measurable function , we have the inequality
[TABLE]
where the power index is defined as (1.10) and .
Proof.
By the continuity of and , we see that
[TABLE]
It implies that
[TABLE]
If then
[TABLE]
Else,
[TABLE]
Combining the above two cases, we can get the conclusion (2.5). ∎
The threshold changes with the sample size and plays a crucial role in the design of algorithm (1.4). By Taylor expansion, we have the following relation
[TABLE]
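The effect of the threshold on the loss can be quantified by an elementary estimate. Assuming the forms $\psi_q(u) = |u|^q$ and $\psi_q^{\epsilon}(u) = (\max\{|u| - \epsilon, 0\})^q$ with $q \ge 1$, the mean value theorem gives $0 \le \psi_q(u) - \psi_q^{\epsilon}(u) \le q\,\epsilon\,|u|^{q-1}$. This bound is stated here only as an illustration and is not claimed to be the exact form of the relation displayed above. The Python sketch checks it on a grid; all names and grid values are hypothetical.

```python
def q_norm_loss(u, q):
    """q-norm loss |u|^q."""
    return abs(u) ** q


def insensitive_q_norm_loss(u, q, eps):
    """eps-insensitive q-norm loss (max(|u| - eps, 0))^q."""
    return max(abs(u) - eps, 0.0) ** q


def gap(u, q, eps):
    """How much the threshold eps lowers the loss at residual u."""
    return q_norm_loss(u, q) - insensitive_q_norm_loss(u, q, eps)


def mean_value_bound(u, q, eps):
    """Upper bound q * eps * |u|^(q - 1) obtained from the mean value theorem."""
    return q * eps * abs(u) ** (q - 1)


residuals = [k / 50 for k in range(-100, 101)]  # grid on [-2, 2]
worst = {}
for q in (1.0, 1.5, 2.0):
    for eps in (0.5, 0.1, 0.01):
        worst[(q, eps)] = max(gap(u, q, eps) for u in residuals)
print(worst)
```

On a bounded range of residuals the gap is therefore of order $\epsilon$, uniformly in $u$, which is what makes a threshold tending to zero with the sample size compatible with learning the $q$-norm target.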
When the threshold the -insensitive -norm loss converges to the -norm function almost surely. In the following, we shall study the approximation of the target function by which is the minimizer of the -generalization error for . Denote
[TABLE]
and is the minimizer of . By the same proof procedure as (2.2) in Theorem 3, we also get
[TABLE]
and takes the value of at each . Then the perturbation properties hold. We use some ideas from [3] in the proof.
Proposition 1.
For then
[TABLE]
For any measurable function on we have
[TABLE]
Proof.
Suppose that there exists a satisfying . Consider the case Together with the fact (2.2) and , we note that
[TABLE]
It is obvious that by the hypothesis that for any . By (2) with , we also get
[TABLE]
Combining (2) with (2), we know that
[TABLE]
The above equalities hold if and only if and at the same time. Immediately, we see that . By the hypothesis , it follows that
[TABLE]
This is a contradiction. By similarity, we get that for each . Then the desired conclusion (2.9) holds. By the relation (2.6) and , we can see that
[TABLE]
Then the desired conclusion (2.10) holds. ∎
We recall the fact that the conditional distribution is non-degenerate for each then the uniqueness of the minimizer is stated as follows. For simplicity, we denote as the target function and as the generalization error with the $q$-norm loss when in the next proposition.
Proposition 2.
For the function is the unique minimizer of the -generalization error .
Proof.
Suppose that is not the unique minimizer. For some there exists such that they are both the minimizers of by (2.7) and satisfy the equality (2) with or . Applying (2) with and , it follows that
[TABLE]
Applying (2) with again, we see that the first term of the above inequality is equal to the last term . This implies
[TABLE]
The above equalities hold if and only if and simultaneously . Since is non-degenerate and supported on , then the values of and must satisfy and . By the hypothesis , we get . This contradicts . The proof is complete. ∎
3 Error Decomposition and Sample Error
Now we can conduct an error decomposition.
Lemma 2.
Define by (1.5). Let then
[TABLE]
where
[TABLE]
Proof.
By the same procedure as in [11, 14, 15, 16], can be expressed as
[TABLE]
The relation (2.6) yields
[TABLE]
and
[TABLE]
The restriction implies . By (3) and (3.5), then we have
[TABLE]
Since we have
[TABLE]
Then the desired conclusion holds. ∎
In the above error decomposition, the first two terms and are called the sample error. For the second term we get the following estimate.
Corollary 2.
Assume that (2.5) holds. Then there exists a subset of with measure at least such that for any ,
[TABLE]
Proof.
We can decompose into two parts , where
[TABLE]
For we apply the one-sided Bernstein inequality [2] to the random variable . By the continuity of the loss it satisfies Noting that and then there exists a subset of with measure at least such that for any ,
[TABLE]
For we take the random variable which is bounded by and estimate the variance by Lemma 1 with . Applying the one-sided Bernstein inequality again, we find that there exists a subset of with measure at least such that for any ,
[TABLE]
Combining the bounds (3.7) and (3.8), we get the desired conclusion (3.6). ∎
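The one-sided Bernstein inequality used in this proof can be made concrete. In its standard form, for i.i.d. random variables $\xi_i$ with variance at most $\sigma^2$ and $\xi_i - \mathbb{E}\xi_i \le B$ almost surely, the probability that $\frac{1}{T}\sum_{i=1}^{T}(\xi_i - \mathbb{E}\xi_i) > t$ is at most $\exp\{-T t^2 / (2(\sigma^2 + B t / 3))\}$. The Python sketch evaluates this tail and a deviation level at which the tail is at most a prescribed $\delta$; the function names and numerical values are illustrative assumptions.

```python
import math


def bernstein_tail(t, n, variance, bound):
    """Right-hand side of the one-sided Bernstein inequality at deviation t."""
    return math.exp(-n * t ** 2 / (2.0 * (variance + bound * t / 3.0)))


def bernstein_deviation(delta, n, variance, bound):
    """A deviation level t with bernstein_tail(t) <= delta.

    Solving n t^2 = 2 log(1/delta) (variance + bound t / 3) exactly and then
    enlarging the root by sqrt(a^2 + b^2) <= a + b gives this closed form.
    """
    log_term = math.log(1.0 / delta)
    return 2.0 * bound * log_term / (3.0 * n) + math.sqrt(2.0 * variance * log_term / n)


n, variance, bound, delta = 1000, 0.25, 1.0, 0.05  # hypothetical values
t = bernstein_deviation(delta, n, variance, bound)
print(t, bernstein_tail(t, n, variance, bound))
```

The deviation has a term of order $1/T$ driven by the bound $B$ and a term of order $\sqrt{\sigma^2 / T}$ driven by the variance, which is why a variance-expectation bound such as Lemma 1 sharpens the resulting sample error estimate.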
Denote For , let .
Corollary 3.
Assume that (1.9) and (2.5) hold. For any , there exists a subset of with measure at least such that for all
[TABLE]
where
[TABLE]
Proof.
Consider the function set
[TABLE]
A function from this set satisfies and $\mathbb{E}g^{2}\leq C_{\theta}\big(\mathbb{E}g\big)^{\theta}$ by (2.5). The continuity of the loss implies . Then
[TABLE]
We apply the ratio probability inequality with the covering number in [16],
[TABLE]
We take to be the positive solution to the equation
[TABLE]
It can be expressed as
[TABLE]
The positive solution to this equation can be bounded as
[TABLE]
Then there exists a subset of with measure at least such that for all
[TABLE]
For any we have
[TABLE]
Putting the above bounds into (3), then we get the desired conclusion (3). ∎
4 Estimating Total Error by Iteration
This section is devoted to estimating total error To apply Corollary 2 and Corollary 3 for error analysis, we get the rough bound
[TABLE]
by taking in (1.4). This bound will be improved by the iteration technique used in [14]. For denote
[TABLE]
Lemma 3.
Take with , Let If satisfy the noise condition (1.8) and (1.6), (1.9) hold, then for any with confidence there exists a subset of with measure at most such that holds
[TABLE]
where $\vartheta=\max\big\{\frac{[\alpha(2+k-\theta)-1](1+k)}{(2+k-\theta)(2+k)}+\xi,\ \frac{\alpha-\eta}{2},\ \frac{\alpha(1-\beta)}{2},\ \frac{\alpha}{2}+\frac{q(1-\beta)\alpha}{4}-\frac{1}{2},\ \frac{\alpha}{2}-\frac{1}{2(2-\theta)}\big\}$.
Proof.
Applying Corollary 2 and Corollary 3 with Lemma 2, we know that for any
[TABLE]
where and is given by
[TABLE]
Let be a set whose measure is at most Putting with , and (1.6) into the above bound, then for any we have
[TABLE]
where the constants and are given by
[TABLE]
with It follows that
[TABLE]
Let us apply the above relation iteratively to a sequence defined by and $R^{(j)}=a_{T}\big(R^{(j-1)}\big)^{\frac{k}{2+2k}}+b_{T}$ where will be determined later. Then Noting that then
[TABLE]
As the measure of is at most we know that the measure of is at most Hence has measure at least
Denote . The definition of the sequence implies that
[TABLE]
The first term
[TABLE]
Taking be the smallest integer greater than or equal to . Then the upper bound is estimated by . The second term
[TABLE]
where
If it is bounded by If it is bounded by
Thus we have
[TABLE]
where . With confidence there holds
[TABLE]
Noting then we can get (4.1) by replacing by . ∎
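The iteration in this proof replaces a rough radius by successively smaller ones through the recursion $R^{(j)} = a_T \big(R^{(j-1)}\big)^{k/(2+2k)} + b_T$. The Python sketch below runs this recursion for hypothetical values of $a_T$, $b_T$, $k$ and the starting radius, which are chosen only for illustration. Since the exponent $k/(2+2k)$ is less than $1/2$, the map is increasing and concave, so a sequence started above its fixed point decreases toward it.

```python
def iterate_bound(a, b, k, r0, steps):
    """Run R_j = a * R_{j-1}^(k / (2 + 2k)) + b starting from R_0 = r0."""
    exponent = k / (2.0 + 2.0 * k)
    radii = [r0]
    for _ in range(steps):
        radii.append(a * radii[-1] ** exponent + b)
    return radii


# Hypothetical constants; in the proof they depend on the sample size.
a, b, k = 1.0, 0.1, 1.0
radii = iterate_bound(a, b, k, r0=100.0, steps=50)
print(radii[:4], radii[-1])
```

Most of the improvement occurs in the first few steps, which is consistent with the proof needing only finitely many iterations.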
Now we can prove Theorem 2.
Proof of Theorem 2. By Lemma 3, there exists a subset with measure at most such that Let be the right side of (4.1). Applying Corollary 2 and Corollary 3 to , then there exists another subset with measure at most such that
[TABLE]
where . By (2.1), we obtain that
[TABLE]
where
[TABLE]
and is given by (2). The restriction (1.12) ensures that Replacing with we complete the proof of Theorem 2.
Now we are in a position to prove Theorem 1.
Proof of Theorem 1. We derive Theorem 1 from Theorem 2. First, we check the noise condition (1.8). Let the function and . For then
[TABLE]
By similarity,
[TABLE]
So we say that has a -average type .
Since and then (1.6) and (1.9) hold with and Thus, and . Noting that the choice of and satisfy (1.12) and . This completes the proof of Theorem 1.
Proof of Corollary 1. It is an easy consequence of Theorem 2.
- [1] D. R. Chen, Q. Wu, Y. M. Ying and D. X. Zhou, Support vector machine soft margin classifiers: error analysis, Journal of Machine Learning Research 5 (2004), 1143–1175.
- [2] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer-Verlag, New York, 1997.
- [3] T. Hu, J. Fan, Q. Wu and D. X. Zhou, Regularization schemes for minimum error entropy principle, Analysis and Applications 13 (2015), 437, DOI: 10.1142/S0219530514500110.
- [4] P. J. Huber, Robust Statistics, Wiley, 1981.
- [5] T. Hu, D. H. Xiang and D. X. Zhou, Online learning for quantile regression and support vector regression, Journal of Statistical Planning and Inference 142 (2012), 3107–3122.
- [6] R. Koenker and G. Bassett, Regression quantiles, Econometrica 46 (1978), 33–50.
- [7] S. Smale and D. X. Zhou, Estimating the approximation error in learning theory, Analysis and Applications 1 (2003), 17–41.
- [8] I. Steinwart, How to compare different loss functions and their risks, Constructive Approximation 26 (2007), 225–287.
