Optimal rates for F-score binary classification
Evgenii Chzhen (LAMA)

TL;DR
This paper establishes optimal minimax rates for binary classification using F-score under smoothness and margin assumptions, proposing a semi-supervised method that efficiently estimates the classifier with proven optimality.
Contribution
It introduces a semi-supervised classification procedure for F-score maximization that achieves minimax optimal rates under smoothness and margin conditions.
Findings
Achieves the rate $O(n^{-(1+\alpha)\beta/(2\beta+d)})$ for the excess F-score.
Establishes the optimality of the proposed rates in a minimax sense.
Shows that unlabeled data size does not affect convergence rates.
LAMA, Université Paris-Est, Cité Descartes, 5 boulevard Descartes, 77454 Marne-la-Vallée cedex 2
Abstract
We study the minimax settings of binary classification with F-score under the $\beta$-smoothness assumptions on the regression function $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$ for $x \in \mathbb{R}^d$. We propose a classification procedure which under the $\alpha$-margin assumption achieves the rate $O(n^{-(1+\alpha)\beta/(2\beta+d)})$ for the excess F-score. In this context, the Bayes optimal classifier for the F-score can be obtained by thresholding the aforementioned regression function $\eta$ on some level $\theta^*$ to be estimated. The proposed procedure is performed in a semi-supervised manner, that is, for the estimation of the regression function we use a labeled dataset of size $n \in \mathbb{N}$ and for the estimation of the optimal threshold we use an unlabeled dataset of size $N \in \mathbb{N}$. Interestingly, the value of $N$ does not affect the rate of convergence, which indicates that it is “harder” to estimate the regression function $\eta$ than the optimal threshold $\theta^*$. This further implies that binary classification with F-score behaves similarly to the standard settings of binary classification. Finally, we show that the rates achieved by the proposed procedure are optimal in the minimax sense up to a constant factor.
1 Introduction
The problem of binary classification is among the most basic and well-studied problems in statistics and machine learning Vapnik98 ; Yang99 ; Bartlett_Mendelson02 ; Audibert04 ; Massart_Nedelec06 ; Audibert_Tsybakov07 . Until very recently, theoretical guarantees were almost exclusively formulated in terms of the probability of misclassification (that is, in terms of the accuracy) as the measure of the risk. This choice of risk is practically suitable in the case of “well-balanced” distributions and datasets, that is, when the probabilities of observing either class are similar.
Once this assumption fails to be satisfied, classifiers based on the accuracy might perform poorly in practice. One possible approach to treat such a situation is to modify the measure to be optimized in an appropriate way. A popular choice of such a measure is the F-score, whose roots can be traced back to the information retrieval literature Rijsbergen74 ; Lewis95 . From the statistical point of view there are two alternative approaches Ye_Chai_Lee_Chieu12 ; Dembczynski_Kotlowski_Koyejo_Natarajan17 to the theoretical treatment of the F-score: Population Utility (PU) and Expected Test Utility (ETU). In this work we follow the PU approach which, as noted in Dembczynski_Kotlowski_Koyejo_Natarajan17 , has stronger roots in classical statistics. Our goal is to provide a minimax analysis of binary classification with F-score under non-parametric assumptions.
2 The problem formulation
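We first introduce some notation that is used throughout this work. For any two real numbers $a, b$ we denote by $a \wedge b$ (resp. $a \vee b$) the minimum (resp. the maximum) of $a$ and $b$. The standard Euclidean norm in $\mathbb{R}^d$ is denoted by $\|\cdot\|_2$ and a ball centered at $x \in \mathbb{R}^d$ of radius $r > 0$ is denoted by $\mathcal{B}(x, r)$. For positive real-valued sequences $a_n, b_n$ we say that $a_n = O(b_n)$ if there exists some positive constant $C$ such that for all $n \in \mathbb{N}$ it holds that $a_n \leq C b_n$. We consider a random couple $(X, Y)$ taking values in $\mathbb{R}^d \times \{0, 1\}$ with joint distribution $\mathbb{P}$. The vector $X \in \mathbb{R}^d$ is the feature vector and the binary variable $Y \in \{0, 1\}$ is the label; in what follows we assume that $\mathbb{P}(Y = 1) \neq 0$. We denote by $\mathbb{P}_X$ the marginal distribution of the feature vector $X$ and by $\eta(x) = \mathbb{P}(Y = 1 \mid X = x)$ the regression function. A classifier is any measurable function $g : \mathbb{R}^d \to \{0, 1\}$ and the set of all such functions is denoted by $\mathcal{G}$.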
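We assume that we have access to two datasets: the first dataset $\mathcal{D}_n^{\mathrm{L}}$ consists of $n \in \mathbb{N}$ i.i.d. copies of $(X, Y)$; and the second dataset $\mathcal{D}_N^{\mathrm{U}}$ consists of $N \in \mathbb{N}$ independent copies of $X$. Denote by $\mathbb{P}^{\otimes n}$ and $\mathbb{P}_X^{\otimes N}$ the distributions of $\mathcal{D}_n^{\mathrm{L}}$ and $\mathcal{D}_N^{\mathrm{U}}$ respectively. Moreover, we denote by $\mathbf{E}$ the expectation with respect to the distribution of $(\mathcal{D}_n^{\mathrm{L}}, \mathcal{D}_N^{\mathrm{U}})$, that is, with respect to $\mathbb{P}^{\otimes n} \otimes \mathbb{P}_X^{\otimes N}$ on the space $(\mathbb{R}^d \times \{0, 1\})^n \times (\mathbb{R}^d)^N$. We additionally assume that the size of the unlabeled dataset is not smaller than the size of the labeled dataset, that is, $N \geq n$. (Note that one can always satisfy this assumption by augmenting $\mathcal{D}_N^{\mathrm{U}}$ using a portion of $\mathcal{D}_n^{\mathrm{L}}$ and erasing labels. Typically, in practice it is easier to gather unlabeled data than labeled data, which is why this assumption is rather a formality.) For a given classifier $g$ we define its $F_b$-score for any $b > 0$ by
$$F_b(g) = \frac{\mathbb{P}(Y = 1,\, g(X) = 1)}{b^2\,\mathbb{P}(Y = 1) + \mathbb{P}(g(X) = 1)}.$$
(We decided to divide the classical definition of the $F_b$-score by the factor $(1 + b^2)$ to simplify the notation; thus, it is sufficient to multiply the obtained results by $(1 + b^2)$ to recover the results for the classical definition of the $F_b$-score.)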
A Bayes-optimal classifier $g^*$ is any classifier that maximizes the F-score over all classifiers $g \in \mathcal{G}$, that is,
$$g^* \in \operatorname*{arg\,max}_{g \in \mathcal{G}} F_b(g).$$
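As a rough illustration, the score being maximized can be estimated from a finite sample by replacing each probability with an empirical frequency. The sketch below is only illustrative: the function name `normalized_f_score` and the toy arrays are hypothetical, and it assumes the normalized form $\mathbb{P}(Y=1, g(X)=1) / (b^2\,\mathbb{P}(Y=1) + \mathbb{P}(g(X)=1))$, that is, the classical $F_b$-score divided by $(1+b^2)$.

```python
import numpy as np

def normalized_f_score(y_true, y_pred, b=1.0):
    """Empirical F_b-score divided by (1 + b^2) (normalization assumed in the lead-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    joint = np.mean(y_true * y_pred)                   # frequency of {Y = 1, g(X) = 1}
    denom = b**2 * np.mean(y_true) + np.mean(y_pred)   # b^2 P(Y = 1) + P(g(X) = 1)
    if denom == 0:
        raise ValueError("score undefined: no positive labels and no positive predictions")
    return joint / denom

# Hypothetical toy sample: one true positive, one false negative, one false positive.
y = [1, 1, 0, 0]
pred = [1, 0, 1, 0]
score = normalized_f_score(y, pred)    # 0.25 / (0.5 + 0.5)
classical_f1 = 2.0 * score             # multiply back by (1 + b^2) with b = 1
```

Multiplying by $(1+b^2)$ recovers the familiar value $2\,\mathrm{TP}/(2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$ for $b = 1$ on this toy sample.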
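It was established by Zhao_Edakunni_Pocock_Brown13 that a maximizer of the $F_1$-score can be obtained by comparing the regression function $\eta$ with a threshold $\theta^*$. Importantly, this threshold depends explicitly on the distribution $\mathbb{P}$ and can be obtained as the unique root of
$$\theta \mapsto \mathbb{P}(Y = 1)\,\theta - \mathbb{E}\,(\eta(X) - \theta)_+ ,$$
where $(a)_+ = \max(a, 0)$ denotes the positive part.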
One of the contributions of this work is an extension of the result of (Zhao_Edakunni_Pocock_Brown13, Section 6) to an arbitrary value of $b > 0$.
Theorem 2.1
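A Bayes-optimal classifier $g^*$ can be obtained point-wise for all $x \in \mathbb{R}^d$ as
$$g^*(x) = \mathbb{1}\{\eta(x) > \theta^*\},$$
where $\theta^*$ is a threshold which satisfies
$$b^2\,\mathbb{P}(Y = 1)\,\theta^* = \mathbb{E}\,(\eta(X) - \theta^*)_+ .$$
Moreover, the classifier $g^*$ satisfies $F_b(g^*) = \theta^*$.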
The proof can be found in Appendix A. Notice that if the optimal threshold $\theta^*$ is known a priori, the problem of binary classification with the F-score is no harder than the standard setting of binary classification with the accuracy as the measure of performance. As the threshold $\theta^*$ depends on the distribution $\mathbb{P}$, it can be estimated using data. Theorem 2.1 yields a trivial upper bound on the threshold $\theta^*$: since $F_b(g^*) = \theta^*$ and for any classifier $g$ the $F_b$-score is upper bounded by $1/(1 + b^2)$, we have $\theta^* \leq 1/(1 + b^2)$.
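The statement of Theorem 2.1 can be checked numerically on a small finite feature space. The following sketch is a sanity check under stated assumptions rather than part of the original analysis: it assumes the normalized score $F_b(g) = \mathbb{P}(Y=1, g(X)=1)/(b^2\mathbb{P}(Y=1) + \mathbb{P}(g(X)=1))$ and the threshold equation $b^2\,\mathbb{P}(Y=1)\,\theta = \mathbb{E}(\eta(X)-\theta)_+$; the three-point distribution and all function names are hypothetical.

```python
import itertools
import numpy as np

def f_score(eta, w, g, b=1.0):
    """Normalized F_b-score of a classifier g on a finite feature space."""
    return np.sum(w * eta * g) / (b**2 * np.sum(w * eta) + np.sum(w * g))

def optimal_threshold(eta, w, b=1.0, tol=1e-12):
    """Bisection for the root of theta -> b^2 P(Y=1) theta - E(eta(X) - theta)_+."""
    p1 = np.sum(w * eta)
    lo, hi = 0.0, 1.0   # the map is negative at 0 and positive at 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        value = b**2 * p1 * mid - np.sum(w * np.maximum(eta - mid, 0.0))
        if value > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

eta = np.array([0.1, 0.4, 0.9])   # hypothetical values of eta on three points
w = np.full(3, 1.0 / 3.0)         # uniform marginal distribution on those points
theta_star = optimal_threshold(eta, w)
g_star = (eta > theta_star).astype(float)

# Brute-force maximum of the score over all 2^3 classifiers.
best = max(f_score(eta, w, np.array(g, dtype=float))
           for g in itertools.product([0, 1], repeat=3))
```

On this toy distribution the thresholded rule attains the brute-force maximum, its score coincides with the threshold, and the threshold stays below $1/(1+b^2) = 1/2$.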
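For any classifier $g$ we define its excess score as
$$\mathcal{E}_b(g) = F_b(g^*) - F_b(g).$$
The excess score is the central object of our analysis and one of our goals is to provide an estimator whose excess score is as small as possible. Using Theorem 2.1 we can show that the excess score of any classifier can be written in a simple form.
Lemma 1
Let $g$ be any classifier and assume that $\mathbb{P}(Y = 1) \neq 0$, then
$$\mathcal{E}_b(g) = \frac{\mathbb{E}\,\big[\,|\eta(X) - \theta^*|\,\mathbb{1}\{g(X) \neq g^*(X)\}\big]}{b^2\,\mathbb{P}(Y = 1) + \mathbb{P}(g(X) = 1)}.$$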
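The identity of Lemma 1 can be verified by enumeration on a toy distribution. The sketch below assumes the form $F_b(g^*) - F_b(g) = \mathbb{E}\big[|\eta(X)-\theta^*|\,\mathbb{1}\{g(X) \neq g^*(X)\}\big] / \big(b^2\mathbb{P}(Y=1) + \mathbb{P}(g(X)=1)\big)$ together with the threshold equation $b^2\,\mathbb{P}(Y=1)\,\theta^* = \mathbb{E}(\eta(X)-\theta^*)_+$; the three-point distribution is hypothetical and chosen so that the threshold has a closed form.

```python
import itertools
import numpy as np

eta = np.array([0.1, 0.4, 0.9])   # hypothetical values of eta on three points
w = np.full(3, 1.0 / 3.0)         # uniform marginal distribution
b = 1.0
p1 = np.sum(w * eta)              # P(Y = 1)

def f_score(g):
    """Normalized F_b-score of the classifier g (array of 0/1 predictions)."""
    return np.sum(w * eta * g) / (b**2 * p1 + np.sum(w * g))

# On (0.1, 0.4) the threshold equation is linear:
# p1 * theta = (1/3) * ((0.4 - theta) + (0.9 - theta)), hence theta* = 1.3 / 3.4.
theta_star = 1.3 / 3.4
g_star = (eta > theta_star).astype(float)

max_gap = 0.0
for bits in itertools.product([0.0, 1.0], repeat=3):
    g = np.array(bits)
    lhs = f_score(g_star) - f_score(g)                       # excess score
    rhs = (np.sum(w * np.abs(eta - theta_star) * (g != g_star))
           / (b**2 * p1 + np.sum(w * g)))                    # weighted disagreement
    max_gap = max(max_gap, abs(lhs - rhs))
```

The largest discrepancy `max_gap` over all eight classifiers is at the level of floating-point rounding, which is consistent with the identity holding exactly.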
In general the Bayes optimal rule is not unique, Theorem 2.1 only states that one of the optimal classifiers has the form described by its statement. Even though, the function has unique root (see Appendix C for the proof), other thresholds may result in the same Bayes rule. Indeed, consider a simple example with , , then it is easy to see that the solution of is exactly , and every Bayes optimal classifier predicts one almost surely. Clearly, any threshold of the regression function results in the same classifier. Importantly, Lemma 1 and the equality are valid only for the threshold . In this work, we shall always refer to being the solution of and we call this threshold as the optimal threshold.
Remark 1
For the rest of the paper, we focus our attention only on the value $b = 1$ to simplify the presentation. It will be clear from our arguments that the generalization of the theoretical results of the paper to an arbitrary value $b > 0$ follows straightforwardly from our analysis.
Interestingly, the results above demonstrate that the problem of binary classification with F-score has a lot in common with the standard settings. Indeed, in both cases the Bayes optimal classifier is obtained via thresholding of the regression function and the expression for the excess risk is also similar. Consequently, in this work we address the following questions
- Q1. Is the problem of binary classification with F-score harder than its better-known counterpart? In particular, can the minimax analysis of Audibert_Tsybakov07 be extended to these settings and what is an optimal algorithm?
- Q2. In view of recent results of Chzhen_Denis_Hebiri19 , can the introduction of an unlabeled dataset improve classification algorithms in the context of F-score?
Let us point out that Lemma 1 is crucial for our analysis as it allows us to adapt the scheme provided by Audibert_Tsybakov07 for the standard setting of binary classification. Nevertheless, as the threshold $\theta^*$ is unknown beforehand, this machinery cannot be applied in a straightforward way and some effort is required. In this work, we pose assumptions on the distribution $\mathbb{P}$ similar to the ones used in Audibert_Tsybakov07 .
Assumption 1 ($\alpha$-margin assumption)
We say that the distribution of the pair satisfies the -margin assumption if there exist constants , and such that for every positive we have
[TABLE]
The case of “” is understood in the following manner Massart_Nedelec06 : there exists a constant such that
[TABLE]
typically this is the most advantageous situation for the binary classification, as the regression function is separated from the optimal threshold . Assumption 1 specifies the concentration rate of the regression function around the optimal threshold . Notice, if Assumption 1 is satisfied, it holds that for all
[TABLE]
where . This condition is tightly related to the rate of convergence in the case of the binary classification Audibert_Tsybakov07 ; Massart_Nedelec06 . The classification algorithm that is proposed in this work is based on a direct estimation of the regression function and the optimal threshold .
In the sequel, we consider the case of non-parametric estimation, that is, we assume that the regression function $\eta$ lies in some class of $\beta$-smooth functions and that the marginal distribution of $X$ admits a density w.r.t. the Lebesgue measure, supported on a well-behaved compact set and uniformly lower- and upper-bounded. The exact formal description of these assumptions is given in Section 4.2, where we prove optimality of our rates. For now, it is sufficient to assume that there exists a good estimator $\hat\eta$ of the regression function $\eta$ based on the labeled dataset.
Assumption 2 (Existence of estimator)
There exists an estimator based on which satisfies for all
[TABLE]
for some universal constants and an increasing sequence .
For instance, in the case of a $\beta$-smooth regression function (typically, one also needs to assume that the marginal distribution is well behaved, see Section 4.2), the typical non-parametric rate can be achieved by the local polynomial estimator, see (Audibert_Tsybakov07, Theorem 3.2). Finally, in this work we assume that the probability $\mathbb{P}(Y = 1)$ is lower bounded by some constant which can be arbitrarily small but fixed.
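To make the first step of the procedure concrete, here is a minimal sketch of a regression-function estimator. It deliberately swaps in the simpler Nadaraya-Watson estimator with a box kernel (a local polynomial estimator of degree zero) instead of the general local polynomial estimator cited above; the function name, the fallback rule for empty neighborhoods, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth):
    """Box-kernel estimate of eta(x) = P(Y = 1 | X = x).

    x_train has shape (n, d), x_query has shape (m, d), y_train has shape (n,).
    """
    x_train = np.asarray(x_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Pairwise Euclidean distances, shape (m, n).
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=2)
    weights = (dists <= bandwidth).astype(float)
    counts = weights.sum(axis=1)
    sums = weights @ y_train
    # Assumption: fall back to the global label frequency when no training
    # point lies within the bandwidth of a query point.
    fallback = y_train.mean()
    return np.where(counts > 0, sums / np.maximum(counts, 1.0), fallback)

x_train = [[0.0], [0.1], [1.0]]
y_train = [1, 1, 0]
x_query = [[0.05], [1.0], [5.0]]
estimates = nadaraya_watson(x_train, y_train, x_query, bandwidth=0.2)
```

For a $\beta$-smooth regression function in dimension $d$, the bandwidth is usually taken of order $n^{-1/(2\beta+d)}$; the value `0.2` above is arbitrary.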
Assumption 3 (Lower bounded $\mathbb{P}(Y = 1)$)
We assume that there exists a positive constant such that .
It is assumed that the constants are independent of both , however these constants can depend on the dimension of the problem , on the value of as well as on each other. The values of the constants are not going to impact the rates of convergence, though they might and will enter as numerical constants in front of the rate. In contrast, the value of in the margin assumption will explicitly appear in the obtained rates.
3 Related works and contributions
The literature on binary classification with F-score is rather broad; it spans both applied and theoretical studies of the problem. It should be noted that our work falls into the Population Utility (PU) approach Dembczynski_Kotlowski_Koyejo_Natarajan17 , that is, the expectation is taken in the numerator and the denominator of the F-score simultaneously. This approach should not be confused with the Expected Test Utility (ETU) approach, for which the non-asymptotic behavior can differ significantly. We refer the reader to Dembczynski_Kotlowski_Koyejo_Natarajan17 ; Ye_Chai_Lee_Chieu12 where the PU vs. ETU tale is discussed in depth and their asymptotic equivalence is established. Let us mention that the asymptotic statistical theory of binary classification with F-score has been studied in the prior literature Koyejo_Natarajan_Ravikumar_Dhillon14 ; Narasimhan_Vaish_Agarwal14 ; Menon_Narasimhan_Agarwal_Chawla13 ; Ye_Chai_Lee_Chieu12 . Below, we summarize our contributions and highlight the improvements with respect to the previous results on the non-asymptotic analysis of binary classification with F-score.
- We propose a two-step estimator, which first estimates the regression function $\eta$ and then the optimal threshold $\theta^*$. Two-step estimators of this type, which involve explicit threshold tuning, are well known in the literature and demonstrate promising empirical performance Koyejo_Natarajan_Ravikumar_Dhillon14 ; Keerthi_Sindhwani_Chapelle07 . An important novelty introduced here is the semi-supervised nature of the procedure, which can exploit the unlabeled data. It is already a well-established fact that semi-supervised methods might Singh_Nowak_Zhu09 or might not Rigollet07 improve supervised estimation from a statistical point of view. However, let us point out that, from a practical point of view, the most expensive part of the data gathering process is typically the correct labeling. Thus, one may assume that the unlabeled dataset is always available in reality and that the setting $N \geq n$ is satisfied. Our analysis implies that in the setting of binary classification with F-score the semi-supervised techniques are not superior to the supervised ones. In contrast, in Chzhen_Denis_Hebiri19 the authors showed that in the context of confidence set classification semi-supervised classifiers might outperform their supervised counterparts.
- From the theoretical point of view, the most relevant reference is a recent work of Yan_Koyejo_Zhong_Ravikumar18 , where the authors studied a rather broad class of performance measures for the problem of binary classification, namely Karmic measures. This class includes the F-score considered in the present manuscript. Under similar, though stronger, assumptions on the distribution of the pair $(X, Y)$ (the authors additionally require that the random variable $\eta(X)$ on $[0, 1]$ admits a bounded density) they proposed an algorithm and bounded its rate of convergence. The rate they obtain is rather counterintuitive, since it suggests that a large constant $\alpha$ in the margin assumption does not affect the rate of convergence. In contrast, here we show that for the proposed algorithm the rate of convergence is of order $n^{-(1+\alpha)\beta/(2\beta+d)}$. That is, it strictly improves upon the results in Yan_Koyejo_Zhong_Ravikumar18 whenever the constant $\alpha$ is large enough. However, it should be noted that the authors of Yan_Koyejo_Zhong_Ravikumar18 study a much more general family of score functions and the sub-optimal rate can result from such generality.
- We show that the constructed estimator is optimal in the minimax sense over the class of Hölder smooth regression functions. Let us mention that the optimality of the bound is expected, as in the classical work of Audibert_Tsybakov07 the authors showed that the minimax risk in the standard binary classification setting is of order $n^{-(1+\alpha)\beta/(2\beta+d)}$, and it is achieved by a plug-in rule classifier. Clearly, it is hard to expect that the rate can be improved in a more difficult situation. Nevertheless, to the best of our knowledge, minimax optimality in the context of binary classification with F-score has not been considered before.
The paper is organized as follows: in Section 4 we present the semi-supervised classification algorithm; in Section 4.1 we establish an upper bound on the excess F-score under the margin assumption; in Section 4.2 we introduce the class of distributions considered in this work and establish a minimax lower bound on the excess F-score.
4 Main results
In this section we describe the proposed procedure to estimate the Bayes optimal classifier $g^*$; this procedure is performed in two steps. In the first step we estimate the regression function $\eta$ using the labeled data, and in the second step we estimate the optimal threshold $\theta^*$ based on the unlabeled data and the estimator provided by the first step. This procedure falls into the category of plug-in type classifiers, that is, we formally replace all the unknown quantities in the Bayes rule by their estimates. That is, the classifier $\hat g$ is defined as
[TABLE]
where is any estimator satisfying Assumption 2 and is the unique solution of
[TABLE]
In practice one can use a simple bisection algorithm (Conte_Boor80, Algorithm 3.1) or its more sophisticated modifications (regula falsi or the secant method) to approximate $\hat\theta$ with any given precision. For our theoretical analysis we assume that Equation (2) is solved exactly. However, a simple modification of our arguments can handle the situation when the threshold is known only up to a small additive error.
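The second step can be sketched with a plain bisection. The code below is a hedged illustration, not the paper's implementation: it assumes that Equation (2) is the empirical analogue of the threshold equation, namely $b^2\,\hat p\,\theta = \frac{1}{N}\sum_i (\hat\eta(X_i) - \theta)_+$ with $\hat p = \frac{1}{N}\sum_i \hat\eta(X_i)$ computed on the unlabeled sample, and that the scores $\hat\eta(X_i)$ lie in $[0, 1]$. The names `estimate_threshold` and `plug_in_classify` are hypothetical.

```python
import numpy as np

def estimate_threshold(scores_unlabeled, b=1.0, tol=1e-10, max_iter=200):
    """Bisection for theta solving b^2 * p_hat * theta = mean((scores - theta)_+)."""
    scores = np.asarray(scores_unlabeled, dtype=float)
    p_hat = scores.mean()   # plug-in estimate of P(Y = 1) from unlabeled data
    if p_hat <= 0:
        raise ValueError("estimated P(Y = 1) must be positive")

    def h(theta):
        # Strictly increasing in theta; h(0) < 0 and h(1) > 0 for scores in [0, 1].
        return b**2 * p_hat * theta - np.maximum(scores - theta, 0.0).mean()

    lo, hi = 0.0, 1.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def plug_in_classify(scores, theta_hat):
    """Plug-in rule: predict 1 where the estimated regression function exceeds theta_hat."""
    return (np.asarray(scores, dtype=float) > theta_hat).astype(int)

theta_a = estimate_threshold([1.0, 1.0, 1.0])   # theta = (1 - theta)_+ gives 1/2
theta_b = estimate_threshold([0.5] * 4)         # 0.5 * theta = 0.5 - theta gives 1/3
labels = plug_in_classify([0.2, 0.6], theta_b)
```

Bisection is enough here because the map whose root is sought is continuous and strictly increasing on $[0, 1]$, so each halving step keeps the root bracketed.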
4.1 Upper bound
The main result of this subsection is an upper bound on excess score of the proposed procedure. Here we provide two theorems, the first one gives an upper bound on the expected difference between the optimal threshold and its estimate . The second one gives an upper bound on the excess F-score.
Theorem 4.1
If there exists an estimator of the regression function which satisfies Assumption 2, then there exists a constant which depends on such that, the threshold defined in Eq. (2) satisfies
[TABLE]
Theorem 4.2
If the distribution of satisfies the -margin assumption for some and and there exists an estimator of the regression function which satisfies Assumption 2, then there exists a constant which depends on such that
[TABLE]
where with the threshold defined in Equation (2).
Before proceeding to the proofs let us discuss the implications of these results. First of all, there are two regimes in the bound of Theorems 4.2, the first one is , in this regime, the dominant term is which is the classical rate of convergence in the standard settings of binary classification with the -margin assumption. The second regime is when , then the dominating term of the bound is . However, let us recall that one can always augment the second unlabeled dataset by dividing into two independent parts. It implies that the second regime never occurs in our theoretical analysis of the excess score and the upper bound is actually independent of . Similar reasoning holds for the case of the optimal threshold estimation in Theorem 4.1. Once it is clear that the obtained upper bounds are actually independent of the size of the unlabeled dataset it is interesting to notice that the dependence on is the same as in the standard case of the binary classification Audibert_Tsybakov07 . That is, similarly to the standard settings, the binary classification with F-score can achieve fast (faster than ) and even super-fast (faster than ) rate depending on the interplay of .
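To get a feel for the interplay of the parameters, one can tabulate the exponent of the rate $n^{-(1+\alpha)\beta/(2\beta+d)}$ stated in the abstract. The helper below is a small sketch; it assumes the usual convention that "fast" means faster than $n^{-1/2}$ and "super-fast" means faster than $n^{-1}$, and the function names are hypothetical.

```python
def rate_exponent(alpha, beta, d):
    """Exponent r in the rate n^{-r} with r = (1 + alpha) * beta / (2 * beta + d)."""
    return (1.0 + alpha) * beta / (2.0 * beta + d)

def regime(alpha, beta, d):
    """Classify the rate n^{-r} as slow, fast (r > 1/2) or super-fast (r > 1)."""
    r = rate_exponent(alpha, beta, d)
    if r > 1.0:
        return "super-fast"
    if r > 0.5:
        return "fast"
    return "slow"

examples = {
    (0.0, 1.0, 2): regime(0.0, 1.0, 2),   # r = 1/4
    (1.0, 1.0, 1): regime(1.0, 1.0, 1),   # r = 2/3
    (3.0, 1.0, 1): regime(3.0, 1.0, 1),   # r = 4/3
}
```

Since $r > 1$ is equivalent to $\alpha\beta > \beta + d$, super-fast rates require a large margin parameter relative to the dimension.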
Proofs of both theorems rely on the following lemma, proved in Appendix B, which relates the difference of the thresholds to the difference between the cumulative distribution function (CDF) of $\eta(X)$ and the empirical CDF of $\hat\eta(X)$.
Lemma 2
Let be the threshold which satisfies Equation 2, then
[TABLE]
This result is the main reason why our conclusions on semi-supervised estimation are different from the ones in Chzhen_Denis_Hebiri19 ; Singh_Nowak_Zhu09 . For instance, in Chzhen_Denis_Hebiri19 the authors also obtain a final decision rule by thresholding on some estimated level. However, in the present work the difference between $\hat\theta$ and $\theta^*$ is controlled via the $L_1$-norm of the difference of CDFs, whereas in Chzhen_Denis_Hebiri19 they control a similar quantity through the Wasserstein infinity distance.
The complete proof of Theorems 4.1 and 4.2 can be found in Appendix C, we only sketch the steps which are different from the analysis of Audibert_Tsybakov07 . Recall, that due to Lemma 1 we have the following bound for the excess score
[TABLE]
First of all, notice that if for some the event occurs, then we have
[TABLE]
which further implies that at least one of the following inequalities holds for this
[TABLE]
Thus, we can upper bound the excess risk as
[TABLE]
The first term on the right hand side of the inequality can be handled by the peeling technique used in (Audibert_Tsybakov07, Lemma 3.1), which implies that there exists a constant such that
[TABLE]
Hence, it remains to upper bound the second term on the right hand side of the inequality. Using Lemma 2 we can upper bound as
[TABLE]
with . Finally, we upper bound the indicator by the indicators of two events and which are defined as
[TABLE]
Thus, we have the following upper bound on
[TABLE]
Notice that thanks to the Dvoretzky-Kiefer-Wolfowitz inequality Dvoretzky_Kiefer_Wolfowitz56 ; Massart90 the term
[TABLE]
conditionally on admits an exponential concentration with the rate . Hence, using the margin assumption, one can effortlessly show there exists a constant such that
[TABLE]
For the second term we proceed as follows
[TABLE]
thus, using the -margin assumption we get
[TABLE]
the integral on the right hand side of the bound corresponds to the -Wasserstein distance on the real line, see for instance (Bobkov_Ledoux16, , Theorem 2.9) or Vallender74 for the proof, and can be further upper bounded by the norm between and , that is
[TABLE]
Since the estimator satisfies Assumption 2, one can show that there exists a constant such that
[TABLE]
Combination of all the inequalities yields the result of Theorem 4.2. Notice that the same reasoning starting from Lemma 2 implies the upper bound on the threshold estimation, that is, Theorem 4.1.
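The identification of the integrated CDF difference with the 1-Wasserstein distance on the real line can be illustrated on empirical distributions. The sketch below compares two ways of computing the distance between two samples of equal size: integrating $|F - G|$ over the line, and averaging the gaps between sorted samples. The function names and the toy samples are hypothetical.

```python
import numpy as np

def w1_via_cdfs(x, y):
    """Integral of |F_x(t) - F_y(t)| dt for the empirical CDFs of samples x and y."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.sort(np.concatenate([x, y]))
    # Both empirical CDFs are constant on each interval [grid[k], grid[k + 1]).
    cdf_x = np.searchsorted(x, grid[:-1], side="right") / len(x)
    cdf_y = np.searchsorted(y, grid[:-1], side="right") / len(y)
    return float(np.sum(np.abs(cdf_x - cdf_y) * np.diff(grid)))

def w1_via_sorted_samples(x, y):
    """Mean absolute gap between order statistics (valid for equal sample sizes)."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if len(x) != len(y):
        raise ValueError("this shortcut needs samples of equal size")
    return float(np.mean(np.abs(x - y)))

a = [0.0, 1.0]
b = [0.5, 2.0]
dist_cdf = w1_via_cdfs(a, b)
dist_sorted = w1_via_sorted_samples(a, b)
```

Both computations agree on the toy samples. The sorted-sample form also makes it plausible that the distance between the laws of $\eta(X)$ and $\hat\eta(X)$ is controlled by how close the two functions are, since pairing each $\eta(X_i)$ with $\hat\eta(X_i)$ is one admissible coupling.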
4.2 Lower bound
In the beginning of the section we state the class of distribution of the random pair considered in this work. The first assumption is made on smoothness of the regression function .
Definition 1 (Hölder smoothness)
Let and . The class of function consists of all functions such that for all , we have
[TABLE]
where is the Taylor polynomial of at point of degree .
Assumption 4 (Hölder regression function)
The distribution of the pair is such that for some positive .
Assumption 4 is usually not sufficient to guarantee the existence of an estimator satisfying Assumption 2: extra assumptions are required on the marginal distribution of the vector .
Definition 2
A Lebesgue measurable set is said to be -regular for some constants if for every and every we have
[TABLE]
where is the Lebesgue measure and is the Euclidean ball of radius centered at .
Assumption 5 (Strong density assumption)
We say that the marginal distribution of the vector satisfies the strong density assumption if
- the marginal distribution is supported on a compact set which is regular in the sense of Definition 2,
- the marginal distribution admits a density w.r.t. the Lebesgue measure which is uniformly lower- and upper-bounded by positive constants.
If the regression function is -Hölder and the marginal distribution satisfies the strong density assumption, one can state the following result due to Audibert_Tsybakov07 .
Theorem 4.3 (Audibert_Tsybakov07 )
Let be a class of distributions on such that the regression function and the marginal distribution satisfies the strong density assumption. Then, there exists an estimator of the regression function satisfying
[TABLE]
for some constants depending on .
Consider a class of distribution for which Assumptions 1, 3, 4, 5 are satisfied, then Theorem 4.3 and Theorems 4.1, 4.2 imply the following corollary.
Corollary 1
There exist constants which depend only on such that for any we have
[TABLE]
where the infima are taken over all estimators and respectively.
The next theorem states that the upper bounds of the previous corollary are optimal up to a constant multiplicative factor.
Theorem 4.4
If , there exist constants such that for any we have the following lower bound on the minimax risk
[TABLE]
where the infimum is taken over all estimators .
The proof of the lower bound can be found in Appendix D; it follows standard information-theoretic arguments using a reduction of the minimax risk to a Bayes risk. The construction of the distributions is inspired by both Rigollet_Vert09 and Audibert_Tsybakov07 , and the actual proof relies on (Audibert04, Lemma 5.1), which is based on Assouad’s lemma, see for instance (Tsybakov09, Lemma 2.12).
5 Conclusion
In this work we proposed a semi-supervised plug-in type algorithm for the problem of binary classification with F-score. The proposed algorithm can leverage an unlabeled dataset for the estimation of the optimal threshold. Under the margin assumption it is shown that the proposed algorithm is optimal in the minimax sense and can achieve fast rates of convergence. Further development of the binary classification with F-score will be devoted to empirical risk minimization rules.
Acknowledgements.
This work was partially supported by “Labex Bézout” of Université Paris-Est. Besides, we would like to thank Joseph Salmon and Mohamed Hebiri for their thoughtful remarks.
Appendix A Bayes classifier and Lemma 1
For the rest of this section the parameter is assumed to be fixed and known. Let us first recall the definition of the -score
[TABLE]
and an optimal classifier is defined as
[TABLE]
In this section we would like to show that a classifier defined for all as
[TABLE]
with being a root of
[TABLE]
Let us first show that is well-defined, that is, it exists and is unique for every distribution with . Hence, we would like to study solutions of the following equation
[TABLE]
Clearly, the mapping is continuous and strictly increasing on and the mapping is non-increasing on . Thus, it is sufficient to demonstrate that the mapping is continuous, indeed, let , then, due to the Lipschitz continuity of we can write
[TABLE]
This implies that the mapping is a contraction and thus is continuous. Hence, the threshold is well-defined, that is, it exists and is unique. Consequently, the classifier is well-defined.
Now, we are interested in the value , we can write
[TABLE]
using the definition of we continue as
[TABLE]
To conclude the optimality of we prove Lemma 1.
Proof
Fix an arbitrary measurable function , then by the definition of the excess score we have
[TABLE]
Using Theorem 2.1 we know that and therefore
[TABLE]
We conclude by solving the previous equality for . Thus, is a Bayes optimal classifier and hence can be denoted by .
Appendix B Proof of Lemma 2
Proof
To prove this lemma, it is convenient to rewrite Equation 2 in terms of CDF. Let be an arbitrary probability measure on and be any measurable function, then using Fubini’s theorem we can write
[TABLE]
and for any , since we have
[TABLE]
Let us denote by the empirical measure of the unlabeled dataset . Using these equalities, the thresholds satisfy
[TABLE]
Now, we are in position to bound the difference , first assume that , then
[TABLE]
Further, if we can write
[TABLE]
where the last inequality follows the same lines as for the case .
Appendix C Proof of the upper bound
Let be an estimator of the regression function based on the labeled dataset which satisfies Assumption 2. Recall, that the estimator is defined for every as
[TABLE]
with being the unique solution of Eq. (2). Unless stated otherwise, we work conditionally on . Using Lemma 1 we can express the excess score of as
[TABLE]
Clearly, on the event it holds that , thus
[TABLE]
Using, Lemma 2 the excess risk can be further upper bounded as
[TABLE]
Notice that , with being the cumulative distribution functions of respectively, corresponds to the 1-Wasserstein distance, see Bobkov_Ledoux16 for an in-depth discussion. Therefore, we have
[TABLE]
and introducing notation for the empirical measure of the feature vector we can write
[TABLE]
Finally, using the margin Assumption 1 we can write
[TABLE]
Taking the expectation of both sides with respect to the distribution of , we follow (Audibert_Tsybakov07, Lemma 3.1) to bound the first term on the right hand side. This peeling argument has become classical in the literature and thus is omitted here. Moreover, using Assumption 2 the second term can be bounded with the same rate as the first term. These arguments imply that there exists such that for all it holds that
[TABLE]
It remains to upper bound the second term in the bound above, to this end we recall the classical Dvoretzky-Kiefer-Wolfowitz inequality Massart90
Lemma 3 (Dvoretzky-Kiefer-Wolfowitz inequality)
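Given $N \in \mathbb{N}$, let $Z_1, \ldots, Z_N$ be i.i.d. real-valued random variables with cumulative distribution function $F_Z$, denote by $\hat F_Z$ the cumulative distribution function with respect to the empirical measure, that is, with respect to $\frac{1}{N}\sum_{i=1}^{N} \delta_{Z_i}$, then for every $t > 0$ we have
$$\mathbb{P}\Big(\sup_{z \in \mathbb{R}} \big|\hat F_Z(z) - F_Z(z)\big| > t\Big) \leq 2\exp(-2Nt^2).$$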
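The Dvoretzky-Kiefer-Wolfowitz inequality in its usual form, $\mathbb{P}(\sup_z |\hat F(z) - F(z)| > t) \leq 2\exp(-2Nt^2)$, can be probed with a quick Monte Carlo experiment. The sketch below uses uniform samples on $[0, 1]$ so that the true CDF is $F(z) = z$; the sample size, the level `t`, the number of repetitions, and the function names are arbitrary choices made for illustration.

```python
import numpy as np

def sup_cdf_distance_uniform(sample):
    """sup_z |F_N(z) - z| for a sample in [0, 1] against the Uniform(0, 1) CDF."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    upper = np.arange(1, n + 1) / n - x   # empirical CDF right after each jump minus F
    lower = x - np.arange(0, n) / n       # F minus empirical CDF right before each jump
    return float(max(upper.max(), lower.max()))

def dkw_bound(n, t):
    """Right hand side of the inequality: 2 * exp(-2 * n * t^2)."""
    return float(2.0 * np.exp(-2.0 * n * t**2))

rng = np.random.default_rng(0)
n, t, trials = 200, 0.1, 2000
exceed_frequency = float(np.mean(
    [sup_cdf_distance_uniform(rng.uniform(size=n)) > t for _ in range(trials)]
))
bound = dkw_bound(n, t)
```

Comparing `exceed_frequency` with `bound` gives an empirical impression of how tight the exponential concentration is for moderate sample sizes.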
Let us apply this lemma to , conditionally on these random variables are i.i.d. real-valued, thus for all
[TABLE]
Finally, to conclude the upper bound we apply this exponential concentration to upper bound the expectation as
[TABLE]
where we used the shortcut for the desired empirical process. Combining all the bounds we conclude.
Appendix D Proof of the lower bound
Proof
The proof is similar to the one used in Audibert_Tsybakov07 and in Rigollet_Vert09 and is based on Assouad’s lemma. Similarly, we define the regular grid on as
[TABLE]
and denote by as the closest point to of the grid to the point . Such a grid defines a partition of the unit cube denoted by . Besides, denote by for all . For a fixed integer and for any define , . For every we define a regression function as
[TABLE]
where are to be specified and are Euclidean balls of radius and respectively. The definition of the function is exactly the same as in Audibert_Tsybakov07 . That is, with some non-increasing infinitely differentiable function such that for and for . The function is defined as , where is non-decreasing infinitely differentiable function such that for and for . The constant is chosen big enough to ensure that for any .
For any we construct a marginal distribution which is independent of and has a density w.r.t. the Lebesgue measure on . Fix some and set a Euclidean ball in that has an empty intersection with and whose Lebesgue measure is . The density is constructed as
- •
for every and every or ,
- •
for every ,
- •
for every other .
To complete the construction it remains to specify the value of . The idea here is to force the optimal threshold to be equal to some predefined constant using the additional degree of freedom provided by the parameter . Importantly, this optimal threshold should not depend on the binary vector . To achieve this recall that we set and show that there exists an appropriate choice of . First, recall that the optimal threshold satisfies
[TABLE]
Define and put , notice that the left hand side of the last equality for every is given by
[TABLE]
For the right hand side , there are two cases and , one can easily show that as long as there are no values of which allow to fix . Therefore, and we can write for every
[TABLE]
Finally, the parameter must satisfy the following equality
[TABLE]
solving for we get
[TABLE]
If we can ensure that the value of , that is, it is a valid choice for the regression function. Let us demonstrate that the margin assumption 1 holds for an appropriate choice of and . Define , then for every we have
[TABLE]
as long as we can continue as
[TABLE]
Therefore, if is of order the margin assumption is satisfied with . The strong density assumption can be checked similarly to Audibert_Tsybakov07 . To finish the proof, for every we denote by the distribution of with the marginal and the regression function . Thus, one can write for any
[TABLE]
where is the expectation taken w.r.t. the i.i.d. realizations of and from and respectively, and if and if . The rest of the proof is obtained following the proof of (Audibert04, Lemma 5.1) and in particular the chain of inequalities in (Audibert04, Eq. (6.26)). That is, we get for some independent from
[TABLE]
Finally, we conclude by setting the parameters as
[TABLE]
Note that thanks to the condition such a choice is always valid for appropriately chosen constants .
References
- (1) Audibert, J.Y.: Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist. 40 (6), 685–736 (2004)
- (2) Audibert, J.Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Ann. Statist. 35 (2), 608–633 (2007)
- (3) Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3 (Spec. Issue Comput. Learn. Theory), 463–482 (2002)
- (4) Bobkov, S., Ledoux, M.: One-dimensional empirical measures, order statistics and Kantorovich transport distances (2016). To appear in the Memoirs of the Amer. Math. Soc.
- (5) Chzhen, E., Denis, C., Hebiri, M.: Minimax semi-supervised confidence sets for multi-class classification (2019). Preprint, https://arxiv.org/abs/1904.12527
- (6) Conte, S., Boor, C.: Elementary Numerical Analysis: An Algorithmic Approach, 3rd edn. McGraw-Hill Higher Education (1980)
- (7) Dembczynski, K., Kotłowski, W., Koyejo, O., Natarajan, N.: Consistency analysis for binary classification revisited. In: ICML, pp. 961–969. JMLR.org (2017)
- (8) Dvoretzky, A., Kiefer, J., Wolfowitz, J.: Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27 (3), 642–669 (1956)
