Minimax semi-supervised confidence sets for multi-class classification
Evgenii Chzhen (LAMA), Christophe Denis (LAMA), Mohamed Hebiri (LAMA)

TL;DR
This paper develops semi-supervised confidence set classifiers for multi-class problems, achieving faster convergence rates than supervised methods under certain assumptions, with theoretical guarantees and empirical validation.
Contribution
It introduces a semi-supervised minimax framework for confidence set classification with controlled size, establishing convergence rates and demonstrating superiority over supervised methods.
Findings
Semi-supervised estimators outperform supervised ones with enough unlabeled data.
Achieves faster convergence rates under margin and Hölder conditions.
Empirical results confirm theoretical convergence improvements.
Abstract
In this work we study the semi-supervised framework of confidence set classification with controlled expected size in minimax settings. We obtain semi-supervised minimax rates of convergence under the margin assumption and a Hölder condition on the regression function. Besides, we show that if no further assumptions are made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work. We establish that the best achievable rate for any supervised method is $n^{-1/2}$, even if the margin assumption is extremely favorable. On the contrary, semi-supervised estimators can achieve faster rates of convergence provided that sufficiently many unlabeled samples are available. We additionally perform numerical evaluation of the proposed algorithms empirically confirming our theoretical findings.
| $\beta$ | $\beta$-Oracle | $\beta$-Oracle |
|---|---|---|
| 2 | 0.05 (0.01) | 0.09 (0.01) |
| 5 | 0.00 (0.00) | 0.01 (0.00) |
| 2 | 2.00 (0.03) | 2.00 (0.03) |
| 5 | 5.00 (0.08) | 5.00 (0.06) |
| 10 | 10.00 (0.13) | |
| 20 | 20.02 (0.31) | |

| $\beta$ | rforest | softmax reg | deep learn | rforest | softmax reg | deep learn |
|---|---|---|---|---|---|---|
| 2 | 0.09 (0.01) | 0.06 (0.01) | 0.09 (0.01) | 0.13 (0.01) | 0.10 (0.01) | 0.13 (0.02) |
| 5 | 0.01 (0.00) | 0.00 (0.00) | 0.01 (0.00) | 0.02 (0.00) | 0.01 (0.00) | 0.02 (0.00) |

| $\beta$ | rforest | softmax reg | deep learn | rforest | softmax reg | deep learn |
|---|---|---|---|---|---|---|
| 2 | 2.01 (0.09) | 2.01 (0.10) | 2.02 (0.11) | 2.00 (0.02) | 2.00 (0.03) | 2.00 (0.03) |
| 5 | 5.02 (0.18) | 4.99 (0.20) | 5.00 (0.21) | 5.00 (0.06) | 5.00 (0.08) | 5.00 (0.07) |

Evgenii Chzhen, Christophe Denis, Mohamed Hebiri
LAMA, Université Paris-Est – Marne-la-Vallée
Cité Descartes, Bâtiment Copernic
5 boulevard Descartes
77454 Marne-la-Vallée cedex 2
MSC subject classifications: 62G05, 62G30, 62H05, 68T10.
Keywords: multi-class classification, confidence sets, minimax optimality, semi-supervised classification.
This work was partially supported by “Labex Bézout” of Université Paris-Est.
1 Introduction
Let $K \geq 2$ be the number of classes and let $(X, Y)$ be a random couple distributed according to a distribution $\mathbb{P}$ on $\mathbb{R}^d \times [K]$, where $[K] := \{1, \ldots, K\}$, $X$ is seen as the feature vector and $Y$ as the class. This problem falls within the scope of the multi-class setting where the goal is to predict the label $Y$ for a given feature $X$. Commonly, prediction is performed by a classifier that outputs a single label. However, in the confidence set framework, the objective differs: we aim at predicting a set of labels instead of a single one. This problem has been studied in a few works, and we consider in this contribution the setup put forward by Denis and Hebiri (2017). The essential feature of their perspective is the control of the size of confidence sets in expectation. While they provided a procedure to build confidence sets based on Empirical Risk Minimization (ERM) and established upper bounds, the present work aims at giving a general analysis of the confidence set problem in the minimax sense.
1.1 Problem statement
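Throughout the paper, we denote by $\mathbb{P}_X$ the marginal distribution of $X \in \mathbb{R}^d$ and by $p = (p_1, \ldots, p_K)^\top$ the regression function, defined for all labels $k \in [K]$ and all $x \in \mathbb{R}^d$ as $p_k(x) := \mathbb{P}(Y = k \mid X = x)$. For any sets $A, A'$ we denote by $A \triangle A'$ their symmetric difference. We assume that two data samples are available. The first, labeled sample $\mathcal{D}_n$ consists of $n$ i.i.d. copies of $(X, Y)$ and the second, unlabeled sample $\mathcal{D}_N$ consists of $N$ i.i.d. copies of $X$.
A confidence set classifier $\Gamma$ is a measurable function from $\mathbb{R}^d$ to $2^{[K]}$, that is, $\Gamma : \mathbb{R}^d \to 2^{[K]}$, and we denote by $\Upsilon$ the set of all such functions. For any confidence set $\Gamma$ we define its error and its information as
$$\mathrm{P}(\Gamma) := \mathbb{P}\big(Y \notin \Gamma(X)\big), \qquad \mathrm{I}(\Gamma) := \mathbb{E}_{\mathbb{P}_X}|\Gamma(X)|,$$
respectively, where $\mathbb{E}_{\mathbb{P}_X}$ stands for the expectation w.r.t. the marginal distribution of $X$ and $|\Gamma(x)|$ is the cardinal of $\Gamma$ at $x$.
For a fixed integer $\beta \in [K]$, a $\beta$-Oracle confidence set $\Gamma^*_\beta$ is defined as
$$\Gamma^*_\beta \in \arg\min\big\{\mathrm{P}(\Gamma) : \Gamma \in \Upsilon,\ \mathrm{I}(\Gamma) = \beta\big\}.$$
The set $\{\Gamma \in \Upsilon : \mathrm{I}(\Gamma) = \beta\}$ is always non-empty, as it always contains those confidence sets whose cardinal is equal to $\beta$ for every $x \in \mathbb{R}^d$.
The description of a $\beta$-Oracle confidence set in a general situation might be complicated. Hence, we introduce the following mild assumption, which yields an explicit expression.
Assumption 1.1 (Continuity of CDF).
For all $k \in [K]$ the cumulative distribution function (CDF) of $p_k(X)$ is continuous on $(0, 1)$.
Proposition 1.2 ($\beta$-Oracle confidence set).
Fix $\beta \in [K]$, and let the function $G$ be defined for all $t \in [0, 1]$ as
$$G(t) := \sum_{k=1}^{K} \mathbb{P}_X\big(p_k(X) > t\big),$$
then under Assumption 1.1 a $\beta$-Oracle confidence set can be obtained as
$$\Gamma^*_\beta(x) = \big\{k \in [K] : p_k(x) \geq G^{-1}(\beta)\big\}, \tag{1.1}$$
where we denote by $G^{-1}$ the generalized inverse of $G$, defined for all $\beta$ as
$$G^{-1}(\beta) := \inf\big\{t \in [0, 1] : G(t) \leq \beta\big\}.$$
Proposition 1.3.
Assume that Assumption 1.1 is fulfilled, then the $\beta$-Oracle $\Gamma^*_\beta$ defined in Eq. (1.1) is a minimizer of the following risk
$$\mathrm{R}_\beta(\Gamma) := \mathrm{P}(\Gamma) + G^{-1}(\beta)\, \mathrm{I}(\Gamma).$$
These propositions have been proven in (Denis and Hebiri, 2017, Proposition 4 and Proposition 7). Consequently, the accuracy of a confidence set $\Gamma$ can for instance be quantified according to its excess risk
$$\mathrm{R}_\beta(\Gamma) - \mathrm{R}_\beta(\Gamma^*_\beta).$$
The statistical learning problem is then to estimate $\Gamma^*_\beta$ given the data samples $\mathcal{D}_n$ and $\mathcal{D}_N$. The formulation in Eq. (1.1) of the $\beta$-Oracle appears to be closely related to the level set estimation problem (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). Hence at first sight, the introduction of an unlabeled sample may be surprising. However, in our setup the estimation of the $\beta$-Oracle relies not only on the regression function $p$ but also on the threshold $G^{-1}(\beta)$, which is unknown beforehand and can be estimated in a semi-supervised way (Denis and Hebiri, 2017). To fix ideas, we give some examples of possible estimation procedures of $\Gamma^*_\beta$.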
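The thresholding description of the oracle can be made concrete on a toy model. The sketch below is only an illustration under stated assumptions: the three-class model on a grid of $(0, 1)$, and the helper names `generalized_inverse` and `error_and_information`, are hypothetical and do not come from the paper. It computes an empirical analogue of the threshold, the resulting thresholded sets, and their error and information, and compares them with the sets that keep only the most probable label.

```python
import numpy as np

def generalized_inverse(scores, beta):
    """Empirical analogue of the oracle threshold: the smallest t such that
    the average number of classes with score above t is at most beta."""
    m, n_classes = scores.shape
    rank = int(beta * m)
    if rank >= m * n_classes:  # every class may be kept, so the threshold is 0
        return 0.0
    return float(np.sort(scores, axis=None)[::-1][rank])

def error_and_information(sets, probs):
    """Error P(Y not in set) under the given class probabilities, and mean set size."""
    covered = np.where(sets, probs, 0.0).sum(axis=1)
    return float(1.0 - covered.mean()), float(sets.sum(axis=1).mean())

# Toy model (an assumption for illustration): X on a regular grid of (0, 1), K = 3.
m, beta = 1000, 1
x = (np.arange(m) + 0.5) / m
p1 = 0.1 + 0.8 * x
probs = np.column_stack([p1, 0.7 * (1 - p1), 0.3 * (1 - p1)])

theta = generalized_inverse(probs, beta)
oracle_sets = probs >= theta                       # thresholding of the class probabilities
top1_sets = probs == probs.max(axis=1, keepdims=True)

oracle_err, oracle_info = error_and_information(oracle_sets, probs)
top1_err, top1_info = error_and_information(top1_sets, probs)
print(theta, oracle_err, oracle_info, top1_err, top1_info)
```

On this toy model the thresholded sets have average size close to the target while their error is no larger than that of the single-label sets, which is the trade-off the oracle is designed to optimize.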
1.2 Confidence set estimators
An estimator is a measurable function that maps any given data samples into a confidence set classifier. We shall distinguish two types of estimators: supervised and semi-supervised whose formal definition is provided below.
Definition 1.4 (Supervised and semi-supervised estimators).
A measurable mapping
[TABLE]
is called a supervised estimator if for any and any data samples , , and it holds that
[TABLE]
Otherwise the estimator is called semi-supervised. In the sequel, for the simplicity of notation we write instead of where no ambiguity is present.
Intuitively, the supervised estimators do not take into account the information that is provided by the unlabeled sample. Besides, if we denote by the set of all estimators, Definition 1.4 generates a natural partition of into two disjoint sets: the supervised estimators and the semi-supervised estimators .
Hereafter, we provide three examples of estimation procedures which are at the core of our study. All these methods rely on the plug-in principle.
- Top-$\beta$ procedure. This method is the most intuitive estimator in the considered context. It is a supervised procedure, that is, it is based only on the labeled sample. Let us consider an estimator $\hat{p}$ of the regression function $p$. Let be the order statistic associated to , such that for all we have . A top-$\beta$ confidence set is then defined as
[TABLE]
- Supervised procedure. Formally, this type of method uses only the labeled sample (the unlabeled sample is ignored). We split into two independent samples such that . Based on the first sample , we consider an estimator of the regression function . Furthermore, we define
[TABLE]
and one type of supervised estimator is then defined as follows
[TABLE]
Interestingly, conditional on the data sample , the definition of the estimator does not involve the labels associated to . As a consequence, we can naturally consider a semi-supervised version of this estimator.
- Semi-supervised procedure. Based on , we consider an estimator of the regression function . Furthermore, we define
[TABLE]
and one type of semi-supervised estimator is then defined as follows
[TABLE]
One can note that these procedures are based on a preliminary estimator of the regression function built from labeled data, that is, all of them are plug-in type procedures. However, these procedures differ in the construction of the output set. The top-$\beta$ procedure and the supervised procedure rely only on the labeled data while the semi-supervised estimator takes advantage of the information provided by the unlabeled data. The top-$\beta$ procedure is the simplest among them: its output contains exactly $\beta$ labels at every point. At the same time, the others are more involved and can have different cardinals for different values of $x$. Nevertheless, for the other two procedures one can guarantee .
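The three plug-in procedures described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: `p_hat` is a hypothetical stand-in for any estimator of the regression function fitted on labeled data (here a fixed softmax of linear scores), the data are synthetic, and the threshold is taken as the empirical generalized inverse of the average number of scores above a level, which is an assumption about the elided formulas.

```python
import numpy as np

def top_beta_sets(scores, beta):
    """Keep the beta highest-scoring classes at every point (fixed cardinality)."""
    order = np.argsort(-scores, axis=1, kind="stable")[:, :beta]
    sets = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(sets, order, True, axis=1)
    return sets

def plug_in_threshold(calibration_scores, beta):
    """Smallest t such that the average number of calibration scores above t is <= beta.
    Only features are needed here, never labels."""
    m, n_classes = calibration_scores.shape
    rank = int(beta * m)
    if rank >= m * n_classes:
        return 0.0
    return float(np.sort(calibration_scores, axis=None)[::-1][rank])

def threshold_sets(scores, threshold):
    """Keep every class whose estimated probability reaches the threshold."""
    return scores >= threshold

def p_hat(features):
    """Hypothetical fitted estimator of the class probabilities (assumption)."""
    weights = np.array([[2.0, -1.0, 0.5, 0.0], [0.0, 1.5, -0.5, 1.0]])
    logits = features @ weights
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
beta = 2
held_out_labeled_x = rng.normal(size=(100, 2))  # features of the second labeled half
unlabeled_x = rng.normal(size=(2000, 2))        # features of the unlabeled sample
new_x = rng.normal(size=(5, 2))

top_sets = top_beta_sets(p_hat(new_x), beta)
t_sup = plug_in_threshold(p_hat(held_out_labeled_x), beta)   # supervised threshold
t_semi = plug_in_threshold(p_hat(unlabeled_x), beta)         # semi-supervised threshold
sup_sets = threshold_sets(p_hat(new_x), t_sup)
semi_sets = threshold_sets(p_hat(new_x), t_semi)
semi_size_on_unlabeled = threshold_sets(p_hat(unlabeled_x), t_semi).sum(axis=1).mean()
print(top_sets.sum(axis=1), sup_sets.sum(axis=1), semi_sets.sum(axis=1))
```

The only difference between the supervised and the semi-supervised variants in this sketch is the sample used to calibrate the threshold, which is why a larger unlabeled sample can be substituted without touching the estimator of the class probabilities.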
These examples give rise to natural questions which form the core of our theoretical study and which are summarized below.
1. The first question is the statistical performance of these plug-in procedures, which is assessed through rates of convergence and their optimality in the minimax sense.
2. The second question focuses on the benefit of the semi-supervised approach. Roughly speaking, are there situations where the semi-supervised approach outperforms the supervised one and how can it be quantified?
3. The third question concentrates on the reason why it is more relevant for this problem to consider more involved estimators than the simple top-$\beta$ method.
1.3 Minimax estimation
For a given family of joint distributions on , a given estimator , and fixed integers , , we are interested in the following maximal risks
[TABLE]
where denotes the expectation w.r.t. . These maximal risks arise naturally in the context of confidence set estimation with controlled expected size. The risk corresponds to the estimation of the $\beta$-Oracle through the Hamming distance. The second risk is directly connected with Proposition 1.2, which gives a description of the $\beta$-Oracle as a minimizer of . As the goal in this problem is to construct a procedure that exhibits a low error and a low cardinal discrepancy , it is natural to consider which is composed of both.
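For a fixed distribution, the three performance measures discussed above can be evaluated as in the sketch below. The exact formulas are elided in this text, so the code relies on assumptions stated here: the Hamming risk is taken as the average size of the symmetric difference with the oracle, the excess risk uses error plus threshold times information, and the discrepancy is taken as the sum of the absolute error gap and the absolute gap between the information and the target size. All function names and the four-point example are hypothetical.

```python
import numpy as np

def error(sets, probs):
    """P(Y not in set), computed from the class probabilities."""
    return float(1.0 - np.where(sets, probs, 0.0).sum(axis=1).mean())

def information(sets):
    """Average cardinality of the sets."""
    return float(sets.sum(axis=1).mean())

def hamming_risk(sets, oracle):
    """Average size of the symmetric difference with the oracle sets."""
    return float((sets != oracle).sum(axis=1).mean())

def excess_risk(sets, oracle, probs, theta):
    """Risk gap for the risk error + theta * information (assumed form)."""
    def risk(s):
        return error(s, probs) + theta * information(s)
    return risk(sets) - risk(oracle)

def discrepancy(sets, oracle, probs, beta):
    """Assumed form: error gap plus gap between information and the target size."""
    return abs(error(sets, probs) - error(oracle, probs)) + abs(information(sets) - beta)

# Four equally likely points, three classes (toy values).
probs = np.array([[0.70, 0.20, 0.10],
                  [0.45, 0.45, 0.10],
                  [0.50, 0.30, 0.20],
                  [0.34, 0.33, 0.33]])
theta, beta = 0.4, 1
oracle = probs >= theta                                # set sizes 1, 2, 1, 0
top1 = np.eye(3, dtype=bool)[probs.argmax(axis=1)]     # exactly one label per point

h = hamming_risk(top1, oracle)
e = excess_risk(top1, oracle, probs, theta)
d = discrepancy(top1, oracle, probs, beta)
print(h, e, d)
```

In this example the single-label sets have the right average size, yet they differ from the thresholded sets on half of the points, so the Hamming risk is large while the excess risk and the discrepancy stay small.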
Finally, we are in position to define the notion of the minimax rate. The minimax rate in this context is not only determined by the family of distributions but also by the family of estimators that we consider.
Definition 1.5 (Minimax rate of convergence).
For a given family of joint distributions on and a given family of estimators the minimax rates are defined as
[TABLE]
where is , or .
The main families of estimators that we study are the supervised and the semi-supervised estimators. Obviously, since and , we have the following relation
[TABLE]
As a consequence, a lower and an upper bound on , yield bounds on the minimax rate over all estimators.
1.4 Related works
Confidence set approach for classification was pioneered by Vovk (2002a, b); Vovk, Gammerman and Shafer (2005) by the means of conformal prediction theory. They rely on non-conformity measures which are based on some pattern recognition methods, and develop an asymptotic theory. In this work, we consider a statistical perspective of confidence set classification and put our focus on non-asymptotic minimax theory.
The problem of confidence set multi-class classification has strong ties with binary classification with reject option, also known as binary classification with abstention in the machine learning literature. In binary classification with rejection, a classifier is allowed to output some special symbol, which indicates the rejection. Such classifiers can be seen as confidence sets, which are allowed to output or and are interpreted as reject. This line of research was initiated by Chow (1957, 1970) in the context of information retrieval, where a predefined cost of rejection was considered. An extensive statistical study of this framework was carried out in (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp and Yuan, 2011).
Instead of considering a fixed cost for rejection, which might be too restrictive, one may define two quantities: the probability of rejection and the probability of misclassification. In the spirit of conformal prediction, Lei (2014) aims at minimizing the probability of rejection provided a fixed upper bound on the probability of misclassification. In contrast, Denis and Hebiri (2015) consider the reversed problem of minimizing the probability of misclassification given a fixed upper bound on the probability of rejection.
Once the multi-class classification is considered, there are several possible ways to extend the binary case: the confidence set approach and the rejection approach. The reject counterpart is a more studied and known version, though it lacks statistical analysis. To the best of our knowledge the only work which provides statistical guarantees is (Ramaswamy, Tewari and Agarwal, 2018).
As for the confidence set approach there are again two possibilities, similar to the binary case. The one that is considered in this work was proposed by Denis and Hebiri (2017), where the authors analyse an ERM algorithm and derive oracle inequalities under the margin assumption (Tsybakov, 2004). More specifically, they consider a convex surrogate of the error which relies on a convex real valued loss function . For a suitable choice of the convex function they show that, under Assumption 1.1, their -Oracle satisfies
[TABLE]
where the function depends on and the value of is defined similarly to the present manuscript. They propose a two step estimation procedure of the -Oracle set. Based on the ERM algorithm, they first estimate and in the second step, they estimate the threshold with an unlabeled sample. This procedure is in the same spirit as the semi-supervised procedure (1.5). Under mild assumptions, they provide an upper bound on the excess risk and obtain a rate of convergence of order , with being a parameter that depends on the function and being the margin parameter. Note that this rate is slower than the rate obtained in the standard classification framework.
The conformal prediction theory (Vovk, Gammerman and Shafer, 2005) suggests minimizing the information level with a fixed budget on the error level. Statistical properties of this framework were considered in the work of Sadinle, Lei and Wasserman (2018). Their objective is formulated for some as
[TABLE]
and such a confidence set is called a least ambiguous confidence set with bounded error rate. The authors show that under Assumption 1.1 this oracle set can be described as a thresholding of the regression function
[TABLE]
where the threshold is defined as
[TABLE]
Notice that this framework is very similar to (Denis and Hebiri, 2017) in the treatment of the Bayes optimal confidence set, as in both cases they are obtained via thresholding of the posterior distribution of the labels. Sadinle, Lei and Wasserman (2018) also proceed in two steps as here, that is, they first estimate the posterior distribution for all and estimate the threshold after. However, they require the second labeled dataset for the estimator of , due to the presence of , the marginal distribution of the labels. Besides, their theoretical analysis is carried out under a different set of assumptions on the joint distribution . Apart from the standard margin assumption, they require a so-called detectability, that is, they require that the upper bound in the margin assumption is tight. Under these assumptions they provide an upper bound on the Hamming excess risk and obtain a rate of convergence of order .
Interestingly, both approaches can be encompassed into the constrained estimation framework (Anbar, 1977; Lepskii, 1990; Brown and Low, 1996), where one would like to construct an estimator with some prescribed properties. These properties are typically reflected by the form of the risk which in our case is the discrepancy measure, that is, the sum of error and information discrepancies. Thus, both frameworks of Sadinle, Lei and Wasserman (2018); Denis and Hebiri (2017) can be seen as an extension of the constrained estimation to the classification problems. From the modeling point of view, we believe that the two frameworks can co-exist nicely and a particular choice depends on the considered application. The major difference between the present work and those by Denis and Hebiri (2017) and Sadinle, Lei and Wasserman (2018) is the minimax analysis which we provide here and our treatment of semi-supervised techniques.
As already pointed out, the confidence set estimation problem is closely related to the level set estimation setup (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). This problem focuses on the estimation of a level set defined as
[TABLE]
where is the density of the observations and is some fixed value. Given a sample distributed according to the density the goal is to estimate . In (Rigollet and Vert, 2009), the authors study plug-in density level set estimators through the measure of symmetric differences and the excess mass. In confidence set estimation the measure of symmetric differences is the Hamming risk whereas the excess mass is the excess risk. They show that kernel based estimators are optimal in the minimax sense over a Hölder class of densities and under a margin type assumption (Polonik, 1995; Tsybakov, 2004). In particular, they derive fast rates of convergence, that is, faster than $n^{-1/2}$, for the excess mass. In the level set estimation problem, the threshold is chosen beforehand, whereas in our work the threshold depends on the distribution of the data, which makes the statistical analysis more difficult.
On the other hand, the confidence set estimation problem is directly related to the standard classification settings. This problem has been widely studied from a theoretical point of view in the binary classification framework. Audibert and Tsybakov (2007) study the statistical performance of plug-in classification rules under assumptions which involve the smoothness of the regression function and the margin condition. In particular, they derive fast rates of convergence for plug-in classifiers based on local polynomial estimators (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007) and show their optimality in the minimax sense. One of the aims of the present work is to extend these results to the confidence set classification framework.
Another part of our work is to provide a comparison between supervised and semi-supervised procedures. Semi-supervised methods are studied in several papers (Vapnik, 1998; Rigollet, 2007; Singh, Nowak and Zhu, 2009; Bellec et al., 2018) and references therein. A simple intuition can be provided on whether or not one should expect a superior performance of the semi-supervised approach. Imagine a situation where the unlabeled sample is so large that one can approximate up to any desired precision. Then, if the optimal decision is independent of , the semi-supervised estimators are not to be considered superior to the supervised ones. This is the case in a lot of classical problems of statistics, where the inference is solely governed by the behavior of the conditional distribution (for instance regression or binary classification). The situation might be different once the optimal decision relies on the marginal distribution . In this case, as suggested by our findings, the semi-supervised approach might or might not outperform the supervised one even in the context of the same problem. Similar conclusions were stated by Singh, Nowak and Zhu (2009) in the context of learning under the cluster assumption (Rigollet, 2007).
1.5 Main contributions
Below we summarize our contributions.
- Our results focus on the case where the regression function belongs to a Hölder class and satisfies the margin condition. Under these assumptions, we establish lower bounds on the minimax rates, defined in Section 1.3, in the confidence set framework.
- As important consequences of our results, we first show that top-$\beta$ type procedures are in general inconsistent. Furthermore, by providing a rigorous definition of the semi-supervised and supervised estimators, we describe the situations when the semi-supervised estimation should be considered superior to its supervised counterpart. Interestingly, our analysis suggests that these regimes are governed by the interplay of the family of distributions and the considered measure of performance. Besides, we show that in our settings supervised procedures cannot achieve fast rates, that is, their rate cannot be faster than $n^{-1/2}$. In contrast, some other classical settings (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009; Herbei and Wegkamp, 2006) allow supervised methods to achieve faster rates.
- We provide supervised and semi-supervised estimation procedures, which are optimal or optimal up to an extra logarithmic factor. Importantly, our results show that the semi-supervised plug-in procedure based on local polynomial estimators can achieve fast rates, provided that the size of the unlabeled sample is large enough.
- Finally, we perform a numerical evaluation of the proposed plug-in algorithms against their top-$\beta$ counterparts. This part supports our theoretical results and empirically demonstrates the reason to consider more involved procedures.
1.6 Organization of the paper
The paper is organized as follows. In Section 2, we introduce some additional notation and the family of distributions that we consider. Section 3 is devoted to the lower bounds on the minimax rates and their implications. In Section 4 we introduce the proposed algorithm, establish upper bounds for it, and evaluate its numerical performance. We conclude this paper with Sections 5 and 6, where we discuss and sum up our results.
2 Class of confidence sets
First let us introduce some generic notation that is used throughout this work. For two numbers $a, b \in \mathbb{R}$ we denote by $a \vee b$ (resp. $a \wedge b$) the maximum (resp. minimum) between $a$ and $b$. For a positive real number $a$ we denote by $\lfloor a \rfloor$ (resp. $\lceil a \rceil$) the largest (resp. the smallest) non-negative integer that is less than or equal (resp. greater than or equal) to $a$. The standard Euclidean norm of a vector $x \in \mathbb{R}^d$ is denoted by $\|x\|$ and the standard Lebesgue measure is denoted by $\mathrm{Leb}(\cdot)$. A Euclidean ball centered at $x \in \mathbb{R}^d$ of radius $r > 0$ is denoted by $\mathcal{B}(x, r)$. For an arbitrary Borel measure $\mu$ on $\mathbb{R}^d$ that is absolutely continuous w.r.t. the Lebesgue measure we denote by $\mathrm{supp}(\mu)$ its support, that is, the set where the Radon-Nikodym derivative of $\mu$ w.r.t. $\mathrm{Leb}$ is strictly positive. For a vector function $h$ and a Borel measure $\mu$ on $\mathbb{R}^d$ we define the infinity norm of $h$ as
[TABLE]
In this work or its lower-cased versions always refer to some constants which might differ from line to line. Importantly, all these constants are independent of but could depend on and other parameters which are assumed to be fixed. Before introducing the families of distributions that are considered in this work we need the following definitions.
Assumption 2.1 ($\alpha$-margin assumption).
We say that the distribution of the pair satisfies -margin assumption if there exists and such that for every positive
[TABLE]
Let us point out an important consequence of Assumption 1.1. We have that the condition
[TABLE]
for all is equivalent to Assumption 2.1. Indeed, since the random variables ’s cannot concentrate at a constant level, in particular at . Moreover, again due to the continuity Assumption 1.1 we have
[TABLE]
thus the -margin Assumption 2.1 specifies the rate of this convergence. Finally, the restriction of the range of to in -margin Assumption 2.1 does not affect its global behavior as for all
[TABLE]
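A margin-type condition of this kind can be probed numerically. The sketch below assumes a common form of the condition, namely that the mass of points whose score lies within distance $t$ of the threshold (excluding exact ties) is bounded by a constant times a power of $t$; the precise statement is elided in this text, so this form, the toy uniform scores and the name `margin_mass` are all assumptions. It estimates the exponent by a log-log fit.

```python
import numpy as np

def margin_mass(scores, theta, t):
    """Fraction of points whose score is within distance t of theta, ties excluded."""
    gap = np.abs(scores - theta)
    return float(np.mean((gap > 0) & (gap <= t)))

# Toy choice: the score of one class is uniform on (0, 1), threshold at 0.5.
m = 10_000
scores = (np.arange(m) + 0.5) / m
theta = 0.5

ts = np.array([0.01, 0.02, 0.05, 0.1, 0.2])
masses = np.array([margin_mass(scores, theta, t) for t in ts])

# Slope of log(mass) against log(t): an estimate of the margin exponent.
alpha_hat = np.polyfit(np.log(ts), np.log(masses), 1)[0]
print(masses, alpha_hat)
```

For uniform scores the mass grows linearly in $t$, so the fitted exponent is close to one; distributions that put less mass near the threshold would give a larger exponent.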
Let and be two positive constants. We say that a Borel set is a -regular set if
[TABLE]
Definition 2.2 (Strong density).
We say that the probability measure on satisfies the -strong density assumption if it is supported on a compact -regular set and has a density w.r.t. the Lebesgue measure such that for all and
[TABLE]
Definition 2.3 (Hölder class, Tsybakov (2008)).
We say that a function is -Hölder for and if is times continuously differentiable and we have
[TABLE]
where is the Taylor polynomial of degree of at the point . Consequently, the set of all functions from to satisfying the above conditions is called -Hölder and is denoted by .
Definition 2.4.
We denote by a set of joint distributions on which satisfies the following conditions
- the marginal satisfies the -strong density,
- for all the regression function belongs to the -Hölder class, that is for all ,
- for all the regression function satisfies the -margin assumption,
- for all , the cumulative distribution function of is continuous.
The family of distributions is similar to the one considered in (Audibert and Tsybakov, 2007) in the context of binary classification. The only major difference is the continuity Assumption 1.1, which prevents a straightforward reuse of their construction for the lower bounds.
3 Lower bounds
The main results in the present work are the lower bounds we provide in this section. In particular, we establish in Section 3.1 the inconsistency of top- procedures (see Eq. (1.3) for a definition of the method). Therefore more elaborate methods are required in this framework. As pointed out in the introduction, we distinguish two types of estimators: supervised and semi-supervised for which we provide lower bounds in Section 3.2. The obtained rates highlight the benefit of the semi-supervised approach in the context of the confidence set classification.
Before considering the lower bounds, let us first display the connection between the different minimax rates. Such links are used in the proofs of the lower bounds.
Proposition 3.1.
Let be a measurable function from to , and assume that Assumption 1.1 is fulfilled, then
[TABLE]
Furthermore, if additionally Assumption 2.1 is satisfied with , then there exist which depends only on such that for any pair of confidence set classifiers it holds that
[TABLE]
Proposition 3.2.
For any , and the following relation between minimax rates holds:
[TABLE]
Proposition 3.1, and in particular Eq. (3.1) gives an easy way to establish a lower bound on via a lower bound on the Hamming distance . However, this approach does not allow to get (resp. ) part of the rate in the lower bound of (resp. ). Besides, Proposition 3.2 allows to prove a lower bound on the discrepancy with the correct rate via the lower bound on the excess risk .
3.1 Inconsistency of the top- procedure
Before stating our results on the supervised and the semi-supervised estimators, we discuss another interesting class of confidence sets, which might be a natural choice at first sight. We consider estimators which consist of $\beta$ classes at every point since such estimators naturally satisfy . Let us denote by the set of all estimators such that for all , that is,
[TABLE]
Despite an obvious restriction on the cardinal of the confidence sets, the family of estimators is rather broad. Indeed, every procedure which estimates the regression functions 's and outputs the labels with the top scores is included in . The nature of the estimator can also be different, that is, the estimates could be based on ERM, non-parametric or parametric approaches. Clearly, the family is neither included in nor in and has a non-trivial intersection with both. The next result states that there is no uniformly consistent estimator over the family of distributions .
Proposition 3.3.
Assume that , and , then for all we have
[TABLE]
The proof builds an explicit construction of a distribution whose -Oracle satisfies for all in some with . Indeed, if such a distribution exists then there is no estimator in that would consistently estimate this -Oracle. The negative result established in Proposition 3.3 is rather instructive by itself as it advocates that a more involved estimation procedure ought to be constructed.
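The mechanism behind this negative result can be seen on a toy distribution. The sketch below is not the construction used in the proof and does not claim to satisfy all assumptions of the proposition; it is a hypothetical three-class example in which half of the feature space has one dominant class and the other half has three nearly equiprobable classes. With target size 2, the thresholded oracle keeps one label on the first half and three on the second, so even the fixed-size sets built from the true class probabilities stay far from it.

```python
import numpy as np

def oracle_threshold(probs, beta):
    """Smallest t such that the average number of probabilities above t is <= beta."""
    m, n_classes = probs.shape
    rank = int(beta * m)
    if rank >= m * n_classes:
        return 0.0
    return float(np.sort(probs, axis=None)[::-1][rank])

def top_beta_sets(scores, beta):
    """Keep exactly beta labels at every point."""
    order = np.argsort(-scores, axis=1, kind="stable")[:, :beta]
    sets = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(sets, order, True, axis=1)
    return sets

# Toy distribution (assumption): X on a regular grid of (0, 1), K = 3, beta = 2.
m, beta = 2000, 2
x = (np.arange(m) + 0.5) / m
peaked = x < 0.5                       # one dominant class on this half
u = np.clip(x - 0.5, 0.0, None)        # offset inside the nearly uniform half
p1 = np.where(peaked, 0.9 - 0.2 * x, 0.34 + 0.02 * u)
p2 = np.where(peaked, 0.6 * (1 - p1), 0.33 + 0.01 * u)
p3 = 1.0 - p1 - p2
probs = np.column_stack([p1, p2, p3])

theta = oracle_threshold(probs, beta)
oracle_sets = probs >= theta
fixed_size_sets = top_beta_sets(probs, beta)   # uses the TRUE probabilities

share_with_other_size = float(np.mean(oracle_sets.sum(axis=1) != beta))
hamming = float((fixed_size_sets != oracle_sets).sum(axis=1).mean())
print(theta, oracle_sets.sum(axis=1).mean(), share_with_other_size, hamming)
```

Since the fixed-size sets here are built from the true probabilities, more data cannot shrink the gap: the restriction to exactly two labels per point is what keeps them away from the oracle.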
3.2 Supervised vs semi-supervised estimation
Clearly, estimators which achieve the infimum in the minimax rates are either supervised or semi-supervised, thus a lower bound on together with a lower bound on yield a lower bound on . However, a lower bound on does not discriminate between the supervised and the semi-supervised estimators.
Theorem 3.4 (Supervised estimation).
Let , . If , then there exist constants such that for all
[TABLE]
Based on this result we observe that the lower bound for the Hamming risk is slower than those for the other risks. It is even more significant that the best rate that a supervised estimator can achieve for all of the risks is $n^{-1/2}$ even if the margin assumption holds. This is the major difference with the classical settings where the value of the threshold is known (such as classification and level set estimation). Indeed, under the same assumptions on the family of distributions, besides the continuity Assumption 1.1, the minimax rate in those frameworks is as proved for instance in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009). The next theorem deals with semi-supervised procedures and displays a different behavior.
Theorem 3.5 (Semi-supervised estimation).
Let , . If , then there exist constants such that for all
[TABLE]
First, observe that the lower bound for the Hamming distance is, as in the supervised setting, worse than for the other measures of performance. However there is a major difference with the supervised case: as compared to Theorem 3.4, it is possible for a semi-supervised estimator to achieve rates that are faster than $n^{-1/2}$ if the size of the unlabeled dataset is large enough. In particular, when we consider or the following relations are necessary to get fast rates
[TABLE]
In this case, we recover the same fast rates as in the classical settings of classification and level set estimation. It suggests that the lack of knowledge of the threshold does not alter the quality of estimation for the semi-supervised procedure, provided that is sufficiently large. Next corollary makes these observations clearer.
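The role of the unlabeled sample in estimating the threshold can be illustrated by simulation. The sketch below isolates the threshold step by assuming the class probabilities are known exactly, which is a simplification and not the setting of the theorems; the three-class toy model is hypothetical, and its population threshold for target size 1 equals 7/17 by a direct hand computation for this model. The code compares the average threshold error for a small and a large calibration sample of features.

```python
import numpy as np

def plug_in_threshold(scores, beta):
    """Smallest t such that the average number of scores above t is <= beta."""
    m, n_classes = scores.shape
    rank = int(beta * m)
    if rank >= m * n_classes:
        return 0.0
    return float(np.sort(scores, axis=None)[::-1][rank])

def true_scores(x):
    """Toy class probabilities as a function of a feature x in (0, 1) (assumption)."""
    p1 = 0.1 + 0.8 * x
    return np.column_stack([p1, 0.7 * (1 - p1), 0.3 * (1 - p1)])

rng = np.random.default_rng(1)
beta = 1
theta_star = 7.0 / 17.0   # population threshold of the toy model for beta = 1

def mean_abs_error(sample_size, repetitions=200):
    """Average |estimated threshold - theta_star| over independent feature samples."""
    errors = []
    for _ in range(repetitions):
        features = rng.uniform(size=sample_size)   # no labels are used
        errors.append(abs(plug_in_threshold(true_scores(features), beta) - theta_star))
    return float(np.mean(errors))

err_small = mean_abs_error(50)
err_large = mean_abs_error(5000)
print(err_small, err_large)
```

Because this step uses features only, enlarging the calibration sample requires no additional labels, which is the source of the potential gain of semi-supervised procedures in this problem.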
Corollary 3.6.
Assume that the rates in Theorem 3.5 (resp. Theorem 3.4) are minimax, that is, there exists a confidence set (resp. ) that achieves these rates. Regarding and the following conclusions hold
- There is no semi-supervised estimator that achieves faster rate than if:
[TABLE]
- The rate of is faster than the rate of any supervised estimator if:
[TABLE]
Moreover, if there exists such that , then the rate of is polynomially faster than .
- The rate of is fast similarly to the classical frameworks if
[TABLE]
Clearly, a similar observation holds for the Hamming risk ; however, the regime in which improvement is possible thanks to semi-supervised approaches is narrowed as . We summarize Corollary 3.6 in Table 3.2.
Essentially, the above results suggest that the advantage of the semi-supervised approaches over the supervised ones depends not only on the underlying family of distributions but also on the metric that is considered. Yet, the necessary and sufficient conditions that must be imposed in general on the problem and the metric so that semi-supervised estimation provably improves upon supervised estimation remain an open problem.
A final remark before going further concerns the assumption on the parameters and . The condition in the lower bounds is slightly more restrictive than the conditions given in (Audibert and Tsybakov, 2007) (they have ). We believe that this is an artifact of our proof and could be avoided with a finer choice of hypotheses. Simple modifications of the lower bound of Audibert and Tsybakov (2007) do not work in our settings because their hypotheses do not satisfy Assumption 1.1. In contrast, the construction of Rigollet and Vert (2009), modified properly to fit the classification framework, satisfies Assumption 1.1, but their lower bound is limited by the condition , that is, it does not cover the fast rates as long as the dimension .
3.3 Sketch of the proof
In order to prove the lower bounds of Theorems 3.4, 3.5 we actually prove two separate lower bounds on the minimax rates. The two lower bounds that we prove are naturally connected with the proposed two-step estimator in Eq. (1.5), that is, the first lower bound is connected with the problem of non-parametric estimation of for all and the second describes the estimation of the unknown threshold .
In particular, the first lower bound is closely related to the one provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009); however, the continuity Assumption 1.1 makes the proof more involved and results in a final construction of hypotheses that differs significantly. This part of our lower bound relies on Fano’s inequality in the form of Birgé (2005). The second lower bound is based on two-hypothesis testing and is derived by constructing two different marginal distributions of which are sufficiently close and a fixed regression function . Crucially, these marginal distributions admit two different values of the threshold and thus two different -Oracles. In this part we make use of Pinsker’s inequality, see for instance (Tsybakov, 2008).
In order to discriminate between the supervised and the semi-supervised procedures we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not “sensitive” to the expectation taken *w.r.t.* the unlabeled dataset , that is, randomness is only induced by the labeled dataset . This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset for supervised procedures. Informally, the lower bound on is obtained from the lower bound on by setting .
4 Upper bounds
In this section, we show that we can build confidence set estimators that achieve, up to a logarithmic factor, the lower bounds stated in Theorems 3.4-3.5. In other words, those estimators are nearly optimal in the minimax sense. To come straight to the point, we delay the construction of the estimators to Section 4.1 and their properties to Section 4.2, and focus right now on their upper bounds.
**Theorem 4.1 (Supervised estimation).**
Let , , then there exists a supervised estimator and a constant such that for all we have
[TABLE]
**Theorem 4.2 (Semi-supervised estimation).**
Let , , then there exists a semi-supervised estimator and constants such that for all we have
[TABLE]
We show here that the lower bounds of Theorems 3.4-3.5 are achievable. In particular, in the case of the Hamming risk, the upper bounds are optimal, whereas for the Excess risk and the Discrepancy, the upper bounds match the lower bounds up to a logarithmic factor. Thus, the comments we made in Corollary 3.6 are correct. Let us mention that the presence of the logarithmic factor in these upper bounds is due to sup-norm estimation (see Lemma 4.5).
The Hamming risk as a measure of performance was considered in the settings of Sadinle, Lei and Wasserman (2018). They also establish upper bounds for this measure, though they do not assess their optimality. Besides, as we already mentioned, Denis and Hebiri (2017) provide an upper bound on the excess risk in the context of ERM. Let us point out that a direct comparison with these two works would not be fair, since the assumptions and even the frameworks under which the results are formulated differ.
4.1 Construction of the estimators
Building estimators and that reach the rates in the former upper bounds involves preliminary estimators of the regression functions , . These estimators are constructed using an arbitrary half of the labeled dataset and they satisfy the following assumptions.
**Assumption 4.3 (Exponential concentration).**
There exist estimators for all based on and positive constants such that for all and all we have for all
[TABLE]
*for almost all w.r.t. . *
**Assumption 4.4 (Continuity of CDF).**
For all the cumulative distribution function of is almost surely continuous on .
First, let us point out that Assumption 4.3 implies that there exists a constant such that for all and all
[TABLE]
Assumption 4.3 is commonly used in the statistical community when dealing with rates of convergence in the classification settings (Audibert and Tsybakov, 2007; Lei, 2014; Sadinle, Lei and Wasserman, 2018). It is for instance satisfied by the locally polynomial estimator (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007). Assumption 4.4 can always be satisfied by slightly processing any estimator . Indeed, assume Assumption 4.4 fails to be satisfied by some estimator . It means that there exists a subset of of non-zero measure such that at least one , with , is constant on this set. Then, if we add to it a deterministic continuous function of sufficiently bounded variation, such regions can no longer exist. Here it is sufficient to make sure that adding the function preserves the statistical properties of the estimator, that is, Assumption 4.3.
Since the threshold level is not known beforehand, it ought to be estimated using data. A straightforward estimator of this threshold can be constructed using the unlabeled dataset . To make our presentation mathematically correct we introduce the following notation , where is the dataset used to build the estimators for . Now, all the labels are removed from , that is, it consists of *i.i.d.* samples from . The supervised and semi-supervised estimators of are defined as
[TABLE]
respectively. Finally, we are in position to define and as
[TABLE]
for all . Note that is clearly supervised in the sense of Definition 1.4, as it is independent of the unlabeled sample . In contrast, is semi-supervised, since we can find two samples and which induce different confidence sets. To show that the estimators introduced in this section satisfy the statements of Theorems 4.1-4.2 we refine the proof technique used in (Denis and Hebiri, 2017). That is, we introduce an intermediate quantity
[TABLE]
and the associated confidence set, which we refer to as the pseudo Oracle confidence set given for all by
[TABLE]
The confidence set assumes knowledge of the marginal distribution and is seen as an idealized version of both and . Note, however, that the pseudo Oracle is not an estimator.
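As a rough illustration of the two-step construction described in this section, the following Python sketch computes an empirical threshold from scores evaluated on unlabeled points and then forms the plug-in confidence sets by thresholding. It assumes that the threshold is the generalized inverse, at the target expected size, of the empirical tail sum of the scores, in the spirit of Proposition A.4. The function names, the argument `level`, and the Dirichlet scores standing in for the estimated regression functions are hypothetical and are not taken from the paper.

```python
import numpy as np

def estimate_threshold(scores, level):
    """Generalized inverse of the empirical tail sum at `level`.

    scores: (N, K) array; row i holds estimated class probabilities at the
            i-th unlabeled point (hypothetical input format).
    Returns inf{t : G_hat(t) <= level}, where
    G_hat(t) = (1/N) * #{(i, k) : scores[i, k] > t}.
    """
    n = scores.shape[0]
    flat = np.sort(scores, axis=None)[::-1]   # all N*K scores, descending
    m = int(np.floor(level * n))              # at most m scores may exceed t
    if m >= flat.size:                        # level >= K: every label is kept
        return 0.0
    return float(flat[m])

def confidence_sets(scores, threshold):
    """Boolean (N, K) mask: label k is kept at point i iff its score >= threshold."""
    return scores >= threshold

rng = np.random.default_rng(0)
# Stand-in for the estimated regression functions evaluated on unlabeled points.
unlabeled_scores = rng.dirichlet(np.ones(5), size=1000)
t_hat = estimate_threshold(unlabeled_scores, level=2.0)
sets = confidence_sets(unlabeled_scores, t_hat)
print(t_hat, sets.sum(axis=1).mean())  # the average set size is close to the level
```

Whether the final inequality is strict or not is immaterial when the scores have no ties, which is the role played by Assumption 4.4; on a finite sample the two choices change the average set size by at most 1/N.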
4.2 Properties of the plug-in confidence sets
An important step of our analysis is the following lemma, that bounds the difference between and .
**Lemma 4.5 (Upper bound on the thresholds).**
Let Assumption 1.1 be satisfied, then for all
[TABLE]
The proof of Lemma 4.5 uses elementary properties of the generalized inverse functions, which are provided in the Appendix. Besides, let us mention that the difference resembles the Wasserstein infinity distance, which gives an alternative approach to prove Lemma 4.5, see (Bobkov and Ledoux, 2016). Lemma 4.5 explains the extra factor that appears in the upper bound, as the minimax estimation in sup norm contains the factor, see for instance (Stone, 1982; Tsybakov, 2008). Another important property of the introduced estimators and is obtained via Assumption 4.4. It describes the deviation of the information of and from the desired level .
**Proposition 4.6 (Denis and Hebiri (2017)).**
Let for all be arbitrary estimators of the regression functions constructed using that satisfy Assumption 4.4, then there exist constants such that for all it holds that
[TABLE]
Note that if satisfies Assumption 4.4 for all , then . This simple fact is a step in the proof of Proposition 4.6. Finally, a combination of Lemma 4.5, Proposition 4.6, and Assumption 4.3 with the peeling argument used in (Audibert and Tsybakov, 2007, Lemma 3.1) yields the results of Theorems 4.1-4.2.
4.3 Simulation study
The goal of this part is to numerically address the following points.
- 1) Is it more advantageous to go outside of the classical multi-class classification settings and consider the confidence set framework? To answer this question, we compute the Bayes optimal multi-class classifier and view it as a confidence set with one label. We compare this Bayes rule with the -Oracle in terms of the error using various values of and .
- 2) How does the -Oracle confidence set compare to another “Oracle” (- Oracle) which simply includes classes corresponding to the largest values of ’s?
- 3) Does the proposed plug-in approach indeed give a good approximation of the -Oracle through the error and the information ?
- 4) Despite demonstrating the minimax inconsistency of the top- approach, we wonder whether in some scenarios it can achieve performance comparable to that of our semi-supervised plug-in procedure.
We consider two simulation schemes depending on the parameter . For each , we generate according to a mixture model. More precisely,
- i) the label follows the uniform distribution on ;
- ii) conditional on , the feature is generated according to a multivariate Gaussian distribution with mean and identity covariance matrix.
For each , the vectors are *i.i.d.* realizations of the uniform distribution on . For this distribution, we have
[TABLE]
where for each , is the density function of a multivariate Gaussian distribution with mean parameter and identity covariance matrix.
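The simulation scheme above can be sketched in Python as follows. Since the labels are uniform, the regression function is the normalized vector of Gaussian densities, which the code evaluates in log scale for numerical stability. The number of classes, the dimension, the range of the uniform distribution for the mean vectors, and all function names are hypothetical choices made for illustration; the values used in the experiments are not recoverable here.

```python
import numpy as np

def sample_mixture(n, means, rng):
    """Draw n pairs (X, Y): Y uniform on the K classes, X | Y=k ~ N(means[k], I)."""
    num_classes, dim = means.shape
    y = rng.integers(0, num_classes, size=n)
    x = means[y] + rng.standard_normal((n, dim))
    return x, y

def regression_function(x, means):
    """p_k(x) = phi_k(x) / sum_j phi_j(x), valid for equal class weights."""
    sq_dist = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_phi = -0.5 * sq_dist                       # common constant dropped
    log_phi -= log_phi.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(log_phi)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, d = 10, 20                                  # hypothetical sizes
means = rng.uniform(0.0, 1.0, size=(K, d))     # hypothetical range for the means
x, y = sample_mixture(2000, means, rng)
p = regression_function(x, means)
bayes_error = np.mean(p.argmax(axis=1) != y)   # Monte Carlo error of the Bayes rule
print(bayes_error)
```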
For each , the misclassification error of the classical multi-class classification Bayes rule is evaluated based on a sufficiently large dataset. It is valued at and at for and for respectively. These values are relatively high, which suggests that confusion is induced by the large number of classes. Hence, it is reasonable to apply the confidence set approach to this problem.
In the sequel, we aim at providing the estimation of the error of the -Oracle. To this end, for and each , we repeat times the following steps.
- i) simulate two datasets and with and ;
- ii) based on , we compute the empirical counterpart of and provide an approximation of the -Oracle given in Eq. (1.1) (we recall that this step requires a dataset which contains only unlabeled features);
- iii) finally, over , we compute the empirical counterparts (of ) and (of ).
From these estimates, we compute the mean and the standard deviation of and . Tables 2 and 3 present values of the error and of the information which are achieved by the -Oracle and by the - Oracle.
We now move towards the construction of our semi-supervised plug-in estimators . For each and each , we evaluate the performance of according to three different estimations of the regression function: the ’s are based on random forests, softmax regression and deep learning procedures. Let us point out that, for the random forests and softmax regression algorithms, the random variables appear not to be continuous. Hence Assumption 4.4 is violated. To alleviate this issue, we add to an independent small perturbation for simplicity. The evaluation of the performance of relies on the following steps:
- i) simulate three datasets , and ;
- ii) based on , we compute the estimators of according to the considered procedure;
- iii) based on and we compute the function and the estimator as in Eq. (1.5) (we recall that this step requires a dataset which contains only unlabeled features);
- iv) finally, we compute over the empirical counterpart of and of for the considered .
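The steps above can be sketched end to end in Python. This is only a minimal illustration under stated assumptions: a Gaussian plug-in based on empirical class means stands in for the random forest, softmax regression and deep learning estimators used in the experiments; the error is taken to be the empirical frequency of the label falling outside the set and the information to be the empirical mean set size; the perturbation is uniform noise of size 1e-10; and the threshold is computed from the unlabeled features only. All names and numerical values are hypothetical.

```python
import numpy as np

def gaussian_scores(x, means):
    """Posterior-like scores under equal class weights and identity covariance."""
    sq = ((x[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True)

def threshold_at(scores, level):
    """inf{t : average number of scores strictly above t is <= level}."""
    flat = np.sort(scores, axis=None)[::-1]
    m = int(np.floor(level * scores.shape[0]))
    return 0.0 if m >= flat.size else float(flat[m])

def error_and_information(scores, labels, threshold):
    """Empirical frequency of {Y not in set} and empirical mean set size."""
    sets = scores >= threshold
    covered = sets[np.arange(labels.size), labels]
    return 1.0 - covered.mean(), sets.sum(axis=1).mean()

rng = np.random.default_rng(1)
K, d, level = 10, 20, 2.0                      # hypothetical values
true_means = rng.uniform(0.0, 1.0, size=(K, d))

def draw(n):
    y = rng.integers(0, K, size=n)
    return true_means[y] + rng.standard_normal((n, d)), y

# i) labeled, unlabeled and test samples
x_lab, y_lab = draw(1000)
x_unl, _ = draw(1000)
x_test, y_test = draw(1000)

# ii) estimate the regression function (stand-in: plug-in of empirical class means)
est_means = np.stack([x_lab[y_lab == k].mean(axis=0) for k in range(K)])

def p_hat(x):
    s = gaussian_scores(x, est_means)
    return s + rng.uniform(0.0, 1e-10, size=s.shape)  # tiny tie-breaking noise

# iii) threshold computed from unlabeled features only
t_hat = threshold_at(p_hat(x_unl), level)

# iv) empirical error and information on the test sample
error, information = error_and_information(p_hat(x_test), y_test, t_hat)
top1_error = np.mean(p_hat(x_test).argmax(axis=1) != y_test)
print(error, information, top1_error)
```

For this particular stand-in the scores are already continuous, so the perturbation is not needed; it is kept only to mirror the remedy described for estimators with discrete outputs.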
Again, during these experiments, we compute means and standard deviations. The parameters are fixed as follows: for , we fix and ; for we fix and . Finally, the size of is fixed to . The results are illustrated in Tables 4 and 5.
As a benchmark for the remainder of our experiments, the classical misclassification errors of the multi-class classifiers based on random forests, softmax regression and deep learning methods are valued respectively at , , for , and at , for .
Turning to Table 2, we confirm the intuition that the error of the -Oracle decreases as the value of the parameter increases. Nevertheless, for moderate values of , compared to , we obtain a satisfactory improvement over the standard multi-class classification Bayes rule. For instance, when and the error of the -Oracle confidence set is , whereas the Bayes classifier has ; likewise, when and the classification error decreases from to . Table 2 shows that the - Oracle is slightly outperformed by the -Oracle in terms of the error, but still performs well.
From Tables 3 and 5, we observe that the approximation of the information is reasonably good and it gets better with the number of unlabeled data. Besides, Tables 2 and 4 demonstrate that our algorithm is sensitive to the choice of the underlying estimator . Indeed, when is estimated via the softmax regression, our algorithm fails to give a good approximation to the error of the -Oracle.
Table 4 provides similar conclusions regarding , though, unlike the theoretical quantities, there are more scenarios where our method is better than its - counterpart. Let us point out that, for , methods based on the softmax regression perform poorly in this setup.
5 Discussions
5.1 Around continuity Assumption 1.1
The bedrock of this paper is Assumption 1.1. Based on it, we ensure that the -Oracle confidence set given by Eq. (1.1) is indeed of information . On top of that, the explicit formulation of excess risk in Proposition 3.1 relies on the continuity of function . Should Assumption 1.1 fail to be satisfied, then there might be no -Oracle given by thresholding on some level . Indeed, assume Assumption 1.1 is not satisfied but one can build a -Oracle having the form with some , then
[TABLE]
However, without the continuity, the function is not surjective and therefore, the equation may have no solution, which contradicts the fact that . Therefore, the settings without the continuity of deserve a separate study. Let us also point out that the continuity assumption implies that the -Oracle can also be defined as
[TABLE]
where the inequality is used in place of the equality. Indeed, under the continuity assumption, thanks to Propositions 1.3 and 3.1, we have for all confidence sets such that
[TABLE]
which implies that the -Oracle is a minimizer.
5.2 Around Lipschitz continuity of
Under the assumptions needed in this work, and in particular the continuity assumption, we showed two important facts: i) no supervised approach can achieve fast rates, that is, faster than $n^{-1/2}$; ii) some semi-supervised approaches can achieve fast rates.
One might wonder whether extra assumptions on the problem allow a supervised method to get faster rates than $n^{-1/2}$. We give a partial answer to this question following the recent work of Bobkov and Ledoux (2016), and more precisely their Theorem 5.11. Applying this result to our framework, we can state that there exists a positive constant such that
[TABLE]
where is the Lipschitz constant of and is the generalized inverse of
[TABLE]
If, on top of the above, one can show that for any and for some positive constant
[TABLE]
then under Lipschitz continuity of , we can prove that
[TABLE]
where stands for or . This would illustrate that both and are statistically equivalent under Lipschitz condition on , that is, both reach the same rate and the impact of the unlabeled data is negligible. We plan to further investigate the influence of this Lipschitz condition on the minimax rates of convergence in our future works. Since in the present contribution we do not impose this assumption on , the upper bound of Bobkov and Ledoux (2016) is not applicable and we had to rely on a different approach.
5.3 Around extra logarithm
Theorems 3.5 and 4.2 demonstrate that for the excess risk and the Discrepancy, the upper and the lower bounds differ by a logarithmic factor. As we have already pointed out, this factor appears in the upper bounds due to Lemma 4.5, which relates the difference between two thresholds to the infinity norm. One might hope that if we manage to replace the infinity norm by any other -norm on the right hand side of the inequality in Lemma 4.5, this logarithm can be eliminated. Unfortunately, it appears that this bound is actually tight, in the sense that one can construct a distribution and an estimator for all such that equality is achieved in Lemma 4.5. These arguments suggest that the obtained upper bound should be optimal. They also imply that the lower bounds could be further refined to get an extra logarithmic factor. Let us also mention that the continuity Assumption 1.1 in combination with the margin Assumption 2.1 are the main obstacles that did not allow us to provide better lower bounds. Nevertheless, our proofs are already involved and our results allow us to draw non-trivial conclusions even without going into the details concerning the logarithms.
6 Conclusion
In this work we have studied the minimax settings of confidence set multi-class classification. First of all, following previous works, we have shown that a top- type procedure is inconsistent in our settings and that more involved estimators should be proposed. Besides, we have demonstrated that no supervised estimator can achieve rates that are faster than $n^{-1/2}$, which stands in contrast with other classical settings. Additionally, we have shown that fast rates are achievable for semi-supervised techniques provided that the size of the unlabeled sample is large enough. Finally, we have established that our lower bounds are either optimal or nearly optimal by providing supervised and semi-supervised estimators which are tractable in practice. Our future work will focus on the Lipschitz condition on discussed in Section 5.2; in particular, we want to understand how this extra assumption affects our lower bounds.
Appendix A Technical results
Here we provide proofs of our results. The appendix is organized as follows: in Appendix A we introduce some technical results used in our proofs; Appendix B is devoted to the proofs of the upper bounds; Appendix C provides the proofs of our main lower bounds; finally, in Appendix D we prove the inconsistency of top- approaches.
In this section we gather several technical results which are used to derive the contributions of this work. Let us start by introducing notation used in the appendix. Given any two probability measures on some measurable space , the Kullback–Leibler divergence between and is defined as
[TABLE]
and the total variation distance is defined as
[TABLE]
We start with Fano’s inequality in the form proved by Birgé (2005).
**Lemma A.1 (Fano’s inequality (Birgé, 2005)).**
Let be a finite family of probability measures on and let be a finite family of disjoint events such that for each . Then,
[TABLE]
**Lemma A.2 (Pinsker’s inequality).**
Given any two probability measures on some measurable space we have
[TABLE]
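As a quick sanity check of how Pinsker's inequality is used, the following Python sketch evaluates both sides for two discrete distributions. It assumes the textbook form, in which the total variation distance is bounded by the square root of half the Kullback–Leibler divergence, with the total variation taken as half the L1 distance between the probability vectors; the exact normalization used in Lemma A.2 is not recoverable here. The function names and the two probability vectors are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence between discrete distributions (natural log)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.any((q == 0) & (p > 0)):      # p not absolutely continuous w.r.t. q
        return float("inf")
    mask = p > 0                        # terms with p = 0 contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """Total variation distance, equal to half the L1 distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
tv, kl = total_variation(p, q), kl_divergence(p, q)
print(tv, np.sqrt(kl / 2.0))  # the first value does not exceed the second
```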
**Lemma A.3 (Hoeffding’s inequality (Hoeffding, 1963)).**
Let be a real number, and be a positive integer. Let be random variables having values in , then
[TABLE]
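For intuition, the following Monte Carlo sketch compares an empirical tail probability with the classical two-sided Hoeffding bound for the mean of independent variables taking values in [0, 1]. The form of the bound used here is the textbook one and is an assumption, since the exact statement of Lemma A.3 is not recoverable in this text; the sample size, the deviation level and the number of repetitions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, repetitions = 100, 0.125, 20000                 # hypothetical values
samples = rng.integers(0, 2, size=(repetitions, n))   # Bernoulli(1/2), values in [0, 1]
deviations = np.abs(samples.mean(axis=1) - 0.5)       # |empirical mean - expectation|
empirical_tail = np.mean(deviations >= t)
hoeffding_bound = 2.0 * np.exp(-2.0 * n * t ** 2)     # textbook two-sided bound
print(empirical_tail, hoeffding_bound)
```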
**Proposition A.4 (Properties of the generalized inverse).**
Let and be a Borel measure on , let be a vector function, we define for all and all
[TABLE]
Then,
- for all and we have .
- if for all the mappings are continuous on , then
  - for all we have .
The next result is an analogue of the classical inverse transform theorem (van der Vaart, 1998, Lemma 21.1) and was already established by Denis and Hebiri (2017).
**Lemma A.5.**
Let be distributed according to the uniform distribution on , and let be real-valued random variables independent of , such that the function defined as
[TABLE]
is continuous. Consider random variable and let be distributed according to the uniform distribution on . Then
[TABLE]
where denotes the generalized inverse of .
Proof.
First we note that for every , . Moreover, we have
[TABLE]
To conclude the proof, we observe that
[TABLE]
∎
Appendix B Upper bounds
In this section we prove Theorems 4.1 and 4.2. It will be clear from our analysis that the proof of Theorem 4.1 follows directly from Theorem 4.2 by setting in the statement of Theorem 4.2. Thus, in this section for simplicity we omit the subscript from . Recall that our dataset consists of three parts . The set is used to construct an estimator of the regression function , that is, is independent from both . The other two sets are used in a semi-supervised manner to estimate the threshold, that is, we erase the labels from . Let , and also recall the definition of the proposed semi-supervised estimator for a given
[TABLE]
with satisfying Assumptions 4.3 and 4.4 for all . Moreover, is defined as the generalized inverse of
[TABLE]
where . Additionally, recall that the -Oracle is given as
[TABLE]
where is the generalized inverse of
[TABLE]
Lastly, let us re-introduce an idealized version of the proposed estimator which ’knows’ the marginal distribution of the feature vector as
[TABLE]
with , conditionally on the data. The following result is needed to relate the threshold of to the true value of the threshold .
**Lemma B.1 (Upper-bound on the thresholds).**
Let and be a Borel measure on . For two vector functions , we define
[TABLE]
If for all the mapping is continuous on , then for every
[TABLE]
Proof.
The proof of this result is very similar to the proof of (Bobkov and Ledoux, 2016, Theorem 2.12). We start by defining the following quantity
[TABLE]
Due to the definition of we have that for all
[TABLE]
that is, applying Proposition A.4 to the second inequality we get for all
[TABLE]
thus, for with thanks to Proposition A.4 we get
[TABLE]
The inequality is obtained in the same way. Thus, we have proved that
[TABLE]
Finally, notice that for all
[TABLE]
where we used the fact that for all
[TABLE]
and . Therefore by definition of , we can write and we conclude. ∎
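The statement of Lemma B.1 can be checked numerically on an empirical measure. The sketch below assumes that the lemma bounds the difference between the two generalized inverses by the largest sup-norm distance between the corresponding score functions, as the discussion in Section 5.3 suggests; the displayed inequality itself is not recoverable here. The helper name, the Dirichlet scores, the perturbation size and the tested levels are hypothetical.

```python
import numpy as np

def generalized_inverse(scores, level):
    """inf{t : (1/N) * #{(i, k) : scores[i, k] > t} <= level}."""
    flat = np.sort(scores, axis=None)[::-1]
    m = int(np.floor(level * scores.shape[0]))
    return 0.0 if m >= flat.size else float(flat[m])

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4), size=500)                 # stand-in for the true scores
delta = 0.05
p_hat = p + rng.uniform(-delta, delta, size=p.shape)    # perturbed scores
sup_dist = np.abs(p - p_hat).max()                      # max over classes and points

gaps = [abs(generalized_inverse(p, b) - generalized_inverse(p_hat, b))
        for b in (0.5, 1.0, 2.0, 3.0)]
print(max(gaps), sup_dist)  # the threshold gap stays below the sup-norm distance
```

The perturbed scores need not sum to one; only their sup-norm distance to the original scores matters for this comparison.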
We are now in a position to prove Theorem 4.2. Let us point out that the most difficult part of Theorem 4.2 is the upper bound on the excess risk. The upper bound on the discrepancy follows the same arguments as the ones we use for the excess risk.
Excess risk and discrepancy: to upper-bound the excess risk we first separate it into two parts as
[TABLE]
Recall that thanks to Proposition 3.1 we have
[TABLE]
Moreover, let us point out that if some then either
[TABLE]
holds. Thus on the event we have
[TABLE]
Therefore, for using Lemma B.1 and the observations above we can write
[TABLE]
finally, using the margin Assumption 2.1 we get, almost surely with respect to the data,
[TABLE]
Integrating both sides over the data and using Assumption 4.3 we get
[TABLE]
For the following trivial upper-bound holds
[TABLE]
now, thanks to the first property of Proposition A.4 we can write
[TABLE]
To finish our proof we make use of the peeling technique of (Audibert and Tsybakov, 2007, Lemma 3.1). That is, we define for and
[TABLE]
Since, for every , the events are mutually exclusive, we deduce
[TABLE]
Now, we consider uniformly distributed on independent of the data and . Conditional on the data and under Assumption 4.4, we apply Lemma A.5 with , and then obtain that is uniformly distributed on . Therefore, for all and , we deduce
[TABLE]
Hence, for all , we obtain
[TABLE]
Next, we observe that for all
[TABLE]
Thus, we obtain that
[TABLE]
almost surely with respect to the data. Integrating both sides with respect to the data we get
[TABLE]
recall that the function for all and is independent from , thus we can write
[TABLE]
Now, since conditional on , is an empirical mean of *i.i.d.* random variables of common mean , we deduce from Hoeffding’s inequality that
[TABLE]
Therefore, treating separately, we get from inequalities of Eqs. (B.3), (B.4), and (B.5)
[TABLE]
Finally, choosing in the above inequality, we finish the proof.
Hamming risk: here we provide an upper bound on the Hamming risk. First, by the triangle inequality we can write for the proposed estimator and the pseudo Oracle set
[TABLE]
Notice that for the term we can re-use the proof technique used for the term in Eq. (B). Thus, it remains to upper-bound the term . The proof of this part closely follows the machinery used in Denis and Hebiri (2017); however, let us mention that they used this method to obtain a bound on the Discrepancy, which leads to a sub-optimal rate. Nevertheless, their approach gives a correct rate if, instead of the Discrepancy, we bound the Hamming distance. For the sake of completeness we write the principal parts of the proof here.
First of all, by the definition of sets and we can write for
[TABLE]
Now if and we can have the following situations
- if , then ;
- if , then either or .
Similar conditions are satisfied if and . Using the above arguments we can upper-bound as
[TABLE]
Thanks to the continuity Assumption 4.4 on the estimator and the continuity Assumption 1.1 on the distribution we clearly have . Moreover, we can write
[TABLE]
Thus, our bound reads as
[TABLE]
Finally, in order to upper-bound the term above one can use the peeling argument of Audibert and Tsybakov (2007) applied with the exponential concentration inequality provided by Assumption 4.3. This part of the proof we omit here and refer the reader to Denis and Hebiri (2017) or to Audibert and Tsybakov (2007) for a complete result.
Let us emphasize that the argument above is only possible due to the continuity Assumptions 1.1, 4.4 on the distribution and the estimator respectively.
Appendix C Proof of the lower bounds
This section is devoted to the proof of the lower bounds provided by Theorems 3.4-3.5. Before proceeding to the proofs let us briefly sketch the high-level strategy used in this work. In order to prove the lower bounds of Theorems 3.4-3.5 we actually prove two separate lower bounds on the minimax risk. Clearly, if some non-negative quantity is lower-bounded by two different values, then it is lower-bounded by the maximum of the two. The two lower bounds that we prove are naturally connected with the proposed two-step estimator, that is, the first lower bound is connected with the problem of non-parametric estimation of for all and the second describes the estimation of the unknown threshold .
In particular, the first lower bound is closely related to the one provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009), though, crucially, the continuity Assumption 1.1 makes the proof more involved. The second lower bound is based on two-hypothesis testing and is derived by constructing two different marginal distributions of and a fixed regression vector . In this part we make use of Pinsker’s inequality recalled in Lemma A.2.
In order to discriminate between the supervised and the semi-supervised procedures we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not “sensitive” to the expectation taken *w.r.t.* the unlabeled dataset , that is, randomness is only induced by the labeled dataset . This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset for supervised procedures. Indeed, let be any supervised estimator in the sense of Definition 1.4, then for any real valued function of confidence sets we have
[TABLE]
with being an arbitrary set of points in .
C.1 Part I:
Here we prove that the rate is optimal for semi-supervised methods; as already mentioned, the rate for the supervised methods can be obtained by formally setting . The constants are always assumed to be independent of and can differ from line to line. Let us fix and . For a positive constant we define the following sequence
[TABLE]
To prove the lower bound we construct two distributions and on sharing the same regression function and with different marginals admitting densities . First, for a fixed parameter and fixed constants to be specified we define the following sets
[TABLE]
Let us denote by for the centers of , and . Using these sets we define the regression vector as
[TABLE]
In order to define the functions for we first define a one dimensional function of two real-valued parameters
[TABLE]
Figure 1 illustrates the behavior of function in one dimension. Note that for every the function above is infinitely smooth. Using the definition of we define the functions for as
[TABLE]
and the constant is chosen small enough so that each function for is -Hölder. Let us point out that such value exists and is independent of , indeed, the mapping
[TABLE]
is infinitely smooth, thus it is -Hölder for a properly chosen . Figure 2 demonstrates the behavior of the considered construction in one dimension. Note that for are obtained from the previous mapping by re-scaling which preserves the Hölder constant . Same reasoning applies to for .
Since one can check that the following relations hold true
[TABLE]
which will help us to ensure that the thresholds under are and respectively. Now, we define two marginal distributions by their densities as
[TABLE]
and both are equal to zero in unspecified regions. Clearly, the strong density assumption is satisfied on and since the density is lower and upper-bounded by a constant independent of both . The parameter is chosen such that the strong density assumption on for is satisfied. Notice that
[TABLE]
for some constant independent of , thus we set . For these hypotheses one can easily check that the thresholds and the optimal -sets are given as
[TABLE]
The margin assumption: we are in position to check the margin Assumption 2.1. Let , thus for every and every we have
[TABLE]
moreover for every and every we can write
[TABLE]
Hence, for the [math] hypothesis there exists independent of such that
[TABLE]
Therefore we can write using the strong density assumption
[TABLE]
Finally notice that for every such that we have for some
[TABLE]
which implies that for some positive independent of we can write
[TABLE]
This implies that for as long as (and since we have ) the margin assumption is satisfied. Moreover, these conditions imply that , which we will also require while proving the supervised part of the rate. The same reasoning can be carried out for the first hypothesis on the set .
Finally, the parameters are chosen as constants independent of such that there exists a smooth connection between the parts of the regression functions which are defined on . Notice that such a choice is possible since, by the construction of the functions for , they vanish on the boundaries of . Thus in the region it is sufficient to construct a function which connects four different constants smoothly. We omit the details of this construction and hope that the guidelines provided above are sufficient for the understanding.
Notice that the constructed distributions satisfy Assumption 1.1 since the measures are only defined on and the regression functions on these sets are not concentrated around any constant.
Before proceeding to the final stage of the proof let us mention that in what follows we use the de Finetti (de Finetti, 1972, 1974) notation which is common in probability. That is, given a probability measure on some measurable space and a measurable function we write
[TABLE]
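For readers unfamiliar with this convention, the display below sketches what de Finetti's notation usually means: the same symbol denotes a measure and the expectation it induces. The symbols $\mathbf{P}$, $(\mathcal{X}, \mathcal{A})$ and $f$ are placeholder names chosen here and need not match those of the original display.

```latex
\mathbf{P}(f) \;:=\; \int_{\mathcal{X}} f(x)\, d\mathbf{P}(x),
\qquad
\mathbf{P}(A) \;=\; \mathbf{P}(\mathbf{1}_A) \quad \text{for every } A \in \mathcal{A}.
```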
Bound on the KL-divergence: we start by computing the KL-divergence between and
[TABLE]
Lower bound for the Hamming risk: first of all let us introduce the following notation for
[TABLE]
Recall that we are interested in the following quantity
[TABLE]
since the hypotheses we can write
[TABLE]
where is defined as
[TABLE]
thus, for the Hamming risk we can write
[TABLE]
Now we focus our attention on the sum of the two Hamming differences which appear on the right-hand side of the above inequality
[TABLE]
Substituting this lower bound into the initial inequality we arrive at
[TABLE]
which implies the desired lower bound on the Hamming risk.
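To make the quantity being bounded concrete, the following is a minimal sketch, assuming that the Hamming distance between two confidence sets is the cardinality of their symmetric difference and that the Hamming risk is its expectation over the features. The function names and the toy label sets are hypothetical and serve only as an illustration.

```python
def hamming_set_distance(gamma_a, gamma_b):
    """Cardinality of the symmetric difference of two label sets."""
    return len(set(gamma_a) ^ set(gamma_b))


def empirical_hamming_risk(predicted_sets, oracle_sets):
    """Average Hamming distance between predicted and oracle sets.

    This is an empirical stand-in for the expectation over the features;
    both arguments are sequences of label sets, one per observation.
    """
    if len(predicted_sets) != len(oracle_sets):
        raise ValueError("both sequences must have the same length")
    total = sum(hamming_set_distance(a, b)
                for a, b in zip(predicted_sets, oracle_sets))
    return total / len(predicted_sets)


# Hypothetical toy example: two observations, labels in {1, 2, 3}.
predicted = [{1}, {1, 2}]
oracle = [{1}, {2, 3}]
risk = empirical_hamming_risk(predicted, oracle)
```

In this toy example the first pair of sets agrees and the second differs on labels 1 and 3, so the average distance is one.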
Lower bound for the excess risk: this part is analogous to the case of the Hamming distance. Let us recall that for every we have the following expression for
[TABLE]
Again, recall that we are interested in
[TABLE]
similarly to the previous case, since the hypotheses we can write
[TABLE]
where is defined as
[TABLE]
we can write
[TABLE]
and we continue in a similar fashion
[TABLE]
since for all we obtain
[TABLE]
then, since for all , we have
[TABLE]
Thus,
[TABLE]
This concludes the first part of the lower bounds.
C.2 Part II:
In this section we prove that in the case of the Hamming risk the rate is minimax optimal. Notice that, thanks to Proposition 3.1, a lower bound of order on the Hamming risk immediately implies a lower bound of order on both and .
The proof is based on the reduction of the Hamming risk to a multiple hypotheses testing problem and an application of Fano’s inequality provided by Birgé (2005) recalled in Lemma A.1.
Assume that and fix some , define the regular grid on as
[TABLE]
and denote by the closest point of the grid to the point . Such a grid defines a partition of the unit cube denoted by . Besides, denote by for all . For a fixed integer and for any define , . Additionally we introduce the following set . For every we build the distribution , such that the marginal distribution is independent of and the regression vector is constructed as
[TABLE]
where , , and are to be specified. The constants are set as
[TABLE]
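The regular grid and the nearest-grid-point map used in this construction can be sketched as follows. This is only an illustration under an assumption: the grid is taken to consist of the centers of the sub-cubes of side length 1/q that partition the unit cube, as in the analogous construction of Audibert and Tsybakov (2007). The names `regular_grid` and `closest_grid_point`, as well as the tie-breaking rule on cell boundaries, are choices made here.

```python
import itertools
import math


def regular_grid(q, d):
    """Centers of the q**d sub-cubes of side 1/q that partition [0, 1]^d."""
    axis = [(2 * k + 1) / (2 * q) for k in range(q)]
    return list(itertools.product(axis, repeat=d))


def closest_grid_point(x, q):
    """Grid point nearest to x in [0, 1]^d.

    Points on a cell boundary are assigned to the upper cell, except at 1,
    which is an arbitrary tie-breaking choice made for this sketch.
    """
    cells = [min(int(math.floor(xi * q)), q - 1) for xi in x]
    return tuple((2 * k + 1) / (2 * q) for k in cells)


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


grid = regular_grid(4, 2)
point = (0.3, 0.6)
nearest = closest_grid_point(point, 4)
# Brute-force check against the full grid.
brute_force = min(grid, key=lambda g: squared_distance(g, point))
```

All points sharing the same nearest grid point form one cell of the partition, so the map `closest_grid_point` indexes the cells on which the hypotheses are allowed to differ.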
The function is constructed as
[TABLE]
the function is infinitely differentiable, is equal to zero on and to one on . Figure 3 shows the behavior of . Taking the constant big enough independently of we can ensure that the function is -Hölder.
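For intuition, here is one standard way to build an infinitely differentiable transition function. It is a sketch of a classical construction, not necessarily the one used in the display above: the endpoints 0 and 1 of the transition interval and the function names are assumptions made for illustration.

```python
import math


def _bump_tail(t):
    """exp(-1/t) for t > 0 and 0 otherwise; smooth on the whole real line."""
    return math.exp(-1.0 / t) if t > 0 else 0.0


def smooth_step(t):
    """Infinitely differentiable function equal to 0 for t <= 0 and 1 for t >= 1.

    The denominator is positive for every t because at least one of
    t and 1 - t is strictly positive.
    """
    num = _bump_tail(t)
    return num / (num + _bump_tail(1.0 - t))


values = [smooth_step(i / 10) for i in range(-5, 16)]
```

Rescaling the argument moves the transition to any desired interval, and multiplying by a small constant controls the Hölder constant of the resulting function.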
The function is constructed similarly to the previous part of the rate, that is, for we choose
[TABLE]
with being sufficiently small such that is -Hölder and upper-bounded by . For the function we consider the following construction
[TABLE]
where is defined as
[TABLE]
Figure 4 illustrates the behavior of this function. The constant is chosen in such a way that the constructed function is -Hölder and upper-bounded by . Notice that the function for all satisfies
[TABLE]
Finally, the function is any -Hölder function with sufficiently bounded variation which is not concentrated around any constant, for example
[TABLE]
for chosen small enough to ensure that it is -Hölder and has variation bounded by .
It remains to define the marginal distribution of the vector . We select a Euclidean ball in denoted by that has an empty intersection with and whose Lebesgue measure is . The density of the marginal distribution of is constructed as
- for every and every or ,
- for every ,
- for every other ,
for some to be specified. Now, we check that the distributions constructed above belong to the set for every . Namely, we check the following list of assumptions:
- the functions define a regression function for every , that is, for each we have and ,
- the functions are -Hölder,
- the function is continuous,
- the threshold is equal to for every ,
- the marginal distribution satisfies the strong density assumption,
- the regression function satisfies the -margin assumption.
The regression function is well defined: to see this, notice that for every and every we have by construction
[TABLE]
and the combination of both with implies that . Moreover, as long as for every we have for every
[TABLE]
and by construction of the function we have for every , every and every
[TABLE]
due to the choice of we have
[TABLE]
Similarly, for every , every and every
[TABLE]
and with the choice of specified above and the constraint we have
[TABLE]
Thus, the construction above defines some regression function for every .
The regression function is -Hölder: this implication follows immediately from the construction of .
Continuity of : first let us show that is continuous for every . For the continuity follows from the fact that is not concentrated around any constant. For we first write
[TABLE]
thus for this choice of the continuity follows from the fact that and are not concentrated around any constant.
Threshold : to see this notice that for every ,
[TABLE]
and the condition on the threshold follows from the continuity of . Besides, the corresponding -Oracle sets are given for every as
[TABLE]
The strong density assumption: the strong density assumption can be checked following the proof of (Audibert and Tsybakov, 2007, Theorem 3.5) where an analogous construction of the marginal distribution was considered.
-margin assumption: for all , all and all we have
[TABLE]
thus for the margin assumption is satisfied. It remains to check that the margin assumption is satisfied for . Fix an arbitrary and , then for all we can write
[TABLE]
We separately upper-bound both terms which appear on the right hand side of the equality.
[TABLE]
clearly there exists a constant such that for all we have
[TABLE]
Therefore for some constant we can write
[TABLE]
thanks to the strong density assumption we can write for some
[TABLE]
Thus since and we can write for some
[TABLE]
To finish this part it remains to upper-bound the other term in the margin assumption
[TABLE]
using the fact that the function for all satisfies
[TABLE]
we can write for all
[TABLE]
moreover, for all we can write
[TABLE]
and finally for we can write
[TABLE]
The above implies that for some constant we have for all
[TABLE]
Thus the margin assumption is satisfied as long as
- •
;
- •
.
Similarly one can check that the margin assumption is satisfied for .
Bound on the KL-divergence: we are in a position to upper-bound the KL divergence between any two hypotheses. Fix some , then using the upper bound on we can write for some
[TABLE]
How many hypotheses to take: let us recall the following result which is a version of the Varshamov-Gilbert bound (Gilbert, 1952; Varshamov, 1957).
Lemma C.1.
Let denote the Hamming distance between given by
[TABLE]
There exists such that for all we have
[TABLE]
and .
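As an illustration of what such a result provides, the sketch below greedily builds a well-separated subset of the binary hypercube and checks it against the classical form of the Varshamov-Gilbert bound. We assume that form here (for m at least 8, a subset containing the zero vector, with pairwise Hamming distance at least m/8 and at least 2^{m/8} elements); the exact statement of Lemma C.1 may differ. The greedy search is a brute-force device for small m and plays no role in the proof.

```python
import itertools
import math


def hamming(a, b):
    """Hamming distance between two binary vectors encoded as integers."""
    return bin(a ^ b).count("1")


def greedy_separated_subset(m, min_dist):
    """Scan {0, 1}^m in increasing order and keep every vector that is at
    Hamming distance at least min_dist from all vectors kept so far."""
    chosen = []
    for candidate in range(2 ** m):
        if all(hamming(candidate, c) >= min_dist for c in chosen):
            chosen.append(candidate)
    return chosen


m = 10                          # small dimension so brute force stays cheap
min_dist = math.ceil(m / 8)     # integer distances: at least m / 8
subset = greedy_separated_subset(m, min_dist)

smallest_gap = min(hamming(a, b)
                   for a, b in itertools.combinations(subset, 2))
```

The scan starts from the integer 0, so the zero vector always belongs to the subset, which matches the role of the reference hypothesis in the reduction to multiple testing.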
Denote the set provided by Lemma C.1 and by the set of distributions with . Taking into account all the above we conclude that satisfies the assumptions of our result.
Lower bound on the Hamming risk (applying Birgé’s Lemma A.1): finally, we are in a position to lower-bound the Hamming risk. Recall that we are interested in the following quantity
[TABLE]
The rest of the proof follows standard arguments, which again using the de Finetti notation read as
[TABLE]
Denote by the following minimizer
[TABLE]
thus if we can write using the definition of and the triangle inequality
[TABLE]
These arguments and Birgé’s Lemma A.1 imply that
[TABLE]
Since the marginal distribution of the vector is shared among the hypotheses, using the upper-bound on the -divergence and the conditions on we get for some
[TABLE]
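The role of the shared marginal can be checked numerically on a toy discrete example. The sketch below relies on two standard facts: when two joint distributions of (X, Y) share the marginal of X, their KL divergence is the marginal-weighted average of the conditional KL divergences, and for n i.i.d. observations the divergence is multiplied by n. All numbers and function names are hypothetical.

```python
import math


def kl_discrete(p, q):
    """KL(p || q) for two discrete distributions on the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_shared_marginal(marginal, cond_p, cond_q):
    """KL between two joint laws of (X, Y) with the same marginal of X,
    computed as the marginal-weighted average of conditional KLs."""
    return sum(w * kl_discrete(p, q)
               for w, p, q in zip(marginal, cond_p, cond_q))


def joint(marginal, cond):
    """Flatten the joint law of (X, Y) into a single probability vector."""
    return [w * pk for w, row in zip(marginal, cond) for pk in row]


# Hypothetical toy hypotheses: X takes 3 values, Y takes 2 values.
marginal = [0.5, 0.3, 0.2]
cond_p = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8]]
cond_q = [[0.5, 0.5], [0.5, 0.5], [0.3, 0.7]]

kl_one = kl_shared_marginal(marginal, cond_p, cond_q)
kl_direct = kl_discrete(joint(marginal, cond_p), joint(marginal, cond_q))

n = 100                 # number of i.i.d. labeled observations (toy value)
kl_n = n * kl_one       # KL between the n-fold product measures
```

Only the regions where the conditional distributions differ contribute to the sum, which is why the bound depends on the mass of the cells where two hypotheses disagree.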
Finally, let , and for some small enough we get for some and
[TABLE]
One can easily verify that this choice of parameters is possible as long as and clearly with our choice we have . As already mentioned the lower bound for the excess risk and the discrepancy follows from Propositions 3.1 and 3.2.
Appendix D Inconsistency of top- approach
In this section we prove Proposition 3.3. The proof gives an explicit construction of a distribution whose -Oracle satisfies for all in some with . Clearly, if such a distribution exists then there is no estimator in that would consistently estimate this -Oracle. Let be a fixed integer and . For the proof we shall construct one distribution for which none of the estimators with a fixed information can perform well. We start the construction by specifying the density of the marginal distribution . Define a disk in for some positive as . First of all fix some parameters which are independent of . The density is supported on .
Moreover,
- •
for all ,
- •
for all ,
- •
, for all ,
- •
otherwise,
where is chosen small enough to ensure that . The regression functions are defined as
[TABLE]
where the constant is chosen small enough to ensure that these functions are -Hölder and have sufficiently small variation. Consider an arbitrary infinitely many times differentiable function which satisfies for all and for all . Then, the functions and are defined as , . The above construction defines a distribution for which we have
[TABLE]
Indeed, let us evaluate the following quantity under the assumption that
[TABLE]
Thus, using this distribution we can write for any classifier with fixed cardinality
[TABLE]
where the first inequality follows from the observation that for there is always at least one label such that . Thus, since the constant is chosen to satisfy we have for any
[TABLE]
If is such that we get
[TABLE]
By construction, the regression vector is -Hölder and the density is lower- and upper-bounded by some positive constants. Hence, it remains to check that the constructed distribution satisfies the -margin assumption. This can be achieved by an appropriate choice of . Indeed, on the sets there is a “corridor” of constant size between the regression functions and the threshold . The threshold is only approached by the regression function on the set . As all the parameters in our construction are independent of we can find a value small enough so that the -margin assumption is verified for a fixed .
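The mechanism behind this inconsistency can be illustrated on a toy example. The sketch below assumes that the oracle is a thresholding rule on the conditional probabilities, so that its cardinality varies with the point, whereas a top-k rule always outputs exactly k labels. The probability vectors, the threshold value 0.3 and the function names are hypothetical.

```python
def threshold_set(probs, threshold):
    """Labels whose conditional probability is at least the threshold."""
    return {k for k, p in enumerate(probs) if p >= threshold}


def top_k_set(probs, k):
    """The k labels with the largest conditional probabilities."""
    order = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)
    return set(order[:k])


# Two hypothetical feature points with different levels of ambiguity.
probs_easy = [0.90, 0.05, 0.05]   # one clearly dominant label
probs_hard = [0.40, 0.35, 0.25]   # two plausible labels
threshold = 0.3

oracle_easy = threshold_set(probs_easy, threshold)
oracle_hard = threshold_set(probs_hard, threshold)

# For each fixed k, record whether top-k recovers the oracle at both points.
recovers_both = {
    k: (top_k_set(probs_easy, k) == oracle_easy
        and top_k_set(probs_hard, k) == oracle_hard)
    for k in (1, 2, 3)
}
```

In this example the thresholded sets have sizes one and two, so no single value of k reproduces both of them. The construction in the proof makes this mismatch occur on a set of positive probability.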
References
- Anbar, D. (1977). A Modified Robbins-Monro Procedure Approximating the Zero of a Regression Function from Below. Ann. Statist. 5, 229–234.
- Audibert, J.-Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35, 608–633.
- Bartlett, P. and Wegkamp, M. (2008). Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 9, 1823–1840.
- Bellec, P. C., Dalalyan, A. S., Grappin, E. and Paris, Q. (2018). On the prediction loss of the lasso in the partially labeled setting. Electron. J. Statist. 12, 3443–3472.
- Birgé, L. (2005). A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory 51.
- Bobkov, S. and Ledoux, M. (2016). One-dimensional empirical measures, order statistics and Kantorovich transport distances. To appear in the Memoirs of the Amer. Math. Soc.
- Brown, L. and Low, M. (1996). A constrained risk inequality with applications to nonparametric functional estimation. Ann. Statist. 24, 2524–2535.
- Chow, C.-K. (1957). An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers 4, 247–254.
