Minimax semi-supervised confidence sets for multi-class classification

Evgenii Chzhen (LAMA); Christophe Denis (LAMA); Mohamed Hebiri (LAMA)

arXiv:1904.12527·math.ST·April 30, 2019

TL;DR

This paper develops semi-supervised confidence set classifiers for multi-class problems, achieving faster convergence rates than supervised methods under certain assumptions, with theoretical guarantees and empirical validation.

Contribution

It introduces a semi-supervised minimax framework for confidence set classification with controlled size, establishing convergence rates and demonstrating superiority over supervised methods.

Findings

1. Semi-supervised estimators outperform supervised ones when enough unlabeled data are available.

2. Semi-supervised estimators achieve faster convergence rates under the margin and Hölder conditions.

3. Empirical results confirm the theoretical convergence improvements.

Abstract

In this work we study the semi-supervised framework of confidence set classification with controlled expected size in minimax settings. We obtain semi-supervised minimax rates of convergence under the margin assumption and a Hölder condition on the regression function. Besides, we show that if no further assumptions are made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work. We establish that the best achievable rate for any supervised method is $n^{-1/2}$, even if the margin assumption is extremely favorable. On the contrary, semi-supervised estimators can achieve faster rates of convergence provided that sufficiently many unlabeled samples are available. We additionally perform numerical evaluation of the proposed algorithms empirically confirming our theoretical findings.

Tables

Table 1. This table summarizes observations of Corollary 3.6 for $\Psi^{\operatorname{E}}_{n,N}$ and $\Psi^{\operatorname{D}}_{n,N}$. Depending on the relations between $\alpha,\gamma,d$ and $N,n$, the semi-supervised approach can significantly improve the rates of convergence. SE and SSE denote the supervised and semi-supervised estimators.

$\frac{(1+\alpha)\gamma}{2\gamma+d}$	$N, n$	SE rate	SSE rate	SSE better than SE
$\leq \frac{1}{2}$	$N\in\mathbb{N}$, $n\in\mathbb{N}$	$n^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}$	$n^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}$	NO
$> \frac{1}{2}$	$N=\mathcal{O}(n)$	$n^{-\frac{1}{2}}$	$n^{-\frac{1}{2}}$	NO
$> \frac{1}{2}$	$n=o(N)$	$n^{-\frac{1}{2}}$	$N^{-\frac{1}{2}}\vee n^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}$	YES
$> \frac{1}{2}$	$N=\Omega\left(n^{\frac{2(1+\alpha)\gamma}{2\gamma+d}}\right)$	$n^{-\frac{1}{2}}$	$n^{-\frac{(1+\alpha)\gamma}{2\gamma+d}}$	YES
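As a worked reading of Table 1, the rows can be instantiated with concrete parameters. The values $\alpha=1$, $\gamma=1$ and $d\in\{1,4\}$ below are assumptions chosen only for arithmetic convenience; they are not taken from the paper's experiments, and multiplicative constants are ignored.

```latex
\[
\alpha=1,\ \gamma=1,\ d=4:\qquad
\frac{(1+\alpha)\gamma}{2\gamma+d}=\frac{2}{6}=\frac{1}{3}\leq\frac{1}{2}
\quad\Longrightarrow\quad
\text{SE rate}=\text{SSE rate}=n^{-1/3}.
\]
\[
\alpha=1,\ \gamma=1,\ d=1:\qquad
\frac{(1+\alpha)\gamma}{2\gamma+d}=\frac{2}{3}>\frac{1}{2}
\quad\Longrightarrow\quad
\text{SE rate}=n^{-1/2},\qquad
\text{SSE rate}=n^{-2/3}\ \text{ once }\ N=\Omega\bigl(n^{4/3}\bigr).
\]
```

For instance, with $n=10^{3}$ labeled points in the second case, $n^{-1/2}\approx 0.032$ while $n^{-2/3}=0.01$, and the faster rate requires an unlabeled sample of order $n^{4/3}=10^{4}$.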

Table 2. For each of the $B=100$ repetitions and each model, we derive the estimated errors $\operatorname{P}_{M}$ of the $\beta$-Oracle and of the $\operatorname{top}$-$\beta$ Oracle w.r.t. $\beta$. We compute the means and standard deviations (in parentheses) over the $B=100$ repetitions. Top: the data are generated according to $K=10$. Bottom: the data are generated according to $K=100$.

$K = 10$
$\beta$	$\beta$-Oracle	$\operatorname{top}$-$\beta$ Oracle
2	0.05 (0.01)	0.09 (0.01)
5	0.00 (0.00)	0.01 (0.00)

Table 3. For each of the $B=100$ repetitions and each model, we derive the estimated information levels $\operatorname{I}_{M}$ of the $\beta$-Oracle set w.r.t. $\beta$. We compute the means and standard deviations (in parentheses) over the $B=100$ repetitions. Left: the data are generated according to $K=10$. Right: the data are generated according to $K=100$.

$\beta$	$K = 10$	$K = 100$
2	2.00 (0.03)	2.00 (0.03)
5	5.00 (0.08)	5.00 (0.06)
10	$\cdot$	10.00 (0.13)
20	$\cdot$	20.02 (0.31)

Table 4. For each of the $B=100$ repetitions and for each model, we derive the estimated errors $\operatorname{P}$ of three different $\hat{\Gamma}_{\operatorname{SSE}}$'s w.r.t. $\beta$. We compute the means and standard deviations (in parentheses) over the $B=100$ repetitions. For each $\beta$ and for each $N$, the $\hat{\Gamma}_{\operatorname{SSE}}$'s, as well as the $\operatorname{top}$ procedures, are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning methods. Top: the data are generated according to $K=10$. Bottom: the data are generated according to $K=100$.

	$K = 10$
	$\hat{\Gamma}_{\operatorname{SSE}}$			$\operatorname{top}$-$\beta$
$\beta$	rforest	softmax reg	deep learn	rforest	softmax reg	deep learn
2	0.09 (0.01)	0.06 (0.01)	0.09 (0.01)	0.13 (0.01)	0.10 (0.01)	0.13 (0.02)
5	0.01 (0.00)	0.00 (0.00)	0.01 (0.00)	0.02 (0.00)	0.01 (0.00)	0.02 (0.00)

Table 5. For each of the $B=100$ repetitions and for each model, we derive the estimated information levels $\operatorname{I}$ of three different $\hat{\Gamma}_{\operatorname{SSE}}$'s w.r.t. $\beta$ and the sample size $N$. We compute the means and standard deviations (in parentheses) over the $B=100$ repetitions. For each $\beta$ and each $N$, the $\hat{\Gamma}_{\operatorname{SSE}}$'s are based on, from left to right, rforest, softmax reg and deep learn, which are respectively the random forest, the softmax regression and the deep learning procedures. Top: the data are generated according to $K=10$. Bottom: the data are generated according to $K=100$.

	$K = 10$
	$N = 100$			$N = 10000$
$\beta$	rforest	softmax reg	deep learn	rforest	softmax reg	deep learn
2	2.01 (0.09)	2.01 (0.10)	2.02 (0.11)	2.00 (0.02)	2.00 (0.03)	2.00 (0.03)
5	5.02 (0.18)	4.99 (0.20)	5.00 (0.21)	5.00 (0.06)	5.00 (0.08)	5.00 (0.07)

Equations

$$\text{error}\quad\operatorname{P}(\Gamma)=\mathbb{P}\left(Y\notin\Gamma(X)\right),\qquad\text{information}\quad\operatorname{I}(\Gamma)=\mathbb{E}_{\mathbb{P}_{X}}\left|\Gamma(X)\right|,$$

$$\Gamma^{*}_{\beta}\in\arg\min\left\{\operatorname{P}(\Gamma)\,:\,\Gamma\in\Upsilon\ \text{s.t.}\ \operatorname{I}(\Gamma)=\beta\right\}.$$

$$G(t)\vcentcolon=\sum_{k=1}^{K}\left(1-F_{p_{k}}(t)\right)=\sum_{k=1}^{K}\mathbb{P}_{X}\left(p_{k}(X)>t\right),$$

$$\Gamma^{*}_{\beta}(x)=\left\{k\in[K]\,:\,p_{k}(x)\geq G^{-1}(\beta)\right\},$$

$$G^{-1}(\beta)\vcentcolon=\inf\left\{t\in[0,1]\,:\,G(t)\leq\beta\right\}.$$

$$\operatorname{R}_{\beta}(\Gamma)=\operatorname{P}(\Gamma)+G^{-1}(\beta)\operatorname{I}(\Gamma).$$

$$\operatorname{R}_{\beta}(\Gamma)-\operatorname{R}_{\beta}(\Gamma^{*}_{\beta})=\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_{X}}\left[\left|p_{k}(X)-G^{-1}(\beta)\right|\mathbb{1}_{\{k\in\Gamma(X)\triangle\Gamma^{*}_{\beta}(X)\}}\right].$$

$$\hat{\Gamma}:\bigcup_{n,N\in\mathbb{N}}\left(\mathbb{R}^{d}\times[K]\right)^{n}\times\left(\mathbb{R}^{d}\right)^{N}\to\Upsilon,$$

$$\hat{\Gamma}(x;\mathcal{D}_{n},\mathcal{D}_{N})=\hat{\Gamma}(x;\mathcal{D}_{n},\mathcal{D}^{\prime}_{N}),\quad\text{a.e. }x\in\mathbb{R}^{d}\text{ w.r.t. the Lebesgue measure}.$$

$$\hat{\Gamma}_{\operatorname{top}}(x)=\left\{\sigma_{1}(x),\ldots,\sigma_{\beta}(x)\right\},\quad\forall x\in\mathbb{R}^{d}.$$

$$\hat{G}(\cdot)=\frac{1}{\lceil n/2\rceil}\sum_{X_{i}\in\mathcal{D}_{\lceil n/2\rceil}}\sum_{k=1}^{K}\mathbb{1}_{\{\hat{p}_{k}(X_{i})\geq\cdot\}},$$

$$\hat{\Gamma}_{\operatorname{SE}}(x)=\left\{k\in[K]\,:\,\hat{p}_{k}(x)\geq\hat{G}^{-1}(\beta)\right\},\quad\forall x\in\mathbb{R}^{d}.$$

$$\hat{G}(\cdot)=\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\sum_{k=1}^{K}\mathbb{1}_{\{\hat{p}_{k}(X_{i})\geq\cdot\}},$$

$$\hat{\Gamma}_{\operatorname{SSE}}(x)=\left\{k\in[K]\,:\,\hat{p}_{k}(x)\geq\hat{G}^{-1}(\beta)\right\},\quad\forall x\in\mathbb{R}^{d}.$$

$$\Psi^{\operatorname{H}}_{n,N}(\hat{\Gamma};\mathcal{P})$$

$$\Psi^{\operatorname{E}}_{n,N}(\hat{\Gamma};\mathcal{P})$$

$$\Psi^{\operatorname{D}}_{n,N}(\hat{\Gamma};\mathcal{P})$$

$$\Psi^{\square}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})\vcentcolon=\inf_{\hat{\Gamma}\in\hat{{\bf\Gamma}}}\Psi^{\square}_{n,N}(\hat{\Gamma};\mathcal{P}),$$

$$\Psi^{\square}_{n,N}(\hat{\Upsilon};\mathcal{P})=\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}};\mathcal{P})\wedge\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}};\mathcal{P}).$$

$$\Gamma^{*}_{\beta}(\cdot)=\left\{k\in[K]\,:\,f^{*}_{k}(\cdot)\geq G_{f^{*}}^{-1}(\beta)\right\},$$

$$\Gamma^{*}_{a}\in\arg\min\left\{\operatorname{I}(\Gamma)\,:\,\Gamma\in\Upsilon\ \text{s.t.}\ \operatorname{P}(\Gamma)\leq a\right\},$$

$$\Gamma^{*}_{a}(\cdot)=\left\{k\in[K]\,:\,p_{k}(\cdot)\geq t_{a}\right\},$$

$$t_{a}$$

$$\Gamma_{p}(\lambda)=\left\{x\in\mathbb{R}^{d}\,:\,p(x)\geq\lambda\right\},$$

$$\|p\|_{\infty,\mu}\vcentcolon=\inf\left\{C\geq 0\,:\,\max_{k\in[K]}\left|p_{k}(x)\right|\leq C,\ \text{a.e. }x\in\mathbb{R}^{d}\text{ w.r.t. }\mu\right\}.$$

$$\mathbb{P}_{X}\left(0<\lvert p_{k}(X)-G^{-1}(\beta)\rvert\leq t\right)\leq C_{1}t^{\alpha}.$$

$$\mathbb{P}_{X}\left(\lvert p_{k}(X)-G^{-1}(\beta)\rvert\leq t\right)\leq C_{1}t^{\alpha},$$

$$\lim_{t\to+0}\mathbb{P}_{X}\left(\lvert p_{k}(X)-G^{-1}(\beta)\rvert\leq t\right)=0,$$

$$\mathbb{P}_{X}\left(0<\lvert p_{k}(X)-G^{-1}(\beta)\rvert\leq t\right)\leq c_{1}t^{\alpha},\quad\text{with }c_{1}=C_{1}\vee t_{0}^{-\alpha}.$$

$$\operatorname{Leb}\left(A\cap\mathcal{B}(x,r)\right)\geq c_{0}\operatorname{Leb}\left(\mathcal{B}(x,r)\right),\quad\forall r\in(0,r_{0}],\ \forall x\in A.$$

$$0<\mu_{\min}\leq\mu(x)\leq\mu_{\max}<\infty,\quad\forall x\in A.$$

Full text

Minimax semi-supervised confidence sets for multi-class classification

Evgenii Chzhen, Christophe Denis and Mohamed Hebiri

LAMA, Université Paris-Est – Marne-la-Vallée

Cité Descartes, Bâtiment Copernic

5 boulevard Descartes

77454 Marne-la-Vallée cedex 2

Abstract

In this work we study the semi-supervised framework of confidence set classification with controlled expected size in minimax settings. We obtain semi-supervised minimax rates of convergence under the margin assumption and a Hölder condition on the regression function. Besides, we show that if no further assumptions are made, there is no supervised method that outperforms the semi-supervised estimator proposed in this work. We establish that the best achievable rate for any supervised method is $n^{-1/2}$ , even if the margin assumption is extremely favorable. On the contrary, semi-supervised estimators can achieve faster rates of convergence provided that sufficiently many unlabeled samples are available. We additionally perform numerical evaluation of the proposed algorithms empirically confirming our theoretical findings.

MSC subject classifications: 62G05, 62G30, 62H05, 68T10.

Keywords: multi-class classification, confidence sets, minimax optimality, semi-supervised classification.

This work was partially supported by “Labex Bézout” of Université Paris-Est.

1 Introduction

Let $K\geq 2$ and $(X,Y)\in\mathbb{R}^{d}\times[K]\vcentcolon=\left\{1,\ldots,K\right\}$ be a random couple distributed according to a distribution $\mathbb{P}$ on $\mathbb{R}^{d}\times[K]$ , where $X\in\mathbb{R}^{d}$ is seen as the feature vector and $Y\in[K]$ as the class. This problem falls within the scope of the multi-class setting where the goal is to predict the label $Y$ for a given feature. Commonly, prediction is performed by a classifier that outputs a single label. However, in the confidence set framework, the objective differs: we aim at predicting a set of labels instead of a single one. This problem has been studied in a few works, and we consider in this contribution the setup put forward by Denis and Hebiri (2017). The essential feature of their perspective is the control of the size of confidence sets in expectation. While they provided a procedure to build confidence sets based on Empirical Risk Minimization (ERM) and established upper bounds, the present work aims at giving a general analysis of the confidence problem in the minimax sense.

1.1 Problem statement

Throughout the paper, we denote by $\mathbb{P}_{X}$ the marginal distribution of $X\in\mathbb{R}^{d}$ and by $p(\cdot)\vcentcolon=(p_{1}(\cdot),\ldots,p_{K}(\cdot))^{\top}$ the regression function defined for all $k\in[K]$ and all $x\in\mathbb{R}^{d}$ as $p_{k}(x)\vcentcolon=\mathbb{P}(Y=k|X=x)$. For any sets $A,A^{\prime}\subset[K]$ we denote by $A\triangle A^{\prime}$ their symmetric difference. We assume that two data samples $\mathcal{D}_{n},\mathcal{D}_{N}$ are available. The first sample $\mathcal{D}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$ consists of $n\in\mathbb{N}$ *i.i.d.* copies of $(X,Y)\in\mathbb{R}^{d}\times[K]$ and the second sample $\mathcal{D}_{N}=\{X_{i}\}_{i=n+1}^{n+N}$ consists of $N\in\mathbb{N}$ *i.i.d.* copies of $X\in\mathbb{R}^{d}$.

A confidence set classifier $\Gamma$ is a measurable function from $\mathbb{R}^{d}$ to $2^{[K]}\vcentcolon=\left\{A\,:\,A\subset[K]\right\}$ , that is, $\Gamma:\mathbb{R}^{d}\rightarrow 2^{[K]}$ and we denote by $\Upsilon$ the set of all such functions. For any confidence set $\Gamma:\mathbb{R}^{d}\rightarrow 2^{[K]}$ we define its error and its information as

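$$\text{error}\quad\operatorname{P}(\Gamma)=\mathbb{P}\left(Y\notin\Gamma(X)\right),\qquad\text{information}\quad\operatorname{I}(\Gamma)=\mathbb{E}_{\mathbb{P}_{X}}\left|\Gamma(X)\right|,$$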

respectively, where $\mathbb{E}_{\mathbb{P}_{X}}$ stands for the expectation *w.r.t.* the marginal distribution of $X\in\mathbb{R}^{d}$ and $\left|\Gamma(x)\right|$ is the cardinality of $\Gamma$ at $x\in\mathbb{R}^{d}$.
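The two quantities just defined have direct empirical counterparts on a labeled sample. The sketch below is only an illustration: the function name `empirical_error_and_information`, the rule `toy_gamma` and the four data points are hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence, Set, Tuple


def empirical_error_and_information(
    gamma: Callable[[Sequence[float]], Set[int]],
    sample: Sequence[Tuple[Sequence[float], int]],
) -> Tuple[float, float]:
    """Return (fraction of points with y not in gamma(x), average |gamma(x)|)."""
    misses = 0
    total_size = 0
    for x, y in sample:
        labels = gamma(x)
        misses += y not in labels      # empirical analogue of P(Y not in Gamma(X))
        total_size += len(labels)      # empirical analogue of E|Gamma(X)|
    n = len(sample)
    return misses / n, total_size / n


# Hypothetical toy data: K = 3 classes, one-dimensional features.
toy_sample = [((0.1,), 1), ((0.4,), 2), ((0.6,), 2), ((0.9,), 3)]


def toy_gamma(x: Sequence[float]) -> Set[int]:
    # Hypothetical rule: two labels when x < 0.5, a single label otherwise.
    return {1, 2} if x[0] < 0.5 else {3}


error, information = empirical_error_and_information(toy_gamma, toy_sample)
```

On this toy sample the rule misses one point out of four and outputs 1.5 labels on average, which shows the trade-off between the two criteria: enlarging the sets lowers the error but raises the information.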

For a fixed integer $\beta\in[K]$ a $\beta$ -Oracle confidence set $\Gamma^{*}_{\beta}$ is defined as

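$$\Gamma^{*}_{\beta}\in\arg\min\left\{\operatorname{P}(\Gamma)\,:\,\Gamma\in\Upsilon\ \text{s.t.}\ \operatorname{I}(\Gamma)=\beta\right\}.$$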

The set $\left\{\Gamma\in\Upsilon\,:\,\operatorname{I}(\Gamma)=\beta\right\}$ is always non-empty, as it always contains those confidence sets whose cardinality is equal to $\beta$ for every $x\in\mathbb{R}^{d}$.

In general, the description of a $\beta$-Oracle confidence set might be complicated. Hence, we introduce the following mild assumption, which allows us to obtain an explicit expression.

Assumption 1.1 (Continuity of CDF).

For all $k\in[K]$ the cumulative distribution function (CDF) $F_{p_{k}}(\cdot)\vcentcolon=\mathbb{P}_{X}(p_{k}(X)\leq\cdot)$ of $p_{k}(X)$ is continuous on $(0,1)$ .

Proposition 1.2 ( $\beta$ -Oracle confidence set).

Fix $\beta\in[K-1]$ , and let the function $G:[0,1]\rightarrow[0,K]$ be defined for all $t\in[0,1]$ as

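$$G(t)\vcentcolon=\sum_{k=1}^{K}\left(1-F_{p_{k}}(t)\right)=\sum_{k=1}^{K}\mathbb{P}_{X}\left(p_{k}(X)>t\right),$$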

then under Assumption 1.1 a $\beta$ -Oracle confidence set $\Gamma^{*}_{\beta}$ can be obtained as

$$\Gamma^{*}_{\beta}(x)=\left\{k\in[K]\,:\,p_{k}(x)\geq G^{-1}(\beta)\right\},$$

where we denote by $G^{-1}$ the generalized inverse of $G$ defined for all $\beta\in[0,K]$ as

$$G^{-1}(\beta)\vcentcolon=\inf\left\{t\in[0,1]\,:\,G(t)\leq\beta\right\}.$$
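To make the thresholding construction of Proposition 1.2 concrete, here is a minimal sketch that replaces $\mathbb{P}_{X}$ by a small finite sample of known score vectors $p(x)$. The function names and the four toy score vectors are assumptions made for illustration. A finite sample violates Assumption 1.1 (the empirical distribution has atoms), so the average size of the resulting sets only approximates $\beta$.

```python
from typing import List, Sequence, Set


def G(t: float, scores: Sequence[Sequence[float]]) -> float:
    """Empirical analogue of G(t): average number of classes with p_k(x) > t."""
    return sum(p_k > t for row in scores for p_k in row) / len(scores)


def G_inverse(beta: float, scores: Sequence[Sequence[float]]) -> float:
    """Generalized inverse inf{t in [0, 1] : G(t) <= beta}.

    The empirical G is a non-increasing, right-continuous step function that
    only jumps at observed score values, so the infimum is either 0 or one of
    those values.
    """
    candidates = sorted({0.0} | {p_k for row in scores for p_k in row})
    for t in candidates:
        if G(t, scores) <= beta:
            return t
    return 1.0  # not reached for beta >= 0: G at the largest score equals 0


def oracle_set(p_x: Sequence[float], threshold: float) -> Set[int]:
    """Labels (1-based) whose conditional probability reaches the threshold."""
    return {k + 1 for k, p_k in enumerate(p_x) if p_k >= threshold}


# Hypothetical values of p(x) at four points, K = 3 classes.
toy_scores: List[List[float]] = [
    [0.70, 0.20, 0.10],
    [0.50, 0.35, 0.15],
    [0.45, 0.30, 0.25],
    [0.80, 0.12, 0.08],
]
beta = 1
threshold = G_inverse(beta, toy_scores)
sets = [oracle_set(row, threshold) for row in toy_scores]
average_size = sum(len(s) for s in sets) / len(sets)
```

Here the threshold is the score 0.35, and the second point receives two labels while the others receive one. The average size is 1.25 rather than exactly $\beta=1$ because one score sits exactly at the threshold; under the continuity assumption this event has probability zero.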

Proposition 1.3.

Assume that Assumption 1.1 is fulfilled, then the $\beta$ -Oracle defined in Eq. (1.1) is a minimizer of the following risk

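$$\operatorname{R}_{\beta}(\Gamma)=\operatorname{P}(\Gamma)+G^{-1}(\beta)\operatorname{I}(\Gamma).$$

These propositions have been proven in (Denis and Hebiri, 2017, Proposition 4 and Proposition 7). Consequently, the accuracy of a confidence set $\Gamma$ can be quantified, for instance, by its excess risk

$$\operatorname{R}_{\beta}(\Gamma)-\operatorname{R}_{\beta}(\Gamma^{*}_{\beta})=\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_{X}}\left[\left|p_{k}(X)-G^{-1}(\beta)\right|\mathbb{1}_{\{k\in\Gamma(X)\triangle\Gamma^{*}_{\beta}(X)\}}\right].$$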
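The excess-risk expression can be checked in a few lines. The derivation below is a sketch that uses only the definitions of the error, the information, the risk $\operatorname{R}_{\beta}(\Gamma)=\operatorname{P}(\Gamma)+G^{-1}(\beta)\operatorname{I}(\Gamma)$ and the thresholding form of the $\beta$-Oracle from Proposition 1.2; the shorthand $\theta\vcentcolon=G^{-1}(\beta)$ is introduced here for readability.

```latex
\begin{align*}
\operatorname{P}(\Gamma)
  &= 1-\mathbb{E}_{\mathbb{P}_X}\Bigl[\sum_{k=1}^{K} p_k(X)\,\mathbb{1}_{\{k\in\Gamma(X)\}}\Bigr],
\qquad
\operatorname{I}(\Gamma)
   = \mathbb{E}_{\mathbb{P}_X}\Bigl[\sum_{k=1}^{K}\mathbb{1}_{\{k\in\Gamma(X)\}}\Bigr],\\
\operatorname{R}_{\beta}(\Gamma)
  &= 1-\sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_X}\Bigl[\bigl(p_k(X)-\theta\bigr)\mathbb{1}_{\{k\in\Gamma(X)\}}\Bigr],\\
\operatorname{R}_{\beta}(\Gamma)-\operatorname{R}_{\beta}(\Gamma^{*}_{\beta})
  &= \sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_X}\Bigl[\bigl(p_k(X)-\theta\bigr)
     \bigl(\mathbb{1}_{\{k\in\Gamma^{*}_{\beta}(X)\}}-\mathbb{1}_{\{k\in\Gamma(X)\}}\bigr)\Bigr]\\
  &= \sum_{k=1}^{K}\mathbb{E}_{\mathbb{P}_X}\Bigl[\bigl|p_k(X)-\theta\bigr|\,
     \mathbb{1}_{\{k\in\Gamma(X)\triangle\Gamma^{*}_{\beta}(X)\}}\Bigr].
\end{align*}
```

The last step holds because $k\in\Gamma^{*}_{\beta}(X)\setminus\Gamma(X)$ forces $p_k(X)\geq\theta$ with an indicator difference of $+1$, while $k\in\Gamma(X)\setminus\Gamma^{*}_{\beta}(X)$ forces $p_k(X)<\theta$ with a difference of $-1$. The second line also shows why thresholding at $\theta$ minimizes the risk: each label contributes $-(p_k(X)-\theta)$ when included, so it pays to include exactly the labels with $p_k(X)\geq\theta$.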

The statistical learning problem is then to estimate $\Gamma_{\beta}^{*}$ given the data sample $\mathcal{D}_{n}$ and $\mathcal{D}_{N}$ . The formulation in Eq. (1.1) of the $\beta$ -Oracle appears to be closely related to the level set estimation problem (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). Hence at first sight, the introduction of an unlabeled sample may be surprising. However, in our setup the estimation of the $\beta$ -Oracle does not only rely on the regression function but also on the threshold $G^{-1}(\beta)$ which is unknown beforehand and can be estimated in a semi-supervised way (Denis and Hebiri, 2017). To fix these ideas, we give some examples of possible estimation procedures of $\Gamma_{\beta}^{*}$ .

1.2 Confidence set estimators

An estimator $\hat{\Gamma}$ is a measurable function that maps any given data samples into a confidence set classifier. We shall distinguish two types of estimators: supervised and semi-supervised whose formal definition is provided below.

Definition 1.4 (Supervised and semi-supervised estimators).

A measurable mapping

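$$\hat{\Gamma}:\bigcup_{n,N\in\mathbb{N}}\left(\mathbb{R}^{d}\times[K]\right)^{n}\times\left(\mathbb{R}^{d}\right)^{N}\to\Upsilon,$$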

is called a supervised estimator if for any $n,N\in\mathbb{N}$ and any data samples $\mathcal{D}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$ , $\mathcal{D}_{N}=\{X_{i}\}_{i=n+1}^{n+N}$ , and $\mathcal{D}^{\prime}_{N}=\{X^{\prime}_{i}\}_{i=n+1}^{n+N}$ it holds that

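$$\hat{\Gamma}(x;\mathcal{D}_{n},\mathcal{D}_{N})=\hat{\Gamma}(x;\mathcal{D}_{n},\mathcal{D}^{\prime}_{N}),\quad\text{a.e. }x\in\mathbb{R}^{d}\text{ w.r.t. the Lebesgue measure}.$$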

Otherwise the estimator is called semi-supervised. In the sequel, for the simplicity of notation we write $\hat{\Gamma}(x)$ instead of $\hat{\Gamma}(x;\mathcal{D}_{n},\mathcal{D}_{N})$ where no ambiguity is present.

Intuitively, the supervised estimators do not take into account the information that is provided by the unlabeled sample. Besides, if we denote by $\hat{\Upsilon}$ the set of all estimators, Definition 1.4 generates a natural partition of $\hat{\Upsilon}$ into two disjoint sets: the supervised estimators $\hat{\Upsilon}_{\operatorname{SE}}$ and the semi-supervised estimators $\hat{\Upsilon}_{\operatorname{SSE}}$ .

Hereafter, we provide three different examples of estimation procedures which are the core of our study. All these methods rely on the plug-in principle.

•

Top-$\beta$ procedure. This method is the most intuitive estimator in the considered context. It is a supervised procedure, that is, based only on $\mathcal{D}_{n}$. Let us consider an estimator $\hat{p}$ of the regression function $p$. Let $\left(\hat{p}_{\sigma_{k}(X)}\right)_{k\in[K]}$ be the order statistic associated to $\hat{p}(X)$, such that for all $x\in\mathbb{R}^{d}$ we have $\hat{p}_{\sigma_{1}(x)}(x)\geq\ldots\geq\hat{p}_{\sigma_{K}(x)}(x)$. A top-$\beta$ confidence set is then defined as

$$\hat{\Gamma}_{\operatorname{top}}(x)=\left\{\sigma_{1}(x),\ldots,\sigma_{\beta}(x)\right\},\quad\forall x\in\mathbb{R}^{d}.$$

•

Supervised procedure. Formally, in this type of methods, we only care about $\mathcal{D}_{n}$ (we forget about $\mathcal{D}_{N}$ ). We split $\mathcal{D}_{n}$ into two independent samples such that $\mathcal{D}_{n}=\mathcal{D}_{\lfloor n/2\rfloor}\bigcup\mathcal{D}_{\lceil n/2\rceil}$ . Based on the first sample $\mathcal{D}_{\lfloor n/2\rfloor}$ , we consider an estimator $\hat{p}$ of the regression function $p$ . Furthermore, we define

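$$\hat{G}(\cdot)=\frac{1}{\lceil n/2\rceil}\sum_{X_{i}\in\mathcal{D}_{\lceil n/2\rceil}}\sum_{k=1}^{K}\mathbb{1}_{\{\hat{p}_{k}(X_{i})\geq\cdot\}},$$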

and one type of supervised estimator is then defined as follows

$$\hat{\Gamma}_{\operatorname{SE}}(x)=\left\{k\in[K]\,:\,\hat{p}_{k}(x)\geq\hat{G}^{-1}(\beta)\right\},\quad\forall x\in\mathbb{R}^{d}.$$

Interestingly, conditional on the data sample $\mathcal{D}_{\lfloor n/2\rfloor}$, the definition of the estimator $\hat{G}$ does not involve the labels associated to $\mathcal{D}_{\lceil n/2\rceil}$. As a consequence, we can naturally consider a semi-supervised version of this estimator.

•

Semi-supervised procedure. Based on $\mathcal{D}_{n}$ , we consider an estimator $\hat{p}$ of the regression function $p$ . Furthermore, we define

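$$\hat{G}(\cdot)=\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\sum_{k=1}^{K}\mathbb{1}_{\{\hat{p}_{k}(X_{i})\geq\cdot\}},$$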

and one type of semi-supervised estimator is then defined as follows

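$$\hat{\Gamma}_{\operatorname{SSE}}(x)=\left\{k\in[K]\,:\,\hat{p}_{k}(x)\geq\hat{G}^{-1}(\beta)\right\},\quad\forall x\in\mathbb{R}^{d}.$$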

One can note that these procedures are based on a preliminary estimator of $p$ built from $\mathcal{D}_{n}$ , that is, all of them are plug-in type procedures. However, these procedures differ by the construction of the output set. The top- $\beta$ procedure and the supervised procedure rely only on the labeled data while the semi-supervised estimator takes advantage of the information provided by the unlabeled data. The top- $\beta$ procedure is the simplest among them, it naturally satisfies $|\hat{\Gamma}(x)|=\beta$ for all $x\in\mathbb{R}^{d}$ . At the same time, the others are more involved and can have different cardinals for different values of $x\in\mathbb{R}^{d}$ . Nevertheless, for the other two procedures one can guarantee $\operatorname{I}(\hat{\Gamma})\approx\beta$ .

These examples give rise to natural questions, which form the core of our theoretical study and which are summarized below.

1. The first question is the statistical performance of these plug-in procedures, which is assessed through rates of convergence and their optimality in the minimax sense.

2. The second question focuses on the benefit of the semi-supervised approach. Roughly speaking, are there situations where the semi-supervised approach outperforms the supervised one and how can it be quantified?

3. The third question concentrates on the reason why it is more relevant for this problem to consider more involved estimators than the simple top-$\beta$ method.
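The three plug-in procedures of this subsection (top-$\beta$, supervised with sample splitting, and semi-supervised) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel estimator `fit_kernel`, the one-dimensional toy data, and all function names are hypothetical stand-ins, and any estimator $\hat{p}$ of the regression function could be plugged in instead.

```python
import math
import random


def fit_kernel(labeled, K, bandwidth=0.2):
    """Hypothetical stand-in for p_hat: Gaussian-kernel class frequencies (1-d features)."""
    def p_hat(x):
        weights = [math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi, _ in labeled]
        total = sum(weights)
        return [sum(w for w, (_, yi) in zip(weights, labeled) if yi == k) / total
                for k in range(1, K + 1)]
    return p_hat


def top_beta_set(p_hat_x, beta):
    """Labels (1-based) of the beta largest estimated probabilities."""
    order = sorted(range(len(p_hat_x)), key=lambda k: p_hat_x[k], reverse=True)
    return {k + 1 for k in order[:beta]}


def estimated_threshold(p_hat, features, beta):
    """inf{t : G_hat(t) <= beta} with G_hat(t) = mean_i sum_k 1{p_hat_k(X_i) >= t}."""
    scores = sorted((s for x in features for s in p_hat(x)), reverse=True)
    budget = beta * len(features)  # G_hat(t) <= beta iff at most `budget` scores are >= t
    return scores[budget] if budget < len(scores) else 0.0


def threshold_set(p_hat_x, threshold):
    return {k + 1 for k, s in enumerate(p_hat_x) if s >= threshold}


def supervised_set(labeled, beta, K, fit=fit_kernel):
    half = len(labeled) // 2
    p_hat = fit(labeled[:half], K)               # first half: estimate p
    features = [x for x, _ in labeled[half:]]    # second half: labels are not used
    t = estimated_threshold(p_hat, features, beta)
    return lambda x: threshold_set(p_hat(x), t)


def semi_supervised_set(labeled, unlabeled, beta, K, fit=fit_kernel):
    p_hat = fit(labeled, K)                      # all labeled data: estimate p
    t = estimated_threshold(p_hat, unlabeled, beta)  # unlabeled data: estimate threshold
    return lambda x: threshold_set(p_hat(x), t)


# Hypothetical toy model: K = 3 classes centered at 0.2, 0.5, 0.8 on [0, 1].
rng = random.Random(0)
K, beta, n, N = 3, 2, 200, 2000
centers = [0.2, 0.5, 0.8]


def draw_label(x):
    weights = [math.exp(-0.5 * ((x - c) / 0.15) ** 2) for c in centers]
    return rng.choices([1, 2, 3], weights=weights)[0]


labeled = [(x, draw_label(x)) for x in (rng.random() for _ in range(n))]
unlabeled = [rng.random() for _ in range(N)]

gamma_se = supervised_set(labeled, beta, K)
gamma_sse = semi_supervised_set(labeled, unlabeled, beta, K)
p_hat_full = fit_kernel(labeled, K)
gamma_top = lambda x: top_beta_set(p_hat_full(x), beta)

avg_size_sse = sum(len(gamma_sse(x)) for x in unlabeled) / N
```

The design point to notice is that only `estimated_threshold` differs between the supervised and semi-supervised variants: the former sacrifices half of the labeled sample to calibrate the threshold, whereas the latter keeps all labels for $\hat{p}$ and calibrates on the unlabeled features. The top-$\beta$ set always has exactly $\beta$ labels, while the thresholded sets have varying sizes whose average over the calibration sample stays close to $\beta$.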

1.3 Minimax estimation

For a given family $\mathcal{P}$ of joint distributions on $\mathbb{R}^{d}\times[K]$ , a given estimator $\hat{\Gamma}\in\hat{\Upsilon}$ , and fixed integers $K\geq 2$ , $\beta\in[K]$ , $n,N\in\mathbb{N}$ we are interested in the following maximal risks

[TABLE]

where $\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}$ denotes the expectation w.r.t. $\mathbb{P}^{\otimes n}\otimes\mathbb{P}_{X}^{\otimes N}$. These maximal risks arise naturally in the context of confidence set estimation with controlled expected size. The risk $\Psi^{\operatorname{H}}_{n,N}(\hat{\Gamma};\mathcal{P})$ corresponds to the estimation of the $\beta$-Oracle through the Hamming distance. The second risk, $\Psi^{\operatorname{E}}_{n,N}(\hat{\Gamma};\mathcal{P})$, is directly connected with Proposition 1.3, which describes the $\beta$-Oracle as a minimizer of $\operatorname{R}_{\beta}(\cdot)$. As the goal in this problem is to construct a procedure $\hat{\Gamma}$ that exhibits a low error $\operatorname{P}(\hat{\Gamma})$ and a low cardinal discrepancy $\lvert\beta-\operatorname{I}(\hat{\Gamma})\rvert$, it is natural to consider $\Psi^{\operatorname{D}}_{n,N}(\hat{\Gamma};\mathcal{P})$, which is composed of both.

Finally, we are in position to define the notion of the minimax rate. The minimax rate in this context is not only determined by the family of distributions $\mathcal{P}$ but also by the family of estimators $\hat{{\bf\Gamma}}\subset\hat{\Upsilon}$ that we consider.

Definition 1.5 (Minimax rate of convergence).

For a given family $\mathcal{P}$ of joint distributions on $\mathbb{R}^{d}\times[K]$ and a given family of estimators $\hat{{\bf\Gamma}}\subset\hat{\Upsilon}$ the minimax rates are defined as

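$$\Psi^{\square}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})\vcentcolon=\inf_{\hat{\Gamma}\in\hat{{\bf\Gamma}}}\Psi^{\square}_{n,N}(\hat{\Gamma};\mathcal{P}),$$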

where $\square$ is $\operatorname{H}$ , $\operatorname{E}$ or $\operatorname{D}$ .

The main families of estimators that we study are the supervised $\hat{\Upsilon}_{\operatorname{SE}}$ and the semi-supervised $\hat{\Upsilon}_{\operatorname{SSE}}$ estimators. Obviously, since $\hat{\Upsilon}=\hat{\Upsilon}_{\operatorname{SE}}\bigcup\hat{\Upsilon}_{\operatorname{SSE}}$ and $\hat{\Upsilon}_{\operatorname{SE}}\bigcap\hat{\Upsilon}_{\operatorname{SSE}}=\emptyset$ , we have the following relation

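$$\Psi^{\square}_{n,N}(\hat{\Upsilon};\mathcal{P})=\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}};\mathcal{P})\wedge\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}};\mathcal{P}).$$

As a consequence, lower and upper bounds on $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}};\mathcal{P})$ and $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}};\mathcal{P})$ yield bounds on the minimax rate over all estimators.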

1.4 Related works

The confidence set approach to classification was pioneered by Vovk (2002a, b); Vovk, Gammerman and Shafer (2005) by means of conformal prediction theory. They rely on non-conformity measures, which are based on some pattern recognition methods, and develop an asymptotic theory. In this work, we consider a statistical perspective of confidence set classification and put our focus on non-asymptotic minimax theory.

The problem of confidence set multi-class classification has strong ties with binary classification with reject option, also known as binary classification with abstention in the machine learning literature. In binary classification with rejection, a classifier is allowed to output a special symbol, which indicates the rejection. Such classifiers can be seen as confidence sets that are allowed to output $\emptyset$ or $\{0,1\}$, both of which are interpreted as a rejection. This line of research was initiated by Chow (1957, 1970) in the context of information retrieval, where a predefined cost of rejection was considered. An extensive statistical study of this framework was carried out in (Herbei and Wegkamp, 2006; Bartlett and Wegkamp, 2008; Wegkamp and Yuan, 2011).

Instead of considering a fixed cost for rejection, which might be too restrictive, one may define two entities: the probability of rejection and the probability of misclassification. In the spirit of conformal prediction, Lei (2014) aims at minimizing the probability of rejection provided a fixed upper bound on the probability of misclassification. In contrast, Denis and Hebiri (2015) consider the reversed problem of minimizing the probability of misclassification given a fixed upper bound on the probability of rejection.

Once the multi-class classification is considered, there are several possible ways to extend the binary case: the confidence set approach and the rejection approach. The reject counterpart is a more studied and known version, though it lacks statistical analysis. To the best of our knowledge the only work which provides statistical guarantees is (Ramaswamy, Tewari and Agarwal, 2018).

As for the confidence set approach there are again two possibilities, similar to the binary case. The one that is considered in this work was proposed by Denis and Hebiri (2017), where the authors analyse an ERM algorithm and derive oracle inequalities under the margin assumption (Tsybakov, 2004). More specifically, they consider a convex surrogate of the error $\operatorname{P}(\cdot)$ which relies on a convex real valued loss function $\phi$ . For a suitable choice of the convex function $\phi$ they show that, under Assumption 1.1, their $\beta$ -Oracle satisfies

$$\Gamma^{*}_{\beta}(\cdot)=\left\{k\in[K]\,:\,f^{*}_{k}(\cdot)\geq G_{f^{*}}^{-1}(\beta)\right\},$$

where the function $f^{*}$ depends on $\phi$ and the value of $G_{f^{*}}^{-1}(\beta)$ is defined similarly to the present manuscript. They propose a two step estimation procedure of the $\beta$ -Oracle set. Based on the ERM algorithm, they first estimate $f^{*}$ and in the second step, they estimate the threshold $G_{f^{*}}^{-1}(\beta)$ with an unlabeled sample. This procedure is in the same spirit as the semi-supervised procedure (1.5). Under mild assumptions, they provide an upper bound on the excess risk and obtain a rate of convergence of order $({n}/{\log n})^{-\alpha/(\alpha+s)}+{N}^{-1/2}$ , with $s$ being a parameter that depends on the function $\phi$ and $\alpha$ being the margin parameter. Note that this rate is slower than the rate obtained in the standard classification framework.

The conformal prediction theory (Vovk, Gammerman and Shafer, 2005) suggests minimizing the information level under a fixed budget on the error level. Statistical properties of this framework were considered in the work of Sadinle, Lei and Wasserman (2018). Their objective is formulated for some $a\in(0,1)$ as

$$\Gamma^{*}_{a}\in\arg\min\left\{\operatorname{I}(\Gamma)\,:\,\Gamma\in\Upsilon\ \text{s.t.}\ \operatorname{P}(\Gamma)\leq a\right\},$$

and such a confidence set is called a least ambiguous confidence set with bounded error rate. The authors show that under Assumption 1.1 this oracle set can be described as a thresholding of the regression function

$$\Gamma^{*}_{a}(\cdot)=\left\{k\in[K]\,:\,p_{k}(\cdot)\geq t_{a}\right\},$$

where the threshold $t_{a}$ is defined as

[TABLE]

Notice that this framework is very similar to (Denis and Hebiri, 2017) in the treatment of the Bayes optimal confidence set, as in both cases they are obtained via thresholding of the posterior distribution of the labels. Sadinle, Lei and Wasserman (2018) also proceed in two steps as here, that is, they first estimate the posterior distribution $p_{k}(\cdot)$ for all $k\in[K]$ and estimate the threshold $t_{a}$ after. However, they require the second labeled dataset for the estimator of $t_{a}$ , due to the presence of $\mathbb{P}(Y=k)$ , the marginal distribution of the labels. Besides, their theoretical analysis is carried out under a different set of assumptions on the joint distribution $\mathbb{P}$ . Apart from the standard margin assumption, they require a so-called detectability, that is, they require that the upper bound in the margin assumption is tight. Under these assumptions they provide an upper bound on the Hamming excess risk and obtain a rate of convergence of order $\mathcal{O}((n/\log n)^{-1/2})$ .

Interestingly, both approaches can be encompassed into the constrained estimation framework (Anbar, 1977; Lepskii, 1990; Brown and Low, 1996), where one would like to construct an estimator with some prescribed properties. These properties are typically reflected by the form of the risk which in our case is the discrepancy measure, that is, the sum of error and information discrepancies. Thus, both frameworks of Sadinle, Lei and Wasserman (2018); Denis and Hebiri (2017) can be seen as an extension of the constrained estimation to the classification problems. From the modeling point of view, we believe that the two frameworks can co-exist nicely and a particular choice depends on the considered application. The major difference between the present work and those by Denis and Hebiri (2017) and Sadinle, Lei and Wasserman (2018) is the minimax analysis which we provide here and our treatment of semi-supervised techniques.

As already pointed out, the confidence set estimation problem is closely related to the level set estimation setup (Hartigan, 1987; Polonik, 1995; Tsybakov, 1997; Rigollet and Vert, 2009). This problem focuses on the estimation of a level set defined as

$$\Gamma_{p}(\lambda)=\left\{x\in\mathbb{R}^{d}\,:\,p(x)\geq\lambda\right\},$$

where $p$ is the density of the observations and $\lambda>0$ is some fixed value. Given a sample $X_{1},\ldots,X_{n}$ distributed according to the density $p$, the goal is to estimate $\Gamma_{p}(\lambda)$. In (Rigollet and Vert, 2009), the authors study plug-in density level set estimators through the measure of symmetric differences and the excess mass. In confidence set estimation the measure of symmetric differences is the Hamming risk whereas the excess mass is the excess risk. They show that kernel based estimators are optimal in the minimax sense over a Hölder class of densities and under a margin type assumption (Polonik, 1995; Tsybakov, 2004). In particular, they derive fast rates of convergence, that is, faster than $n^{-1/2}$, for the excess mass. In the level set estimation problem, the threshold $\lambda$ is chosen beforehand, whereas in our work the threshold $G^{-1}(\beta)$ depends on the distribution of the data, which makes the statistical analysis more difficult.

On the other part, the confidence set estimation problem is directly related to the standard classification settings. This problem has been widely studied from a theoretical point of view in the binary classification framework. Audibert and Tsybakov (2007) study the statistical performance of plug-in classification rules under assumptions which involve the smoothness of the regression function and the margin condition. In particular, they derive fast rates of convergence for plug-in classifiers based on local polynomial estimators (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007) and show their optimality in the minimax sense. One of the aim of present work is to extend these results to the confidence set classification framework.

Another part of our work is to provide a comparison between supervised and semi-supervised procedures. Semi-supervised methods are studied in several papers (Vapnik, 1998; Rigollet, 2007; Singh, Nowak and Zhu, 2009; Bellec et al., 2018) and references therein. A simple intuition can be provided on whether or not one should expect a superior performance of the semi-supervised approach. Imagine a situation where the unlabeled sample $\mathcal{D}_{N}$ is so large that one can approximate $\mathbb{P}_{X}$ up to any desired precision. Then, if the optimal decision is independent of $\mathbb{P}_{X}$, the semi-supervised estimators should not be expected to outperform the supervised ones. This is the case in many classical problems of statistics, where the inference is solely governed by the behavior of the conditional distribution $\mathbb{P}_{Y|X}$ (for instance regression or binary classification). The situation might be different once the optimal decision relies on the marginal distribution $\mathbb{P}_{X}$. In this case, as suggested by our findings, the semi-supervised approach might or might not outperform the supervised one, even in the context of the same problem. Similar conclusions were stated by Singh, Nowak and Zhu (2009) in the context of learning under the cluster assumption (Rigollet, 2007).

1.5 Main contributions

Below we summarize our contributions.

•

Our results focus on the case where the regression function $p$ belongs to a Hölder class and satisfies the margin condition. Under these assumptions, we establish lower bounds on the minimax rates, defined in Section 1.3, in the confidence set framework.

•

As important consequences of our results, we first show that top-$\beta$ type procedures are in general inconsistent. Furthermore, by providing a rigorous definition of the semi-supervised and supervised estimators, we describe the situations in which the semi-supervised estimation should be considered superior to its supervised counterpart. Interestingly, our analysis suggests that these regimes are governed by the interplay of the family of distributions and the considered measure of performance. Besides, we show that in our settings supervised procedures cannot achieve fast rates, that is, their rates cannot be faster than $n^{-1/2}$. In contrast, some other classical settings (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009; Herbei and Wegkamp, 2006) allow supervised methods to achieve faster rates.

•

We provide supervised and semi-supervised estimation procedures, which are optimal or optimal up to an extra logarithmic factor. Importantly, our results show that a semi-supervised plug-in procedure based on local polynomial estimators can achieve fast rates, provided that the size of the unlabeled sample is large enough.

•

Finally, we perform a numerical evaluation of the proposed plug-in algorithms against the top- $\beta$ counterparts. This part supports our theoretical results and empirically demonstrates the reason to consider more involved procedures.

1.6 Organization of the paper

The paper is organized as follows. In Section 2, we introduce some additional notation and the family of distributions $\mathcal{P}$ that we consider. Section 3 is devoted to the lower bounds on the minimax rates and their implications. In Section 4 we introduce the proposed algorithm, establish upper bounds for it, and evaluate its numerical performance. We conclude this paper with Sections 5 and 6, where we discuss and sum up our results.

2 Class of confidence sets

First let us introduce some generic notation that is used throughout this work. For two numbers $a,a^{\prime}\in\mathbb{R}$ we denote by $a\vee a^{\prime}$ (resp. $a\wedge a^{\prime}$) the maximum (resp. minimum) of $a$ and $a^{\prime}$. For a positive real number $a$ we denote by $\lfloor a\rfloor$ (resp. $\lceil a\rceil$) the largest (resp. the smallest) non-negative integer that is less than or equal (resp. greater than or equal) to $a$. The standard Euclidean norm of a vector $x\in\mathbb{R}^{d}$ is denoted by $\left\lVert x\right\rVert$ and the standard Lebesgue measure is denoted by $\operatorname{Leb}(\cdot)$. A Euclidean ball centered at $x\in\mathbb{R}^{d}$ of radius $r>0$ is denoted by $\mathcal{B}(x,r)$. For an arbitrary Borel measure $\mu$ on $\mathbb{R}^{d}$ that is absolutely continuous w.r.t. the Lebesgue measure we denote by $\operatorname{supp}(\mu)$ its support, that is, the set where the Radon-Nikodym derivative of $\mu$ w.r.t. $\operatorname{Leb}$ is strictly positive. For a vector function $p:\mathbb{R}^{d}\rightarrow\mathbb{R}^{K}$ and a Borel measure $\mu$ on $\mathbb{R}^{d}$ we define the infinity norm of $p$ as

[TABLE]

In this work $C$ or its lower-cased versions always refer to some constants which might differ from line to line. Importantly, all these constants are independent of $n,N$ but could depend on $K,d$ and other parameters which are assumed to be fixed. Before introducing the families of distributions $\mathcal{P}$ that are considered in this work we need the following definitions.

Assumption 2.1 ( $\alpha$ -margin assumption).

We say that the distribution $\mathbb{P}$ of the pair $(X,Y)\in\mathbb{R}^{d}\times[K]$ satisfies the $\alpha$-margin assumption if there exist $C_{1}>0$ and $t_{0}\in(0,1)$ such that for every positive $t\leq t_{0}$

[TABLE]

Let us point out an important consequence of Assumption 1.1. We have that the condition

[TABLE]

for all $t\in[0,t_{0}]$ is equivalent to Assumption 2.1. Indeed, under Assumption 1.1 the random variables $p_{k}(X)$ cannot concentrate at a constant level, in particular at $G^{-1}(\beta)$. Moreover, again due to the continuity Assumption 1.1 we have

[TABLE]

thus the $\alpha$ -margin Assumption 2.1 specifies the rate of this convergence. Finally, the restriction of the range of $t$ to $[0,t_{0}]$ in $\alpha$ -margin Assumption 2.1 does not affect its global behavior as for all $t\in[0,1]$

[TABLE]

Let $c_{0}$ and $r_{0}$ be two positive constants. We say that a Borel set $A\subset\mathbb{R}^{d}$ is a $(c_{0},r_{0})$ -regular set if

[TABLE]

Definition 2.2 (Strong density).

We say that the probability measure $\mathbb{P}_{X}$ on $\mathbb{R}^{d}$ satisfies the $(\mu_{\min},\mu_{\max},c_{0},r_{0})$ -strong density assumption if it is supported on a compact $(c_{0},r_{0})$ -regular set $A\subset\mathbb{R}^{d}$ and has a density $\mu$ w.r.t. the Lebesgue measure such that $\mu(x)=0$ for all $x\in\mathbb{R}^{d}\setminus A$ and

[TABLE]

Definition 2.3 (Hölder class, Tsybakov (2008)).

We say that a function $h:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $(\gamma,L)$ -Hölder for $\gamma>0$ and $L>0$ if $h$ is $\lfloor\gamma\rfloor$ times continuously differentiable and $\forall x,x^{\prime}\in\mathbb{R}^{d}$ we have

[TABLE]

where $h_{x}(\cdot)$ is the Taylor polynomial of degree $\lfloor\gamma\rfloor$ of $h(\cdot)$ at the point $x\in\mathbb{R}^{d}$ . Consequently, the set of all functions from $\mathbb{R}^{d}$ to $\mathbb{R}$ satisfying the above conditions is called $(\gamma,L,\mathbb{R}^{d})$ -Hölder and is denoted by $\mathcal{H}(\gamma,L,\mathbb{R}^{d})$ .
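As a worked special case, assume that the displayed condition of Definition 2.3 takes the usual form $\lvert h(x^{\prime})-h_{x}(x^{\prime})\rvert\leq L\left\lVert x-x^{\prime}\right\rVert^{\gamma}$ (this form is an assumption made here for illustration; the definition above is the reference). With the convention on $\lfloor\cdot\rfloor$ from the beginning of this section, the two lowest smoothness regimes then read as follows.

$$
\begin{aligned}
\gamma\in(0,1):\quad & \lfloor\gamma\rfloor=0,\quad h_{x}(x^{\prime})=h(x),\quad \lvert h(x^{\prime})-h(x)\rvert\leq L\left\lVert x-x^{\prime}\right\rVert^{\gamma},\\
\gamma\in[1,2):\quad & \lfloor\gamma\rfloor=1,\quad h_{x}(x^{\prime})=h(x)+\langle\nabla h(x),x^{\prime}-x\rangle,\quad \lvert h(x^{\prime})-h(x)-\langle\nabla h(x),x^{\prime}-x\rangle\rvert\leq L\left\lVert x-x^{\prime}\right\rVert^{\gamma}.
\end{aligned}
$$

In the first regime the condition is the familiar Hölder continuity of $h$; in the second it controls the error of the first-order Taylor expansion.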

Definition 2.4.

We denote by $\mathcal{P}(L,\gamma,\alpha)$ a set of joint distributions on $\mathbb{R}^{d}\times[K]$ which satisfies the following conditions

•

the marginal $\mathbb{P}_{X}$ satisfies the $(\mu_{\min},\mu_{\max},c_{0},r_{0})$ -strong density,

•

for all $k\in[K]$ the $k^{\text{th}}$ regression function $p_{k}(\cdot)=\mathbb{P}(Y=k|X=\cdot)$ belongs to the $(\gamma,L,\mathbb{R}^{d})$ -Hölder class, that is $p_{k}\in\mathcal{H}(\gamma,L,\mathbb{R}^{d})$ for all $k\in[K]$ ,

•

for all $k\in[K]$ the regression function $p_{k}$ satisfies the $(C_{1},\alpha,\beta)$-margin assumption,

•

for all $k\in[K]$ , the cumulative distribution function $F_{p_{k}}$ of $p_{k}(X)$ is continuous.

The family of distributions $\mathcal{P}(L,\gamma,\alpha)$ is similar to the one considered in (Audibert and Tsybakov, 2007) in the context of binary classification. The only major difference is the continuity Assumption 1.1, which prevents a straightforward reuse of their construction for lower bounds.

3 Lower bounds

The main results in the present work are the lower bounds we provide in this section. In particular, we establish in Section 3.1 the inconsistency of top- $\beta$ procedures (see Eq. (1.3) for a definition of the method). Therefore more elaborate methods are required in this framework. As pointed out in the introduction, we distinguish two types of estimators: supervised and semi-supervised for which we provide lower bounds in Section 3.2. The obtained rates highlight the benefit of the semi-supervised approach in the context of the confidence set classification.

Before considering the lower bounds, let us first establish connections between the different minimax rates. These links are used in the proofs of the lower bounds.

Proposition 3.1.

Let $\Gamma$ be a measurable function from $\mathbb{R}^{d}$ to $2^{[K]}$, $\beta\in[K]$ and assume that Assumption 1.1 is fulfilled, then

[TABLE]

Furthermore, if additionally Assumption 2.1 is satisfied with $\alpha>0$, then there exists $C>0$ which depends only on $K,\alpha,C_{1}$ such that for any pair of confidence set classifiers $\Gamma,\Gamma^{\prime}$ it holds that

[TABLE]

Proposition 3.2.

For any $K\geq 2$ , $\beta\in[K]$ and $n,N\in\mathbb{N}$ the following relation between minimax rates holds:

[TABLE]

Proposition 3.1, and in particular Eq. (3.1), gives an easy way to establish a lower bound on $\Psi^{\operatorname{E}}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})$ via a lower bound on the Hamming distance $\Psi^{\operatorname{H}}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})$. However, this approach does not yield the $(N+n)^{-1/2}$ (resp. $n^{-1/2}$) part of the rate in the lower bound of $\Psi^{\operatorname{E}}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}},\mathcal{P})$ (resp. $\Psi^{\operatorname{E}}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}},\mathcal{P})$). Besides, Proposition 3.2 allows us to prove a lower bound on the discrepancy $\Psi^{\operatorname{D}}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})$ with the correct rate via the lower bound on the excess risk $\Psi^{\operatorname{E}}_{n,N}(\hat{{\bf\Gamma}};\mathcal{P})$.

3.1 Inconsistency of the top- $\beta$ procedure

Before stating our results on the supervised and the semi-supervised estimators, we discuss another interesting class of confidence sets, which might be a natural choice at first sight. We consider estimators which consist of $\beta$ classes at every point $x\in\mathbb{R}^{d}$, since such estimators naturally satisfy $\operatorname{I}{(\hat{\Gamma})}=\beta$. Let us denote by $\hat{\Upsilon}_{\beta}$ the set of all estimators $\hat{\Gamma}$ such that $|{{\hat{\Gamma}(x)}}|=\beta$ for all $x\in\mathbb{R}^{d}$, that is,

[TABLE]

Despite an obvious restriction on the cardinality of the confidence sets, the family of estimators $\hat{\Upsilon}_{\beta}$ is rather broad. Indeed, every procedure which estimates the regression functions $p_{k}(\cdot)$ and outputs the classes with the top $\beta$ scores is included in $\hat{\Upsilon}_{\beta}$. The nature of the estimator can also be different, that is, the estimates could be based on ERM, non-parametric or parametric approaches. Clearly, the family $\hat{\Upsilon}_{\beta}$ is included neither in $\hat{\Upsilon}_{\operatorname{SE}}$ nor in $\hat{\Upsilon}_{\operatorname{SSE}}$ and has a non-trivial intersection with both. The next result states that there is no uniformly consistent estimator $\hat{\Gamma}\in\hat{\Upsilon}_{\beta}$ over the family of distributions $\mathcal{P}(L,\gamma,\alpha)$.

Proposition 3.3.

Assume that $K\geq 4$ , $\beta\in[\lfloor K/2\rfloor-1]$ and $\beta\geq 2$ , then for all $n,N\in\mathbb{N}$ we have

[TABLE]

The proof builds an explicit construction of a distribution $\mathbb{P}$ whose $\beta$ -Oracle satisfies $\lvert\Gamma^{*}_{\beta}(x)\rvert>\beta$ for all $x$ in some $A\subset\mathbb{R}^{d}$ with $\mathbb{P}_{X}(A)>0$ . Indeed, if such a distribution exists then there is no estimator in $\hat{\Upsilon}_{\beta}$ that would consistently estimate this $\beta$ -Oracle. The negative result established in Proposition 3.3 is rather instructive by itself as it advocates that a more involved estimation procedure ought to be constructed.
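The mechanism behind this negative result can be illustrated with a toy computation. The sketch below is not the construction used in the proof: the scores, the threshold and the helper names `top_beta_set` and `threshold_set` are hypothetical, and the numbers are not meant to satisfy the assumptions of Proposition 3.3. It only shows that a thresholded set can have average size $\beta$ while containing more than $\beta$ classes at some points, which no member of $\hat{\Upsilon}_{\beta}$ can reproduce.

```python
def top_beta_set(scores, beta):
    """Indices of the beta largest scores (ties broken by lowest index)."""
    order = sorted(range(len(scores)), key=lambda k: (-scores[k], k))
    return set(order[:beta])


def threshold_set(scores, threshold):
    """Indices of all classes whose score reaches the threshold."""
    return {k for k, s in enumerate(scores) if s >= threshold}


# Hypothetical conditional probabilities p_k(x) at two equally likely points, K = 4.
p_at_x1 = [0.30, 0.30, 0.30, 0.10]  # three classes are hard to separate
p_at_x2 = [0.85, 0.05, 0.05, 0.05]  # one class clearly dominates
threshold = 0.25                     # hypothetical common threshold

# Set sizes at each point for the thresholded rule and for the top-2 rule.
sizes_threshold = [len(threshold_set(p, threshold)) for p in (p_at_x1, p_at_x2)]
sizes_top2 = [len(top_beta_set(p, 2)) for p in (p_at_x1, p_at_x2)]
```

Both rules have average size two over the two points, but the thresholded rule spends three labels on the ambiguous point and one on the easy point.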

3.2 Supervised vs semi-supervised estimation

Clearly, estimators which achieve the infimum in the minimax rates are either supervised or semi-supervised, thus a lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}};\mathcal{P})$ together with a lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}};\mathcal{P})$ yield a lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon};\mathcal{P})$ . However, a lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon};\mathcal{P})$ does not discriminate between the supervised and the semi-supervised estimators.

Theorem 3.4 (Supervised estimation).

Let $K\geq 3$, $\beta\in[\lfloor K/2\rfloor-1]$. If $2\alpha\lceil\frac{\gamma}{2}\rceil\leq d$, then there exist constants $c,c^{\prime},c^{\prime\prime}>0$ such that for all $n,N\in\mathbb{N}$

[TABLE]

Based on these results we observe that the lower bound for the Hamming risk $\Psi^{\operatorname{H}}_{n,N}$ is slower than those for the other risks. It is even more significant that the best rate that a supervised estimator can achieve for all of the risks is $n^{-1/2}$, even if the margin assumption holds. This is the major difference with the classical settings where the value of the threshold is known (such as classification and level set estimation). Indeed, under the same assumptions on the family of distributions, besides the continuity Assumption 1.1, the minimax rate in those frameworks is $n^{-{(1+\alpha)\gamma}/{(2\gamma+d)}}$, as proved for instance in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009). The next theorem deals with semi-supervised procedures and displays a different behavior.

Theorem 3.5 (Semi-supervised estimation).

Let $K\geq 3$ , $\beta\in[\lfloor K/2\rfloor-1]$ . If $2\alpha\lceil\frac{\gamma}{2}\rceil\leq d$ , then there exist constants $c,c^{\prime},c^{\prime\prime}>0$ such that for all $n,N\in\mathbb{N}$

[TABLE]

First, observe that the lower bound for the Hamming distance is, as in the supervised setting, worse than for the other measures of performance. However there is a major difference with the supervised case: as compared to Theorem 3.4, it is possible for a semi-supervised estimator to achieve rates that are faster than $n^{-1/2}$ if the size of the unlabeled dataset $N\in\mathbb{N}$ is large enough. In particular, when we consider $\Psi^{\operatorname{E}}_{n,N}$ or $\Psi^{\operatorname{D}}_{n,N}$ the following relations are necessary to get fast rates

[TABLE]

In this case, we recover the same fast rates as in the classical settings of classification and level set estimation. This suggests that the lack of knowledge of the threshold $G^{-1}(\beta)$ does not alter the quality of estimation for the semi-supervised procedure, provided that $N$ is sufficiently large. The next corollary makes these observations clearer.

Corollary 3.6.

Assume that the rates in Theorem 3.5 (resp. Theorem 3.4) are minimax, that is, there exists a confidence set $\hat{\Gamma}_{\operatorname{SSE}}$ (resp. $\hat{\Gamma}_{\operatorname{SE}}$) that achieves these rates. Regarding $\Psi^{\operatorname{E}}_{n,N}$ and $\Psi^{\operatorname{D}}_{n,N}$ the following conclusions hold:

•

There is no semi-supervised estimator that achieves a faster rate than $\hat{\Gamma}_{\operatorname{SE}}$ if:

[TABLE]

•

The rate of $\hat{\Gamma}_{\operatorname{SSE}}$ is faster than the rate of any supervised estimator if:

[TABLE]

Moreover, if there exists $\rho>0$ such that $n^{1+\rho}=o(N)$ , then the rate of $\hat{\Gamma}_{\operatorname{SSE}}$ is polynomially faster than $n^{-1/2}$ .

•

The rate of $\hat{\Gamma}_{\operatorname{SSE}}$ is fast similarly to the classical frameworks if

[TABLE]

Clearly, a similar observation holds for the Hamming risk $\Psi^{\operatorname{H}}_{n,N}$; however, the regime in which improvement is possible thanks to semi-supervised approaches is narrower, since $n^{-{(1+\alpha)\gamma}/{(2\gamma+d)}}=o\left(n^{-{\alpha\gamma}/{(2\gamma+d)}}\right)$. We summarize Corollary 3.6 in Table 1.
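To make the comparison of exponents concrete, the following minimal sketch evaluates the exponents $(1+\alpha)\gamma/(2\gamma+d)$ and $\alpha\gamma/(2\gamma+d)$ appearing above and checks when the first one exceeds $1/2$. A direct computation shows that $(1+\alpha)\gamma/(2\gamma+d)>1/2$ is equivalent to $2\alpha\gamma>d$. The function names and the parameter values are hypothetical and chosen only for illustration.

```python
def classical_exponent(alpha, gamma, d):
    """Exponent r of the rate n^{-r} with r = (1 + alpha) * gamma / (2 * gamma + d)."""
    return (1 + alpha) * gamma / (2 * gamma + d)


def hamming_exponent(alpha, gamma, d):
    """Exponent r of the rate n^{-r} with r = alpha * gamma / (2 * gamma + d)."""
    return alpha * gamma / (2 * gamma + d)


def is_fast(alpha, gamma, d):
    """True when n^{-(1+alpha)gamma/(2gamma+d)} decays faster than n^{-1/2}."""
    return classical_exponent(alpha, gamma, d) > 0.5


# Hypothetical parameters: alpha = 1, gamma = 2, d = 3 gives exponent 4/7 > 1/2.
fast_case = is_fast(1, 2, 3)
# Hypothetical parameters: alpha = 0.5, gamma = 1, d = 2 gives exponent 3/8 < 1/2.
slow_case = is_fast(0.5, 1, 2)
```

With $\alpha=1$, $\gamma=2$, $d=3$ the condition $2\alpha\lceil\gamma/2\rceil\leq d$ of the lower bounds holds and the exponent $4/7$ exceeds $1/2$, whereas the Hamming exponent is only $2/7$.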

Essentially, the above results suggest that the advantage of the semi-supervised approaches over the supervised ones depends not only on the underlying family of distributions $\mathcal{P}$ but also on the metric that is considered. Yet, necessary and sufficient conditions that must be imposed in general on the problem and the metric so that semi-supervised estimation provably improves upon supervised estimation remain an open problem.

A final remark before going further concerns the assumption on the parameters $\alpha$ and $\gamma$. The condition $2\alpha\lceil\frac{\gamma}{2}\rceil\leq d$ in the lower bounds is slightly more restrictive than the condition given in (Audibert and Tsybakov, 2007) (they have $\alpha\gamma\leq d$). We believe that this is an artifact of our proof and could be avoided with a finer choice of hypotheses. Simple modifications of the lower bound of Audibert and Tsybakov (2007) do not work in our settings because their hypotheses do not satisfy Assumption 1.1. In contrast, the construction of Rigollet and Vert (2009), modified properly to fit the classification framework, satisfies Assumption 1.1, but their lower bound is limited by the condition $\alpha\gamma\leq 1$, that is, it does not cover the fast rates as long as the dimension $d>2$.

3.3 Sketch of the proof

In order to prove the lower bounds of Theorems 3.4 and 3.5 we actually prove two separate lower bounds on the minimax rates. These two lower bounds are naturally connected with the proposed two-step estimator in Eq. (1.5), that is, the first lower bound is connected with the problem of non-parametric estimation of $p_{k}$ for all $k\in[K]$ and the second describes the estimation of the unknown threshold $G^{-1}(\beta)$.

In particular, the first lower bound is closely related to the one provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009). However, the continuity Assumption 1.1 makes the proof more involved and results in a final construction of hypotheses that differs significantly. This part of our lower bound relies on Fano’s inequality in the form of Birgé (2005). The second lower bound is based on two-hypothesis testing and is derived by constructing two different marginal distributions of $X\in\mathbb{R}^{d}$ which are sufficiently close and a fixed regression function $p(\cdot)$. Crucially, these marginal distributions admit two different values of the threshold $G^{-1}(\beta)$ and thus two different $\beta$-Oracles. In this part we make use of Pinsker’s inequality, see for instance (Tsybakov, 2008).

In order to discriminate between the supervised and the semi-supervised procedures we make use of Definition 1.4. Notice that, thanks to Definition 1.4, every supervised procedure is not “sensitive” to the expectation taken w.r.t. the unlabeled dataset $\mathcal{D}_{N}$, that is, randomness is only induced by the labeled dataset $\mathcal{D}_{n}$. This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset $\mathcal{D}_{N}$ for supervised procedures. Informally, the lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SE}};\mathcal{P})$ is obtained from the lower bound on $\Psi^{\square}_{n,N}(\hat{\Upsilon}_{\operatorname{SSE}};\mathcal{P})$ by setting $N=0$.

4 Upper bounds

In this section, we show that we can build confidence set estimators that achieve, up to a logarithmic factor, the lower bounds stated in Theorems 3.4-3.5. In other words, those estimators are nearly optimal in the minimax sense. To come straight to the point, we delay the construction of the estimators to Section 4.1 and their properties to Section 4.2, and focus right now on their upper bounds.

Theorem 4.1 (Supervised estimation).

Let $K\in\mathbb{N}$ , $\beta\in[K-1]$ , then there exists a supervised estimator $\hat{\Gamma}_{\operatorname{SE}}\in\hat{\Upsilon}_{\operatorname{SE}}$ and a constant $C>0$ such that for all $n,N\in\mathbb{N}$ we have

[TABLE]

Theorem 4.2 (Semi-supervised estimation).

Let $K\in\mathbb{N}$ , $\beta\in[K-1]$ , then there exists a semi-supervised estimator $\hat{\Gamma}_{\operatorname{SSE}}\in\hat{\Upsilon}_{\operatorname{SSE}}$ and constants $C,C^{\prime},C^{\prime\prime}>0$ such that for all $n,N\in\mathbb{N}$ we have

[TABLE]

We show here that the lower bounds of Theorems 3.4-3.5 are achievable. In particular, in the case of Hamming risk, the upper bounds are optimal; whereas for the Excess risk and the Discrepancy, the upper bounds fit the lower bounds up to a logarithmic factor. Thus, the comments we made in Corollary 3.6 are correct. Let us mention that the presence of the logarithmic factor in these upper bounds is due to $\ell_{\infty}$ -norm estimation (see Lemma 4.5).

Hamming risk as a measure of performance was considered in the settings of Sadinle, Lei and Wasserman (2018). They also establish upper bounds for this measure, though they do not assess their optimality. Besides, as we already mentioned, Denis and Hebiri (2017) provide an upper bound on the excess risk in the context of ERM. Let us point out that a direct comparison with these two works would not be fair, as the assumptions and even the frameworks under which the results are formulated differ.

4.1 Construction of the estimators

Building estimators $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$ that reach the rates in the above upper bounds involves preliminary estimators $\hat{p}_{k}$ of the regression functions $p_{k}$, $k\in[K]$. These estimators $\hat{p}_{k}$ are constructed using an arbitrary half $\mathcal{D}_{\lfloor n/2\rfloor}$ of the labeled dataset $\mathcal{D}_{n}$ and they satisfy the following assumptions.

Assumption 4.3 (Exponential concentration).

There exist estimators $\hat{p}_{k}$ for all $k\in[K]$ based on $\mathcal{D}_{\lfloor n/2\rfloor}$ and positive constants $C_{1},C_{2}$ such that for all $k\in[K]$ and all $n\geq 2$ we have for all $\delta>0$

[TABLE]

for almost all $x\in\mathbb{R}^{d}$ w.r.t. $\mathbb{P}_{X}$.

Assumption 4.4 (Continuity of CDF).

For all $k\in[K]$ the cumulative distribution function $F_{\hat{p}_{k}}(t)\vcentcolon=\mathbb{P}_{X}(\hat{p}_{k}(X)\leq t)$ of $\hat{p}_{k}(X)$ is almost surely $\mathbb{P}^{\otimes\lfloor n/2\rfloor}$ continuous on $(0,1)$ .

First let us point out that Assumption 4.3 implies that there exists a constant $C>0$ such that for all $n\geq 2$ and all $\alpha>0$

[TABLE]

Assumption 4.3 is commonly used in the statistical community when dealing with rates of convergence in the classification settings (Audibert and Tsybakov, 2007; Lei, 2014; Sadinle, Lei and Wasserman, 2018). It is for instance satisfied by the local polynomial estimator (Stone, 1977; Tsybakov, 1986; Audibert and Tsybakov, 2007). Assumption 4.4 can always be satisfied by slightly processing any estimator $\hat{p}$. Indeed, assume that Assumption 4.4 fails to be satisfied by some estimator $\hat{p}$. It means that there exists a subset of $\mathbb{R}^{d}$ of non-zero measure such that at least one $\hat{p}_{k}$, with $k\in[K]$, is constant on this set. Then, if we add to $\hat{p}$ a deterministic continuous function of sufficiently bounded variation (it is sufficient to make sure that adding the function preserves the statistical properties of $\hat{p}$, that is, Assumption 4.3), such regions can no longer exist.

Since the threshold level $G^{-1}(\beta)$ is not known beforehand, it ought to be estimated using data. A straightforward estimator of this threshold can be constructed using the unlabeled dataset $\mathcal{D}_{N}$. To make our presentation mathematically correct we introduce the following notation $\mathcal{D}_{n}=\mathcal{D}_{\lfloor n/2\rfloor}\bigcup\mathcal{D}_{\lceil n/2\rceil}$, where $\mathcal{D}_{\lfloor n/2\rfloor}$ is the dataset used to build the estimators $\hat{p}_{k}$ for $k\in[K]$. Now, all the labels are removed from $\mathcal{D}_{\lceil n/2\rceil}$, that is, it consists of $\lceil n/2\rceil$ i.i.d. samples from $\mathbb{P}_{X}$. The supervised and semi-supervised estimators of $G(\cdot)$ are defined as

[TABLE]

respectively. Finally, we are in a position to define $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$ as

[TABLE]

for all $x\in\mathbb{R}^{d}$ . Note that $\hat{\Gamma}_{\operatorname{SE}}$ is clearly supervised in the sense of Definition 1.4, as it is independent of the unlabeled sample $\mathcal{D}_{N}$ . In contrast, $\hat{\Gamma}_{\operatorname{SSE}}$ is semi-supervised, since we can find two samples $\mathcal{D}_{N}$ and $\mathcal{D}^{\prime}_{N}$ which induce different confidence sets. To show that the estimators introduced in this section satisfy the statements of Theorems 4.1-4.2 we refine the proof technique used in (Denis and Hebiri, 2017). That is, we introduce an intermediate quantity

[TABLE]

and the associated confidence set, which we refer to as the pseudo Oracle confidence set given for all $x\in\mathbb{R}^{d}$ by

[TABLE]

The confidence set $\tilde{\Gamma}$ assumes knowledge of the marginal distribution $\mathbb{P}_{X}$ and is seen as an idealized version of both $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$. Note, however, that the pseudo Oracle $\tilde{\Gamma}$ is not an estimator.
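The two-step construction of this section can be sketched in a few lines. This is a minimal illustration under stated assumptions and not the exact estimator of the displayed equations: the scores $\hat{p}_{k}(X_{i})$ are taken as given, and the threshold is chosen, hypothetically, as the $\lceil N\beta\rceil$-th largest pooled score, so that the sets $\{k:\hat{p}_{k}(x)\geq t\}$ have average size $\beta$ on the unlabeled sample when there are no ties. The names `estimate_threshold` and `confidence_set` are introduced here for illustration only.

```python
import math


def estimate_threshold(unlabeled_scores, beta):
    """Pick t so that the sets {k : score_k >= t} have average size beta
    on the unlabeled sample (exactly so when scores are tie-free and
    beta * N is an integer)."""
    pooled = sorted((s for row in unlabeled_scores for s in row), reverse=True)
    rank = math.ceil(beta * len(unlabeled_scores))
    return pooled[rank - 1]


def confidence_set(scores, threshold):
    """Classes whose estimated score reaches the estimated threshold."""
    return {k for k, s in enumerate(scores) if s >= threshold}


# Hypothetical estimated scores p_hat_k(X_i) on N = 4 unlabeled points, K = 3.
unlabeled_scores = [
    [0.70, 0.20, 0.10],
    [0.45, 0.40, 0.15],
    [0.36, 0.34, 0.30],
    [0.90, 0.06, 0.04],
]
beta = 2
threshold = estimate_threshold(unlabeled_scores, beta)
sets = [confidence_set(row, threshold) for row in unlabeled_scores]
sizes = [len(s) for s in sets]
```

In this sketch, the supervised variant would feed `estimate_threshold` with the features of $\mathcal{D}_{\lceil n/2\rceil}$ only, while the semi-supervised variant would also use $\mathcal{D}_{N}$.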

4.2 Properties of the plug-in confidence sets

An important step of our analysis is the following lemma, that bounds the difference between $\tilde{G}^{-1}(\beta)$ and ${G}^{-1}(\beta)$ .

Lemma 4.5 (Upper bound on the thresholds).

Let Assumption 1.1 be satisfied, then for all $\beta\in[K]$

[TABLE]

The proof of Lemma 4.5 uses elementary properties of the generalized inverse functions, which are provided in the Appendix. Besides, let us mention that the difference $\lvert G^{-1}(\beta)-\tilde{G}^{-1}(\beta)\rvert$ resembles the Wasserstein infinity distance, which gives an alternative approach to prove Lemma 4.5, see (Bobkov and Ledoux, 2016). Lemma 4.5 explains the extra $\log n$ factor that appears in the upper bound, as the minimax rate of estimation in sup norm contains the $\log n$ factor, see for instance (Stone, 1982; Tsybakov, 2008). Another important property of the introduced estimators $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$ is obtained via Assumption 4.4. It describes the deviation of the information of $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$ from the desired level $\beta$.

Proposition 4.6 (Denis and Hebiri (2017)).

Let $\hat{p}_{k}$ for all $k\in[K]$ be arbitrary estimators of the regression functions constructed using $\mathcal{D}_{\lfloor n/2\rfloor}$ that satisfy Assumption 4.4, then there exist constants $C,C^{\prime}>0$ such that for all $n,N\in\mathbb{N}$ it holds that

[TABLE]

Note that if $\hat{p}_{k}$ satisfies Assumption 4.4 for all $k\in[K]$, then $\beta=\operatorname{I}(\tilde{\Gamma})$. This simple fact is a step in the proof of Proposition 4.6. Finally, the combination of Lemma 4.5, Proposition 4.6 and Assumption 4.3 with the peeling argument used in (Audibert and Tsybakov, 2007, Lemma 3.1) yields the results of Theorems 4.1-4.2.

4.3 Simulation study

The goal of this part is to numerically address the following points.

1)

Is it more advantageous to go outside of the classical multi-class classification settings and consider the confidence set framework? To answer this question we compute the Bayes optimal multi-class classifier and view it as a confidence set with one label. We compare this Bayes rule with the $\beta$-Oracle in terms of the error $\operatorname{P}(\cdot)$ using various values of $\beta\in[K]$ and $K\in\mathbb{N}$.

2)

How does the $\beta$-Oracle confidence set compare to another "Oracle" ($\operatorname{top}$-$\beta$ Oracle) which simply includes classes corresponding to the largest values of the $p_{k}(\cdot)$?

3)

Does the proposed plug-in approach indeed give a good approximation of the $\beta$-Oracle in terms of the error $\operatorname{P}(\cdot)$ and the information $\operatorname{I}(\cdot)$?

4)

Despite demonstrating the minimax inconsistency of the top- $\beta$ approach, we wonder whether in some scenarios it can achieve a comparable performance against our semi-supervised plug-in procedure.

We consider two simulation schemes depending on the parameter $K\in\{10,100\}$ . For each $K$ , we generate $(X,Y)$ according to a mixture model. More precisely,

i)

the label $Y$ follows uniform distribution on $[K]$ ;

ii)

conditional on $Y=k$, the feature $X$ is generated according to a multivariate Gaussian distribution with mean $\mu_{k}\in\mathbb{R}^{10}$ and identity covariance matrix.

For each $k\in[K]$, the vectors $\mu_{k}$ are i.i.d. realizations of the uniform distribution on $[0,4]^{10}$. For this distribution, we have

[TABLE]

where for each $k\in[K]$ , $f_{k}(X)$ is the density function of a multivariate gaussian distribution with mean parameter $\mu_{k}$ and identity covariance matrix.
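The simulation scheme above can be sketched as follows. Since the labels are uniform on $[K]$, Bayes' rule gives $p_{k}(x)=f_{k}(x)/\sum_{j}f_{j}(x)$, and the Gaussian normalizing constants cancel because all components share the identity covariance. The sketch uses only the Python standard library; the function names and the seed are hypothetical, and labels are indexed from $0$ to $K-1$ rather than from $1$ to $K$.

```python
import math
import random


def simulate_means(K, dim, rng):
    """Draw K mean vectors i.i.d. from the uniform distribution on [0, 4]^dim."""
    return [[rng.uniform(0.0, 4.0) for _ in range(dim)] for _ in range(K)]


def sample_pair(means, rng):
    """Draw Y uniformly over the classes, then X ~ N(mu_Y, identity)."""
    k = rng.randrange(len(means))
    x = [rng.gauss(m, 1.0) for m in means[k]]
    return x, k


def posterior(x, means):
    """p_k(x) = f_k(x) / sum_j f_j(x) under a uniform prior on the labels."""
    log_w = [-0.5 * sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for mu in means]
    shift = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


rng = random.Random(0)  # hypothetical seed
means = simulate_means(K=10, dim=10, rng=rng)
x, y = sample_pair(means, rng)
p = posterior(x, means)
```

The list `p` holds the exact regression functions evaluated at `x`, which is what makes the Bayes rule and the Oracle sets computable in this simulation.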

For each $K$, the misclassification error of the classical multi-class classification Bayes rule is evaluated based on a sufficiently large dataset. It is valued at $0.22$ and at $0.60$ for $K=10$ and for $K=100$ respectively. These values are relatively high, which suggests that confusion is induced by the large number of classes. Hence, it is reasonable to apply the confidence set approach to this problem.

In the sequel, we aim at providing the estimation of the error of the $\beta$ -Oracle. To this end, for $\beta\in\{2,5,10,20\}$ and each $K$ , we repeat $B$ times the following steps.

i)

simulate two datasets $\mathcal{D}_{N}$ and $\mathcal{D}_{M}$ with $N=10000$ and $M=1000$;

ii)

based on $\mathcal{D}_{N}$, we compute the empirical counterpart of $G$ and provide an approximation of the $\beta$-Oracle $\Gamma^{*}_{\beta}$ given in Eq. (1.1) (we recall that this step requires a dataset which contains only unlabeled features);

iii)

finally, over $\mathcal{D}_{M}$, we compute the empirical counterparts $\operatorname{P}_{M}$ (of $\operatorname{P}(\Gamma^{*}_{\beta})$) and $\operatorname{I}_{M}$ (of $\operatorname{I}(\Gamma_{\beta}^{*})$).

From these estimates, we compute the mean and the standard deviation of $\operatorname{P}_{M}$ and $\operatorname{I}_{M}$. Tables 2 and 3 present values of the error and of the information which are achieved by the $\beta$-Oracle and by the $\operatorname{top}$-$\beta$ Oracle.
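The empirical counterparts used in these steps can be sketched as below, assuming that the error is the probability that the true label falls outside the confidence set and that the information is the expected cardinality of the set (both taken here as working assumptions consistent with the use of $\operatorname{P}(\cdot)$ and $\operatorname{I}(\cdot)$ in this paper). The confidence sets and labels are hypothetical toy values, and only two repetitions are shown instead of $B$.

```python
import statistics


def empirical_error(sets, labels):
    """Fraction of test points whose true label is missing from the set."""
    return sum(y not in s for s, y in zip(sets, labels)) / len(labels)


def empirical_information(sets):
    """Average cardinality of the confidence sets over the test points."""
    return sum(len(s) for s in sets) / len(sets)


# Two hypothetical repetitions, each with M = 4 test points and K = 3 classes.
repetitions = [
    ([{0, 1}, {2}, {0, 1, 2}, {1}], [0, 1, 2, 1]),
    ([{0}, {1, 2}, {2}, {0, 1}], [0, 1, 2, 1]),
]
errors = [empirical_error(sets, labels) for sets, labels in repetitions]
infos = [empirical_information(sets) for sets, _ in repetitions]

# Mean and standard deviation over the repetitions, as reported in the tables.
error_mean, error_std = statistics.mean(errors), statistics.stdev(errors)
info_mean, info_std = statistics.mean(infos), statistics.stdev(infos)
```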

We now move towards the construction of our semi-supervised plug-in estimators $\hat{\Gamma}_{\operatorname{SSE}}$. For each $K$ and each $\beta$, we evaluate the performance of $\hat{\Gamma}_{\operatorname{SSE}}$ according to three different estimators of the regression function: the $\hat{p}_{k}$ are based on random forests, softmax regression and deep learning procedures. Let us point out that for the random forest and softmax regression algorithms, the random variables $\hat{p}_{k}(X)$ appear not to be continuous. Hence Assumption 4.4 is violated. To alleviate this issue, we simply add to $\hat{p}_{k}(X)$ an independent small perturbation $|\mathcal{N}(0,e^{-10})|$. The evaluation of the performance of $\hat{\Gamma}_{\operatorname{SSE}}$ relies on the following steps:

i) simulate three datasets $\mathcal{D}_{n}$ , $\mathcal{D}_{N}$ and $\mathcal{D}_{M}$ ;

ii) based on $\mathcal{D}_{n}$ , we compute the estimators $\hat{p}_{k}$ of $p_{k}$ according to the considered procedure;

iii) based on $\mathcal{D}_{N}$ and $\hat{p}_{k}$ we compute the function $\hat{G}$ and the estimator $\hat{\Gamma}_{\operatorname{SSE}}$ as in Eq. (1.5) (we recall that this step requires a dataset which contains only unlabeled features);

iv) finally, we compute over $\mathcal{D}_{M}$ the empirical counterpart of $\operatorname{P}$ and of $\operatorname{I}$ for the considered $\hat{\Gamma}_{\operatorname{SSE}}$ .
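The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for Tables 4 and 5: the data come from a hypothetical Gaussian mixture, $\hat{p}_{k}$ is a simple class-centroid plug-in rather than a random forest, softmax regression or deep network, $\hat{G}(t)$ is taken to be the average over $\mathcal{D}_{N}$ of the number of classes with $\hat{p}_{k}(X)>t$ , the threshold is its generalized inverse at $\beta$ , and $e^{-10}$ is read as the variance of the perturbation. All function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, beta = 5, 2, 2
n, N, M = 500, 5000, 2000
means = rng.normal(scale=1.5, size=(K, d))   # hypothetical class centres

def sample(size):
    """Labels uniform on {0,...,K-1}; features Gaussian around the class centre."""
    y = rng.integers(K, size=size)
    return means[y] + rng.normal(size=(size, d)), y

def fit_centroids(x, y):
    return np.stack([x[y == k].mean(axis=0) for k in range(K)])

def predict_proba(centroids, x):
    """Plug-in estimate of p_k(x), plus the small |N(0, e^{-10})| perturbation."""
    logits = -0.5 * ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return p + np.abs(rng.normal(scale=np.exp(-5.0), size=p.shape))

def estimate_threshold(scores, beta):
    """Smallest t such that the average number of scores above t is at most beta."""
    flat = np.sort(scores.ravel())[::-1]
    return flat[int(np.floor(beta * scores.shape[0] + 1e-9))]

# i) three datasets; the labels of the second one are discarded
x_n, y_n = sample(n)
x_N, _ = sample(N)
x_M, y_M = sample(M)
# ii) estimate the regression function on the labeled sample
centroids = fit_centroids(x_n, y_n)
# iii) estimate the threshold on unlabeled features only
theta_hat = estimate_threshold(predict_proba(centroids, x_N), beta)
# iv) evaluate error and information on the test sample
sets = predict_proba(centroids, x_M) >= theta_hat
error = 1.0 - sets[np.arange(M), y_M].mean()
information = sets.sum(axis=1).mean()
print(theta_hat, error, information)
```

Only step iii) touches $\mathcal{D}_{N}$ , so increasing $N$ in this sketch sharpens the threshold, and hence the information, without requiring any additional labels.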

Again, during these experiments, we compute means and standard deviations. The parameters $K,n,N$ are fixed as follows: for $K=10$ , we fix $n=1000$ and $N\in\{100,10000\}$ ; for $K=100$ we fix $n=10000$ and $N\in\{100,10000\}$ . Finally, the size of $\mathcal{D}_{M}$ is fixed to $M=1000$ . The results are illustrated in Tables 4 and 5.

As a benchmark for the remainder of our experiments, the classical misclassification errors of the multi-class classifiers based on random forests, softmax regression and deep learning methods are respectively $0.28$ , $0.24$ , $0.29$ for $K=10$ , and $0.65$ , $0.98$ , $0.63$ for $K=100$ .

Turning to Table 2, we confirm the intuition that the error of the $\beta$ -Oracle decreases as the value of the parameter $\beta$ increases. Moreover, already for moderate values of $\beta$ , compared to $K$ , we obtain a satisfactory improvement over the standard multi-class Bayes rule. For instance, when $K=10$ and $\beta=2$ the error of the $2$ -Oracle confidence set is $0.05$ , whereas the Bayes classifier has $0.22$ ; likewise, when $K=100$ and $\beta=5$ the classification error decreases from $0.60$ to $0.20$ . Table 2 shows that the $\operatorname{top}$ - $\beta$ Oracle is slightly outperformed by the $\beta$ -Oracle in terms of the error, but still performs well.

From Tables 3 and 5, we observe that the approximation of the information is reasonably good and it gets better with $N$ the number of unlabeled data. Besides, Tables 2 and 4 demonstrate that our algorithm is sensitive to the choice of the underlying estimator $\hat{p}_{k}$ . Indeed, when $\hat{p}_{k}$ is estimated via the softmax regression, our algorithm fails to give a good approximation to the error of the $\beta$ -Oracle.

Table 4 provides similar conclusions regarding $\hat{\Gamma}_{\operatorname{SSE}}$ , though, unlike for the theoretical quantities, there are more scenarios where our method is better than its $\operatorname{top}$ - $\beta$ counterpart. Let us point out that, for $K=100$ , methods based on softmax regression perform poorly in this setup.

5 Discussions

5.1 Around continuity Assumption 1.1

The bedrock of this paper is Assumption 1.1. Based on it, we ensure that the $\beta$ -Oracle confidence set given by Eq. (1.1) is indeed of information $\beta$ . On top of that, the explicit formulation of excess risk in Proposition 3.1 relies on the continuity of function $G(\cdot)$ . Should Assumption 1.1 fail to be satisfied, then there might be no $\beta$ -Oracle given by thresholding on some level $\theta\in(0,1)$ . Indeed, assume Assumption 1.1 is not satisfied but one can build a $\beta$ -Oracle having the form $\Gamma^{*}_{\beta}(\cdot)=\left\{k\in[K]\,:\,p_{k}(\cdot)>\theta\right\}$ with some $\theta$ , then

[TABLE]

However, without the continuity, the function $G(\cdot)$ is not surjective and therefore, the equation $G(\theta)=\beta$ may have no solution, which contradicts the fact that $\operatorname{I}(\Gamma^{*}_{\beta})=\beta$ . Therefore, the settings without the continuity of $G(\cdot)$ deserve a separate study. Let us also point out that the continuity assumption implies that the $\beta$ -Oracle can also be defined as

[TABLE]

where the inequality is used in place of the equality. Indeed, under the continuity assumption, thanks to Propositions 1.3 and 3.1 we have for all confidence sets $\Gamma$ such that $\operatorname{I}(\Gamma)\leq\beta$

[TABLE]

which implies that the $\beta$ -Oracle $\Gamma^{*}_{\beta}$ is a minimizer.
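The failure of surjectivity discussed in this subsection can be seen on a degenerate toy case, which is not taken from the paper: suppose $K=3$ and $p(X)=(0.4,0.4,0.2)$ almost surely, so that Assumption 1.1 is violated. Using $G(t)=\sum_{k=1}^{K}\mathbb{P}_{X}(p_{k}(X)>t)$ , the sketch below lists the values that $G$ takes on a fine grid of thresholds.

```python
import numpy as np

# Degenerate regression vector: p(X) = (0.4, 0.4, 0.2) almost surely.
p = np.array([0.4, 0.4, 0.2])

def G(t):
    """G(t) = sum_k P(p_k(X) > t); here every probability is 0 or 1."""
    return int((p > t).sum())

grid = np.linspace(0.0, 1.0, 1001)
values = sorted({G(t) for t in grid})
print(values)  # [0, 2, 3]
```

The level $\beta=1$ is skipped because the two largest coordinates are tied, so no threshold $\theta$ produces a set of information exactly $1$ .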

5.2 Around Lipschitz continuity of $G^{-1}(\cdot)$

Under the assumptions needed in this work, and in particular the continuity assumption, we showed two important facts: i) no supervised approach can achieve fast rates, that is, rates faster than $n^{-1/2}$ ; ii) some semi-supervised approaches can achieve fast rates.

One might wonder whether extra assumptions on the problem allow a supervised method to get faster rates than $n^{-1/2}$ . We give a partial answer to this question following the recent work of Bobkov and Ledoux (2016) and more precisely their Theorem 5.11. Applying this result to our framework, we can state that there exists a positive constant $c$ such that

[TABLE]

where $\operatorname{Lip}(G^{-1})$ is the Lipschitz constant of $G^{-1}(\cdot)$ and $G^{-1}_{N}(\cdot)$ is the generalized inverse of

[TABLE]

If, on top of the above, one can show that for any $\alpha>0$ and for some positive constant $c^{\prime}$

[TABLE]

then under Lipschitz continuity of $G^{-1}(\cdot)$ , we can prove that

[TABLE]

where $\square$ stands for $\operatorname{SE}$ or $\operatorname{SSE}$ . This would illustrate that both $\hat{\Gamma}_{\operatorname{SE}}$ and $\hat{\Gamma}_{\operatorname{SSE}}$ are statistically equivalent under Lipschitz condition on $G^{-1}(\cdot)$ , that is, both reach the same rate and the impact of the unlabeled data $\mathcal{D}_{N}$ is negligible. We plan to further investigate the influence of this Lipschitz condition on the minimax rates of convergence in our future works. Since in the present contribution we do not impose this assumption on $G^{-1}(\cdot)$ , the upper bound of Bobkov and Ledoux (2016) is not applicable and we had to rely on a different approach.

5.3 Around extra logarithm

Theorems 3.5 and 4.2 demonstrate that for the excess risk and the Discrepancy, the upper and the lower bounds differ by a logarithmic factor. As we have already pointed out, this factor appears in the upper bounds due to Lemma 4.5, which relates the difference between two thresholds to the infinity norm. One might hope that if we manage to replace the infinity norm by any other $\ell_{q}$ -norm on the right hand side of the inequality in Lemma 4.5, this logarithm can be eliminated. Unfortunately, it appears that this bound is actually tight, in the sense that one can construct a distribution $\mathbb{P}$ and an estimator $\hat{p}_{k}$ for all $k\in[K]$ such that equality is achieved in Lemma 4.5. These arguments suggest that the obtained upper bound should be optimal. They also imply that the lower bounds could be further refined to get an extra logarithmic factor. Let us also mention that the continuity Assumption 1.1 in combination with the margin Assumption 2.1 are the main obstacles that did not allow us to provide better lower bounds. Nevertheless, our proofs are already involved and our results allow us to draw non-trivial conclusions even without going into the details concerning the logarithms.

6 Conclusion

In this work we have studied the minimax settings of confidence set multi-class classification. First of all, following previous works we have shown that a top- $\beta$ type procedure is inconsistent in our settings and more involved estimators should be proposed. Besides, we have demonstrated that no supervised estimator can achieve rates that are faster than $n^{-1/2}$ , which stands in contrast with other classical settings. Additionally, we have shown that fast rates are achievable for semi-supervised techniques provided that the size of the unlabeled sample is large enough. Subsequently, we have established that our lower bounds are either optimal or nearly optimal by providing a supervised and a semi-supervised estimator which are tractable in practice. Our future work shall focus on the Lipschitz condition on $G^{-1}(\cdot)$ discussed in Section 5.2; in particular, we want to understand how this extra assumption affects our lower bounds.

Appendix A Technical results

Here we provide proofs for our results. The appendix is composed of the following parts: in Appendix A we introduce some technical results used for our proofs; Appendix B is devoted to the proofs of the upper bounds; Appendix C provides the proofs of our main lower bounds; finally, in Appendix D we prove the inconsistency of top- $\beta$ approaches.

In this section we gather several technical results which are used to derive the contributions of this work. Let us start by introducing notation used in the appendix. Given any two probability measures $\mathbb{P}_{1},\mathbb{P}_{2}$ on some measurable space $(\mathcal{X},\mathcal{A})$ the Kullback–Leibler divergence between $\mathbb{P}_{1}$ and $\mathbb{P}_{2}$ is defined as

[TABLE]

and the total variation distance is defined as

[TABLE]

We start with Fano’s inequality in the form proved by (Birgé, 2005).

Lemma A.1 (Fano’s inequality (Birgé, 2005)).

Let $\left\{\mathbb{P}_{i}\right\}_{i=0}^{m}$ be a finite family of probability measures on $(\mathcal{X},\mathcal{A})$ and let $\left\{A_{i}\right\}_{i=0}^{m}$ be a finite family of disjoint events such that $A_{i}\in\mathcal{A}$ for each $i=0,\ldots,m$ . Then,

[TABLE]

Lemma A.2 (Pinsker’s inequality).

Given any two probability measures $\mathbb{P}_{1},\mathbb{P}_{2}$ on some measurable space $(\mathcal{X},\mathcal{A})$ we have

[TABLE]
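As a numerical sanity check of Pinsker's inequality, the sketch below assumes its standard form, namely that the total variation distance is at most $\sqrt{\operatorname{KL}/2}$ with the total variation taken as the supremum over events (half the $\ell_{1}$ distance for discrete laws). The random discrete distributions and the function names are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """Kullback-Leibler divergence between two strictly positive discrete laws."""
    return float(np.sum(p * np.log(p / q)))

def tv(p, q):
    """Total variation distance sup_A |P(A) - Q(A)| = half of the l1 distance."""
    return 0.5 * float(np.abs(p - q).sum())

gaps = []
for _ in range(1000):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    gaps.append(np.sqrt(kl(p, q) / 2.0) - tv(p, q))

min_gap = min(gaps)   # non-negative (up to rounding) if the inequality holds
print(min_gap)
```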

Lemma A.3 (Hoeffding’s inequality (Hoeffding, 1963)).

Let $b>0$ be a real number, and $N$ be a positive integer. Let $X_{1},\ldots,X_{N}$ be $N$ independent random variables having values in $[0,b]$ , then

[TABLE]
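The following sketch compares an empirical tail probability with the Hoeffding bound, assuming the usual two-sided form $2\exp(-2Nt^{2}/b^{2})$ for the deviation of the sample mean of independent $[0,b]$ -valued variables. The values of $b$ , $N$ , $t$ and the choice of scaled Bernoulli variables are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
b, N, t, reps = 3.0, 200, 0.3, 20000

# Sample means of N independent variables equal to 0 or b with probability 1/2.
sample_means = b * rng.binomial(N, 0.5, size=reps) / N
deviation = np.abs(sample_means - b / 2.0)

empirical_tail = float((deviation >= t).mean())
hoeffding_bound = float(2.0 * np.exp(-2.0 * N * t**2 / b**2))
print(empirical_tail, hoeffding_bound)
```

In the proofs of Appendix B this inequality is applied with $b=K$ , since the number of classes exceeding a threshold lies in $[0,K]$ .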

Proposition A.4 (Properties of the generalized inverse).

Let $X\in\mathbb{R}^{d}$ and $\mathbb{P}_{X}$ be a Borel measure on $\mathbb{R}^{d}$ , let $p:\mathbb{R}^{d}\rightarrow[0,1]^{K}$ be a vector function, we define for all $t\in[0,1]$ and all $\beta\in(0,K)$

[TABLE]

Then,

• for all $t\in(0,1)$ and $\beta\in(0,K)$ we have $G^{-1}(\beta)\leq t\iff G(t)\leq\beta$ ;

• if for all $k\in[K]$ the mappings $t\mapsto\mathbb{P}_{X}(p_{k}(X)>t)$ are continuous on $(0,1)$ , then for all $\beta\in(0,K)$ we have $G(G^{-1}(\beta))=\beta$ .
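Both properties can be verified by hand on a toy case that is not taken from the paper. Assume $K=2$ , $X$ uniform on $[0,1]$ , $p_{1}(x)=x$ and $p_{2}(x)=1-x$ , with $G(t)=\sum_{k}\mathbb{P}_{X}(p_{k}(X)>t)$ and $G^{-1}(\beta)=\inf\{t\in[0,1]:G(t)\leq\beta\}$ :

```latex
\begin{align*}
G(t) &= \mathbb{P}_X(X > t) + \mathbb{P}_X(1 - X > t) = 2(1 - t), \qquad t \in [0,1], \\
G^{-1}(\beta) &= \inf\{t \in [0,1] : 2(1 - t) \leq \beta\} = 1 - \tfrac{\beta}{2}, \qquad \beta \in (0,2), \\
G^{-1}(\beta) \leq t &\iff 1 - \tfrac{\beta}{2} \leq t \iff 2(1 - t) \leq \beta \iff G(t) \leq \beta, \\
G\bigl(G^{-1}(\beta)\bigr) &= 2\Bigl(1 - \bigl(1 - \tfrac{\beta}{2}\bigr)\Bigr) = \beta .
\end{align*}
```

For $\beta=1$ the threshold is $1/2$ , and thresholding at this level keeps exactly one label for almost every $x$ , in agreement with an information equal to $1$ .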

The next result is an analogue of the classical inverse transform theorem (van der Vaart, 1998, Lemma 21.1) and was already established by Denis and Hebiri (2017).

Lemma A.5.

Let $\varepsilon$ be distributed according to the uniform distribution on $[K]$ and let $Z_{1},\ldots,Z_{K}$ be $K$ real valued random variables independent from $\varepsilon$ , such that the function $t\mapsto H(t)$ defined as

[TABLE]

is continuous. Consider the random variable $Z=\sum_{k=1}^{K}Z_{k}{\bf 1}_{\{\varepsilon=k\}}$ and let $U$ be distributed according to the uniform distribution on $[0,1]$ . Then

[TABLE]

where $H^{-1}$ denotes the generalized inverse of $H$ .

Proof.

First we note that for every $t\in[0,1]$ , $\mathbb{P}\left(H(Z)\leq t\right)=\mathbb{P}\left(Z\leq H^{-1}(t)\right)$ . Moreover, we have

[TABLE]

To conclude the proof, we observe that

[TABLE]

∎
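Lemma A.5 can be illustrated by simulation. The sketch below assumes that $H$ is the mixture distribution function $H(t)=\frac{1}{K}\sum_{k=1}^{K}\mathbb{P}(Z_{k}\leq t)$ and that the conclusion of the lemma includes the statement that $H(Z)$ is uniform on $[0,1]$ ; the choice $Z_{k}\sim\mathrm{Uniform}[0,k]$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
K, n = 4, 20000
scales = np.arange(1, K + 1)            # Z_k ~ Uniform[0, k]

def H(t):
    """Mixture CDF H(t) = (1/K) * sum_k P(Z_k <= t), continuous in t."""
    t = np.asarray(t, dtype=float)
    return np.clip(t[..., None] / scales, 0.0, 1.0).mean(axis=-1)

eps = rng.integers(K, size=n)            # uniform index, independent of the Z_k
Z = rng.uniform(0.0, scales[eps])        # Z equals Z_k on the event {eps = k}

U = np.sort(H(Z))
# Distance between the empirical CDF of H(Z) and the uniform CDF.
ks_distance = float(np.max(np.abs(U - (np.arange(1, n + 1) - 0.5) / n)))
print(ks_distance)
```

A small distance is consistent with $H(Z)$ being uniform, which is the property used in Appendix B to control the probability of the peeling events.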

Appendix B Upper bounds

In this section we prove Theorems 4.1 and 4.2. It will be clear from our analysis that the proof of Theorem 4.1 follows directly from Theorem 4.2 by setting $N=0$ in the statement of Theorem 4.2. Thus, in this section for simplicity we omit the subscript $\operatorname{SSE}$ from $\hat{\Gamma}_{\operatorname{SSE}}$ . Recall that our dataset consists of three parts $\mathcal{D}_{\lfloor n/2\rfloor},\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_{N}$ . The set $\mathcal{D}_{\lfloor n/2\rfloor}$ is used to construct an estimator $\hat{p}$ of the regression function $p$ , that is, $\hat{p}$ is independent from both $\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_{N}$ . The other two sets $\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_{N}$ are used in a semi-supervised manner to estimate the threshold, that is, we erase the labels from $\mathcal{D}_{\lceil n/2\rceil}$ . Let $\beta\in[K-1]$ , and also recall the definition of the proposed semi-supervised estimator for a given $x\in\mathbb{R}^{d}$

[TABLE]

with $\hat{p}_{k}(x)$ satisfying Assumptions 4.3 and 4.4 for all $k\in[K]$ . Moreover, $\hat{G}^{-1}(\beta)$ is defined as the generalized inverse of

[TABLE]

where $t\in[0,1]$ . Additionally, recall that the $\beta$ -Oracle is given as

[TABLE]

where $G^{-1}(\cdot)$ is the generalized inverse of

[TABLE]

Lastly, let us re-introduce an idealized version $\tilde{\Gamma}$ of the proposed estimator $\hat{\Gamma}$ which ’knows’ the marginal distribution $\mathbb{P}_{X}$ of the feature vector $X\in\mathbb{R}^{d}$ as

[TABLE]

with $\tilde{G}(t)\vcentcolon=\sum_{k=1}^{K}\mathbb{P}_{X}(\hat{p}_{k}(X)>t)$ , conditionally on the data. The following result is needed to relate the threshold $\tilde{G}^{-1}(\beta)$ of $\tilde{\Gamma}$ to the true value of the threshold $G^{-1}(\beta)$ .

Lemma B.1 (Upper-bound on the thresholds).

Let $X\in\mathbb{R}^{d}$ and $\mathbb{P}_{X}$ be a Borel measure on $\mathbb{R}^{d}$ . For two vector functions $p,\hat{p}:\mathbb{R}^{d}\rightarrow[0,1]^{K}$ , we define

[TABLE]

If for all $k\in[K]$ the mapping $t\mapsto\mathbb{P}_{X}(p_{k}(X)>t)$ is continuous on $(0,1)$ , then for every $\beta\in(0,K)$

[TABLE]

Proof.

The proof of this result is very similar to the proof of (Bobkov and Ledoux, 2016, Theorem 2.12). We start by defining the following quantity

[TABLE]

Due to the definition of $h^{*}$ we have that for all $t\in[0,1]$

[TABLE]

that is, applying Proposition A.4 to the second inequality we get for all $t\in[0,1]$

[TABLE]

thus, for $t=G^{-1}(\beta)$ with $\beta\in(0,K)$ thanks to Proposition A.4 we get

[TABLE]

The inequality $\tilde{G}^{-1}(\beta)-G^{-1}(\beta)\leq h^{*}$ is obtained in the same way. Thus, we have proved that

[TABLE]

Finally, notice that for all $t\in[0,1]$

[TABLE]

where we used the fact that for all $k\in[K]$

[TABLE]

and $\mathbb{P}_{X}\left(\left|\hat{p}_{k}(X)-p_{k}(X)\right|\leq\left\lVert\hat{p}-p\right\rVert_{\infty,\mathbb{P}_{X}}\right)=1$ . Therefore by definition of $h^{*}$ , we can write $h^{*}\leq\left\lVert\hat{p}-p\right\rVert_{\infty,\mathbb{P}_{X}}$ and we conclude. ∎
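The conclusion of this proof, namely that the two thresholds differ by at most the sup-norm distance between $\hat{p}$ and $p$ , can be checked numerically. The sketch below uses an empirical measure in place of $\mathbb{P}_{X}$ , a randomly generated stand-in for $p$ and a hypothetical perturbed estimate $\hat{p}$ ; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, beta = 2000, 4, 2

def threshold(scores, beta):
    """inf{t : G(t) <= beta} for G(t) = average over rows of #{k : scores[i, k] > t}."""
    flat = np.sort(scores.ravel())[::-1]
    return flat[int(np.floor(beta * scores.shape[0] + 1e-9))]

p = rng.dirichlet(np.ones(K), size=n)    # stand-in for the values p(X_i)
noise = rng.uniform(-0.05, 0.05, size=p.shape)
p_hat = np.clip(p + noise, 0.0, 1.0)     # perturbed estimate with values in [0, 1]

gap = abs(threshold(p_hat, beta) - threshold(p, beta))
sup_norm = float(np.abs(p_hat - p).max())
print(gap, sup_norm)
```

The inequality `gap <= sup_norm` holds for every draw, since moving each score by at most `sup_norm` moves every order statistic of the pooled scores by at most the same amount.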

We are now in a position to prove Theorem 4.2. Let us point out that the most difficult part of Theorem 4.2 is the upper bound on the excess risk. The upper bound on the discrepancy follows the same arguments as the ones we use for the excess risk.

Excess risk and discrepancy: to upper-bound the excess risk we first separate it into two parts as

[TABLE]

Recall that thanks to Proposition 3.1 we have

[TABLE]

Moreover, let us point out that if some $k\in\tilde{\Gamma}(X)\triangle\Gamma^{*}_{\beta}(X)$ then either

[TABLE]

holds. Thus on the event $k\in\tilde{\Gamma}(X)\triangle\Gamma^{*}_{\beta}(X)$ we have

[TABLE]

Therefore, for $R_{1}$ using Lemma B.1 and the observations above we can write

[TABLE]

finally, using the margin Assumption 2.1 we get, almost surely with respect to the data,

[TABLE]

Integrating both sides over the data and using Assumption 4.3 we get

[TABLE]

For $R_{2}$ the following trivial upper-bound holds

[TABLE]

now, thanks to the first property of Proposition A.4 we can write

[TABLE]

To finish our proof we make use of the peeling technique of (Audibert and Tsybakov, 2007, Lemma 3.1). That is, we define for $\delta>0$ and $k\in[K]$

[TABLE]

Since, for every $k\in[K]$ , the events $({A}^{k}_{j})_{j\geq 0}$ are mutually exclusive, we deduce

[TABLE]

Now, we consider $\varepsilon$ uniformly distributed on $[K]$ independent of the data and $X$ . Conditional on the data and under Assumption 4.4, we apply Lemma A.5 with $Z_{k}=\hat{p}_{k}(X)$ , $Z=\sum_{k=1}^{K}Z_{k}{\bf 1}_{\{\varepsilon=k\}}$ and then obtain that $\tilde{G}(Z)$ is uniformly distributed on $[0,K]$ . Therefore, for all $j\geq 0$ and $\delta>0$ , we deduce

[TABLE]

Hence, for all $j\geq 0$ , we obtain

[TABLE]

Next, we observe that for all $j\geq 1$

[TABLE]

Thus, we obtain that

[TABLE]

almost surely with respect to the data. Integrating both sides with respect to the data we get

[TABLE]

recall that the function ${\bf 1}_{\left\{A_{j}^{k}\right\}}$ for all $j\geq 0$ and $k\in[K]$ is independent from $\mathcal{D}_{\lceil n/2\rceil},\mathcal{D}_{N}$ , thus we can write

[TABLE]

Now, since conditional on $(\mathcal{D}_{\lfloor n/2\rfloor},X)$ , $\hat{G}(\hat{p}_{k}(X))$ is an empirical mean of i.i.d. random variables with common mean $\tilde{G}(\hat{p}_{k}(X))\in[0,K]$ , we deduce from Hoeffding’s inequality that

[TABLE]

Therefore, treating $A_{0}^{k}$ separately, we get from inequalities of Eqs. (B.3), (B.4), and (B.5)

[TABLE]

Finally, choosing $\delta=\dfrac{K}{\sqrt{N+\lceil n/2\rceil}}$ in the above inequality, we finish the proof.

Hamming risk: here we provide an upper bound on the Hamming risk. First, by the triangle inequality we can write for the proposed estimator $\hat{\Gamma}$ and the pseudo Oracle $\beta$ set $\tilde{\Gamma}$

[TABLE]

Notice that for the term $\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathbb{E}_{X\sim\mathbb{P}_{X}}\left|\hat{\Gamma}(X)\triangle\tilde{\Gamma}(X)\right|$ we can re-use the proof technique used for the term $R_{2}$ in Eq. (B). Thus, it remains to upper-bound the term $\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathbb{E}_{X\sim\mathbb{P}_{X}}\left|\tilde{\Gamma}(X)\triangle\Gamma^{*}_{\beta}(X)\right|$ . The proof of this part closely follows the machinery used in Denis and Hebiri (2017); however, let us mention that they used this method to obtain a bound on the Discrepancy, which leads to a sub-optimal rate. Nevertheless, their approach gives a correct rate if instead of the Discrepancy we bound the Hamming distance. For the sake of completeness we write the principal parts of the proof here.

First of all, by the definition of sets $\Gamma^{*}_{\beta}$ and $\tilde{\Gamma}$ we can write for $(*)=\mathbb{E}_{X\sim\mathbb{P}_{X}}\left|\tilde{\Gamma}(X)\triangle\Gamma^{*}_{\beta}(X)\right|$

[TABLE]

Now if $\hat{p}_{k}(X)\geq\tilde{G}^{-1}(\beta)$ and ${p}_{k}(X)<{G}^{-1}(\beta)$ we can have the following situations

• if $\tilde{G}^{-1}(\beta)>{G}^{-1}(\beta)$ , then $\left|{p}_{k}(X)-{G}^{-1}(\beta)\right|\leq\left|\hat{p}_{k}(X)-p_{k}(X)\right|$ ;

• if $\tilde{G}^{-1}(\beta)\leq{G}^{-1}(\beta)$ , then either $\left|{p}_{k}(X)-{G}^{-1}(\beta)\right|\leq\left|\hat{p}_{k}(X)-p_{k}(X)\right|$ or $\hat{p}_{k}(X)\in\left(\tilde{G}^{-1}(\beta),{G}^{-1}(\beta)\right)$ .

Similar conditions are satisfied if $\hat{p}_{k}(X)<\tilde{G}^{-1}(\beta)$ and ${p}_{k}(X)\geq{G}^{-1}(\beta)$ . Using the above arguments we can upper-bound $(*)$ as

[TABLE]

Thanks to the continuity Assumption 4.4 on the estimator and the continuity Assumption 1.1 on the distribution we clearly have $\tilde{G}\left(\tilde{G}^{-1}(\beta)\right)=\beta={G}\left({G}^{-1}(\beta)\right)$ . Moreover, we can write

[TABLE]

Thus, our bound reads as

[TABLE]

Finally, in order to upper-bound the term above one can use the peeling argument of Audibert and Tsybakov (2007) applied with the exponential concentration inequality provided by Assumption 4.3. We omit this part of the proof here and refer the reader to Denis and Hebiri (2017) or to Audibert and Tsybakov (2007) for a complete argument.

Let us emphasize that the argument above is only possible due to the continuity Assumptions 1.1, 4.4 on the distribution and the estimator respectively.

Appendix C Proof of the lower bounds

This section is devoted to the proof of the lower bounds provided by Theorems 3.4-3.5. Before proceeding to the proofs let us briefly sketch the high-level strategy used in this work. In order to prove the lower bounds of Theorems 3.4-3.5 we actually prove two separate lower bounds on the minimax risk. Clearly, if some non-negative quantity is lower-bounded by two different values, then it is lower-bounded by the maximum of the two. The two lower bounds that we prove are naturally connected with the proposed two-step estimator, that is, the first lower bound is connected with the problem of non-parametric estimation of $p_{k}$ for all $k\in[K]$ and the second describes the estimation of the unknown threshold $G^{-1}(\beta)$ .

In particular, the first lower bound is closely related to the one provided in (Audibert and Tsybakov, 2007; Rigollet and Vert, 2009), though, crucially the continuity Assumption 1.1 makes the proof more involved. The second lower bound is based on two hypotheses testing and is derived by constructing two different marginal distributions of $X\in\mathbb{R}^{d}$ and a fixed regression vector $p(\cdot)$ . In this part we make use of Pinsker’s inequality recalled in Lemma A.2.

In order to discriminate between the supervised and the semi-supervised procedures we make use of Definition 1.4. Notice that every supervised procedure, thanks to Definition 1.4, is not ’sensitive’ to the expectation taken w.r.t. the unlabeled dataset $\mathcal{D}_{N}$ , that is, randomness is only induced by the labeled dataset $\mathcal{D}_{n}$ . This strategy allows us to eliminate the dependence of the lower bound on the size of the unlabeled dataset $\mathcal{D}_{N}$ for supervised procedures. Indeed, let $\hat{\Gamma}$ be any supervised estimator in the sense of Definition 1.4, then for any real valued function of confidence sets $Z$ we have

[TABLE]

with $\mathcal{D}^{\prime}_{N}$ being an arbitrary set of $N$ points in $\mathbb{R}^{d}$ .

C.1 Part I: $(N+n)^{-1/2}$

Here we prove that the rate $(N+n)^{-1/2}$ is optimal for semi-supervised methods; as already mentioned, the rate for the supervised methods can be obtained by formally setting $N=0$ . The constants $C^{\prime},C,c$ are always assumed to be independent of $N,n$ and can differ from line to line. Let us fix $\beta\in\{1,\ldots,\lfloor K/2\rfloor-1\}$ and $K\geq 5$ . For a positive constant $C<1/2$ we define the following sequence

[TABLE]

To prove the lower bound we construct two distributions $\mathbb{P}_{0}$ and $\mathbb{P}_{1}$ on $\mathbb{R}^{d}$ sharing the same regression function $p(\cdot)=(p_{1}(\cdot),\ldots,p_{K}(\cdot))$ and with different marginals admitting densities $\mu_{0},\mu_{1}$ . First, for a fixed parameter $0<\rho<1$ and fixed constants $0<r_{0}<r_{1}<r_{2}<r_{3}<r_{4}$ to be specified we define the following sets

[TABLE]

Let us denote by $o_{i}=(r_{i}+\rho,0,\ldots,0)^{\top}$ for $i=1,2,3$ the centers of $\mathcal{X}_{1}$ , $\mathcal{X}_{2}$ and $\mathcal{X}_{3}$ . Using these sets we define the regression vector as

[TABLE]

In order to define the functions $\varphi_{i}$ for $i=0,\ldots,4$ we first define a one dimensional function of two real-valued parameters $a<b$

[TABLE]

Figure 1 illustrates the behavior of $\psi_{a,b}$ function in one dimension. Note that for every $a,b\in\mathbb{R}$ the function above is infinitely smooth. Using the definition of $\psi_{a,b}$ we define the functions $\varphi_{i}$ for $i=0,\ldots,4$ as

[TABLE]

and the constant $C^{\prime}\leq 1$ is chosen small enough so that each function $\varphi_{i}$ for $i=0,\ldots,4$ is $(\gamma,L)$ -Hölder. Let us point out that such value $C^{\prime}$ exists and is independent of $n,N$ , indeed, the mapping

[TABLE]

is infinitely smooth, thus it is $(\gamma,L)$ -Hölder for a properly chosen $C^{\prime}$ . Figure 2 demonstrates the behavior of the considered construction in one dimension. Note that $\varphi_{i}(x)$ for $i=1,3$ are obtained from the previous mapping by a re-scaling which preserves the Hölder constant $L$ . The same reasoning applies to $\varphi_{i}$ for $i=0,2,4$ .

Since $\beta<K/2$ one can check that the following relations hold true

[TABLE]

which will help us to ensure that the thresholds under $\mathbb{P}_{0},\mathbb{P}_{1}$ are $\frac{3K+2\beta}{8K\beta}$ and $\frac{K+6\beta}{8K\beta}$ respectively. Now, we define two marginal distributions $\mu_{0},\mu_{1}$ by their densities as

[TABLE]

and both $\mu_{0},\mu_{1}$ are equal to zero in unspecified regions. Clearly, the strong density assumption is satisfied on $\mathcal{X}_{0}$ and $\mathcal{X}_{4}$ since the density is lower and upper-bounded by a constant independent of both $N,n$ . The parameter $\rho$ is chosen such that the strong density assumption on $\mathcal{X}_{i}$ for $i=1,2,3$ is satisfied. Notice that

[TABLE]

for some constant $c>0$ independent of $N,n$ , thus we set $\rho=C(N+n)^{-1/2d}$ . For these hypotheses one can easily check that the thresholds $G^{-1}_{0}(\beta),G^{-1}_{1}(\beta)$ and the optimal $\beta$ -sets $\Gamma^{*}_{0},\Gamma^{*}_{1}$ are given as

[TABLE]

The margin assumption: we are now in a position to check the margin Assumption 2.1. Let $t_{0}=\frac{1}{2}\left(\frac{K-2\beta}{8K\beta}\bigwedge\frac{1}{4K}\right)$ , thus for every $k\in\{2\beta+1,\ldots,K\}$ and every $t\leq t_{0}$ we have

[TABLE]

moreover for every $k\in\{1,\ldots,2\beta\}$ and every $t\leq t_{0}$ we can write

[TABLE]

Hence, for the [math] hypothesis there exists $c$ independent of $N,n$ such that

[TABLE]

Therefore we can write using the strong density assumption

[TABLE]

Finally notice that for every $x\in\mathbb{R}^{d}$ such that $\left\lVert x\right\rVert\leq 1/2$ we have for some $C>0$

[TABLE]

which implies that for some positive $C,C^{\prime}$ independent of $N,n$ we can write

[TABLE]

This implies that as long as $\alpha\leq d/(2\lceil\frac{\gamma}{2}\rceil)$ (and since we have $\gamma\leq 2\lceil\frac{\gamma}{2}\rceil$ ) the margin assumption is satisfied. Moreover, these conditions imply that $\alpha\gamma\leq d$ , which we will also require while proving the supervised part of the rate. The same reasoning can be carried out for the case of the first hypothesis $\mathbb{P}_{1}$ on the set $\mathcal{X}_{3}$ .

Finally, the parameters $r_{0},r_{1},r_{2},r_{3}$ are chosen as constants independent of $n,N$ such that there exists a smooth connection between the parts of the regression functions $p_{k}(\cdot)$ which are defined on $\mathcal{X}_{0},\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{X}_{3},\mathcal{X}_{4}$ . Notice that such a choice is possible since, by construction, the functions $\varphi_{i}$ for $i=0,1,2,3,4$ vanish on the boundaries of $\mathcal{X}_{0},\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{X}_{3},\mathcal{X}_{4}$ . Thus in the region $\mathbb{R}^{d}\setminus\bigcup_{i=0}^{4}\mathcal{X}_{i}$ it is sufficient to construct a function which connects four different constants smoothly. We omit these details to avoid overcomplicating this part and hope that the guidelines provided above are sufficient for the understanding.

Notice that the constructed distributions satisfy Assumption 1.1 since the measures are only defined on $\mathcal{X}_{0},\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{X}_{3},\mathcal{X}_{4}$ and the regression functions on these sets are not concentrated around any constant.

Before proceeding to the final stage of the proof let us mention that in what follows we use the de Finetti (de Finetti, 1972, 1974) notation which is common in probability. That is, given a probability measure $\mathbb{P}$ on some measurable space $(\Omega_{0},\mathcal{A}_{0})$ and a measurable function $X:(\Omega_{0},\mathcal{A}_{0})\to(\mathbb{R},\text{Borel}(\mathbb{R}))$ we write

[TABLE]

Bound on the KL-divergence: we start by computing the KL-divergence between $\mu_{0}$ and $\mu_{1}$

[TABLE]

Lower bound for the Hamming risk: first of all let us introduce the following notation for $i=0,1$

[TABLE]

Recall that we are interested in the following quantity

[TABLE]

since the hypotheses $\mathbb{P}_{0},\mathbb{P}_{1}\in\mathcal{P}$ we can write

[TABLE]

where $(*)$ is defined as

[TABLE]

thus, for the Hamming risk we can write

[TABLE]

Now we focus our attention on the sum of the two Hamming differences appearing on the right hand side of the above inequality

[TABLE]

Substituting this lower bound into the initial inequality we arrive at

[TABLE]

which implies the desired lower bound on the Hamming risk.

Lower bound for the $\beta$ excess risk: this part is analogous to the case of the Hamming distance. Let us recall that for every $\hat{\Gamma}$ we have the following expression for $i=0,1$

[TABLE]

Again, recall that we are interested in

[TABLE]

similarly to the previous case, since the hypotheses $\mathbb{P}_{0},\mathbb{P}_{1}\in\mathcal{P}$ we can write

[TABLE]

where $(**)$ is defined as

[TABLE]

we can write

[TABLE]

and we continue in a similar fashion

[TABLE]

since $\mu_{0}(x)=\mu_{1}(x)$ for all $x\in\mathcal{X}_{2}$ we obtain

[TABLE]

then, since $\frac{\varphi_{2}(x)}{\beta}\leq\frac{K-2\beta}{8K\beta}$ for all $x\in\mathcal{X}_{2}$ , we have

[TABLE]

Thus,

[TABLE]

This concludes the first part of the lower bounds.

C.2 Part II: $n^{-\alpha\gamma/(2\gamma+d)}$

In this section we prove that in the case of the Hamming risk $\Psi^{\operatorname{H}}$ the rate $n^{-\alpha\gamma/(2\gamma+d)}$ is minimax optimal. Notice that, thanks to Proposition 3.1, a lower bound of order $n^{-\alpha\gamma/(2\gamma+d)}$ on the Hamming risk $\Psi^{\operatorname{H}}$ immediately implies a lower bound of order $n^{-(\alpha+1)\gamma/(2\gamma+d)}$ on both $\Psi^{\operatorname{E}}$ and $\Psi^{\operatorname{D}}$ .

The proof is based on the reduction of the Hamming risk to a multiple hypotheses testing problem and an application of Fano’s inequality provided by Birgé (2005) recalled in Lemma A.1.

Assume that $K\geq 5$ and fix some $\beta\in\{2,\ldots,(K-2)\wedge\lfloor K/2\rfloor\}$ , define the regular grid on $[0,1]^{d}$ as

[TABLE]

and denote by $n_{q}(x)\in G_{q}$ the point of the grid $G_{q}$ closest to the point $x\in\mathbb{R}^{d}$ . Such a grid defines a partition of the unit cube $[0,1]^{d}\subset\mathbb{R}^{d}$ denoted by $\mathcal{X}^{\prime}_{1},\ldots,\mathcal{X}^{\prime}_{q^{d}}$ . Besides, denote by $\mathcal{X}^{\prime}_{-j}\coloneqq\{x\in\mathbb{R}^{d}\,:\,-x\in\mathcal{X}_{j}^{\prime}\}$ for all $j=1,\ldots,q^{d}$ . For a fixed integer $m\leq q^{d}$ and for any $i\in\{1,\ldots,m\}$ define $\mathcal{X}_{i}\coloneqq\mathcal{X}_{i}^{\prime}$ , $\mathcal{X}_{-i}\coloneqq\mathcal{X}_{-i}^{\prime}$ . Additionally we introduce the following set $\mathcal{X}_{0}=\mathcal{B}(0,(4q)^{-1})$ . For every $w\in W\vcentcolon=\{-1,1\}^{m}$ we build the distribution $\mathbb{P}_{w}\in\mathcal{P}_{W}$ such that the marginal distribution $\mathbb{P}_{w,X}$ is independent of $w\in\{-1,1\}^{m}$ and the regression vector $(p^{w}_{1}(x),\ldots,p^{w}_{K}(x))$ is constructed as

[TABLE]

where $v\in[0,1]$ , $\varphi:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}$ , and $\xi:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}$ are to be specified. The constants $v,c^{\prime}$ are set as

[TABLE]
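The grid and the closest-point map $n_{q}(\cdot)$ introduced at the start of this construction can be sketched as follows. The sketch assumes that $G_{q}$ is the usual regular grid of cell centres $\bigl(\tfrac{2k_{1}+1}{2q},\ldots,\tfrac{2k_{d}+1}{2q}\bigr)$ with $k_{i}\in\{0,\ldots,q-1\}$ , as in Audibert and Tsybakov (2007); this form and the function names are assumptions made for illustration.

```python
import numpy as np
from itertools import product

def regular_grid(q, d):
    """Cell centres ((2k_1+1)/(2q), ..., (2k_d+1)/(2q)), k_i in {0, ..., q-1}."""
    axis = (2 * np.arange(q) + 1) / (2 * q)
    return np.array(list(product(axis, repeat=d)))

def closest_grid_point(x, q):
    """n_q(x): nearest grid centre, computed coordinate by coordinate."""
    idx = np.clip(np.floor(np.asarray(x, dtype=float) * q), 0, q - 1)
    return (2 * idx + 1) / (2 * q)

grid = regular_grid(q=4, d=2)                    # 4**2 = 16 centres in [0, 1]^2
nearest = closest_grid_point([0.30, 0.90], q=4)
print(grid.shape, nearest)                       # (16, 2) [0.375 0.875]
```

Under this assumption each cell of the partition has side $1/q$ , so the balls of radius $(4q)^{-1}$ centred at the grid points, which carry the mass of the marginal distribution, are pairwise disjoint.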

The function $\xi$ is constructed as

[TABLE]

the function $\bar{u}$ is infinitely differentiable, is equal to zero on $(-\infty,0]$ and to one on $[1,+\infty)$ . Figure 3 shows the behavior of $1-\bar{u}$ . Taking the constant $\rho>0$ big enough independently of $N,n$ we can ensure that the function $\xi$ is $(\gamma,L)$ -Hölder.

The function $\phi$ is constructed similarly to the previous part of the rate, that is, for $\phi$ we choose

[TABLE]

with $C_{\phi}$ being sufficiently small such that $\phi(\cdot)$ is $(\gamma,L)$ -Hölder and upper-bounded by $c^{\prime}/2\wedge v/4$ . For the function $\varphi$ we consider the following construction

[TABLE]

where $u_{2}(\cdot)$ is defined as

[TABLE]

Figure 4 illustrates the behavior of this function. The constant $C_{\varphi}$ is chosen in such a way that the constructed function $\varphi(\cdot)$ is $(\gamma,L)$ -Hölder and upper-bounded by $c^{\prime}/2\wedge v/4$ . Notice that the function $\varphi(x)$ for all $x\in\mathcal{B}(0,(4q)^{-1})$ satisfies

[TABLE]

Finally, the function $g$ is any $(\gamma,L)$ -Hölder function with sufficiently bounded variation which is not concentrated around any constant, for example

[TABLE]

Here $C_{g}$ is chosen small enough to ensure that $g$ is $(\gamma,L)$ -Hölder and has variation bounded by $c^{\prime}/2\wedge v/4$ .

It remains to define the marginal distribution of the vector $X\in\mathbb{R}^{d}$ . We select a Euclidean ball in $\mathbb{R}^{d}$ denoted by $A_{0}$ that has an empty intersection with $\mathcal{B}(0,\sqrt{d}+\rho)$ and whose Lebesgue measure is $\operatorname{Leb}(A_{0})=1-mq^{-d}$ . The density $\mu$ of the marginal distribution of $X\in\mathbb{R}^{d}$ is constructed as

•

$\mu(x)=\frac{\tau}{\operatorname{Leb}(\mathcal{B}(0,(4q)^{-1}))}$ for every $z\in G_{q}\cup\{0\}$ and every $x\in\mathcal{B}(z,(4q)^{-1})$ or $x\in\mathcal{B}(-z,(4q)^{-1})$ ,

•

$\mu(x)=\frac{1-2m\tau}{\operatorname{Leb}(A_{0})}$ for every $x\in A_{0}$ ,

•

$\mu(x)=0$ for every other $x\in\mathbb{R}^{d}$ ,

for some $\tau$ to be specified. Now, we check that the distributions constructed above belong to the set $\mathcal{P}$ for every $w\in W$ . Namely, we check the following list of assumptions:

•

the functions $p_{1}^{w},\ldots,p_{K}^{w}$ define a regression function for every $w\in W$ , that is, for each $x\in\mathbb{R}^{d}$ we have $\sum_{k=1}^{K}p_{k}^{w}(x)=1$ and $0\leq p_{k}^{w}(x)\leq 1$ ,

•

the functions $p_{1}^{w},\ldots,p_{K}^{w}$ are $(\gamma,L)$ -Hölder,

•

the function $G_{w}(t)\vcentcolon=\sum_{k=1}^{K}\int_{\mathbb{R}^{d}}{\bf 1}_{\{p_{k}^{w}(x)\geq t\}}\mu(x)dx$ is continuous,

•

the threshold $G^{-1}(\beta)$ is equal to $v$ for every $w\in W$ ,

•

the marginal distribution satisfies the strong density assumption,

•

the regression function satisfies $\alpha$ -margin assumption.
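To make the objects in this checklist concrete, the following sketch computes an empirical analogue of $G_{w}(t)$ , namely the average number of classes whose probability is at least $t$ , together with a threshold of the form $\inf\{t\in[0,1]:G(t)\leq\beta\}$ . The form of the generalized inverse is an assumption made for this example, and the function names and the toy probability vectors are hypothetical.

```python
def expected_size(t, prob_vectors):
    """Empirical analogue of G(t): average over the sample of the number
    of classes k with p_k(x) >= t."""
    total = sum(sum(1 for p in vec if p >= t) for vec in prob_vectors)
    return total / len(prob_vectors)


def threshold(beta, prob_vectors):
    """inf{t in [0, 1] : G_hat(t) <= beta} for the step function above.

    G_hat is non-increasing and only jumps at observed probabilities, so it
    is enough to scan those values (and 0) in increasing order.
    """
    n = len(prob_vectors)
    candidates = sorted({0.0} | {p for vec in prob_vectors for p in vec})
    for c in candidates:
        # Value of G_hat immediately to the right of c.
        just_above = sum(1 for vec in prob_vectors for p in vec if p > c) / n
        if expected_size(c, prob_vectors) <= beta or just_above <= beta:
            return c
    return 1.0


if __name__ == "__main__":
    # Two toy points with K = 3 classes each (illustrative values only).
    sample = [(0.5, 0.3, 0.2), (0.6, 0.25, 0.15)]
    print(expected_size(0.0, sample))   # every class is counted
    print(threshold(1, sample))         # threshold for expected size 1
```

On a finite sample the empirical function is a step function, so the infimum need not be attained; this is the kind of degeneracy that the continuity requirement on $G_{w}$ rules out at the population level.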

The regression function is well defined: to see this, notice that for every $w\in W$ and every $x\in\mathbb{R}^{d}$ we have by construction

[TABLE]

and the combination of both with $v=1/K$ implies that $\sum_{k=1}^{K}p_{k}^{w}(x)=1$ . Moreover, as long as $\sup_{x\in\mathcal{X}_{i}}\varphi(x)\leq v/2$ for every $i=-m,\ldots,-1,1,\ldots,m$ we have for every $x\in\mathbb{R}^{d}$

[TABLE]

and by construction of the function $g$ we have for every $k=1,\ldots,\beta-1$ , every $x\in\mathbb{R}^{d}$ and every $w\in W$

[TABLE]

due to the choice of $c^{\prime},v$ we have

[TABLE]

Similarly, for every $k=\beta+2,\ldots,K$ , every $x\in\mathbb{R}^{d}$ and every $w\in W$

[TABLE]

and with the choice of $v,c^{\prime}$ specified above and the constraint $\beta\leq\lfloor K/2\rfloor$ we have

[TABLE]

Thus, the construction above defines some regression function for every $w\in W$ .

The regression function is $(\gamma,L)$ -Hölder: this implication follows immediately from the construction of $\varphi,\xi,g$ .

Continuity of $G(t)$ : first let us show that $\int_{\mathbb{R}^{d}}{\bf 1}_{\{p_{k}^{w}(x)\geq t\}}\mu(x)dx$ is continuous for every $k\in[K]$ . For $k=1,\ldots,\beta-1,\beta+2,\ldots,K$ the continuity follows from the fact that $g$ is not concentrated around any constant. For $k=\beta,\beta+1$ we first write

[TABLE]

thus for this choice of $k$ the continuity follows from the fact that $\varphi$ and $g$ are not concentrated around any constant.

Threshold $G^{-1}(\beta)=v$ : to see this notice that for every $w\in W$ ,

[TABLE]

and the condition on the threshold follows from the continuity of $G(\cdot)$ . Besides, the corresponding $\beta$ -Oracle sets $\Gamma^{*}_{w}$ are given for every $w\in W$ as

[TABLE]

The strong density assumption: the strong density assumption can be checked following the proof of (Audibert and Tsybakov, 2007, Theorem 3.5) where an analogous construction of the marginal distribution was considered.

$\alpha$ -margin assumption: for all $t\leq t_{0}\vcentcolon=v/4$ , all $k\in[K]\setminus\{\beta,\beta+1\}$ and all $w\in W$ we have

[TABLE]

thus for $k\in[K]\setminus\{\beta,\beta+1\}$ the margin assumption is satisfied. It remains to check that the margin assumption is satisfied for $k\in\{\beta,\beta+1\}$ . Fix an arbitrary $w\in W$ and $k=\beta$ , then for all $t\leq t_{0}$ we can write

[TABLE]

We separately upper-bound both terms which appear on the right hand side of the equality.

[TABLE]

clearly there exists a constant $C$ such that for all $x\in\mathcal{B}(0,1/2)$ we have

[TABLE]

Therefore for some constant $C>0$ we can write

[TABLE]

thanks to the strong density assumption we can write for some $C>0$

[TABLE]

Thus since $1-\gamma/2\lceil\frac{\gamma}{2}\rceil\geq 0$ and $d/2\lceil\frac{\gamma}{2}\rceil\geq\alpha$ we can write for some $C>0$

[TABLE]

To finish this part it remains to upper-bound the other term in the margin assumption

[TABLE]

using the fact that the function $\varphi(x)$ for all $x\in\mathcal{B}(0,(4q)^{-1})$ satisfies

[TABLE]

we can write for all $t\leq C_{\varphi}q^{-\gamma}$

[TABLE]

moreover, for all $t\geq 2C_{\varphi}q^{-\gamma}$ we can write

[TABLE]

and finally for $t\in(C_{\varphi}q^{-\gamma},2C_{\varphi}q^{-\gamma})$ we can write

[TABLE]

The above implies that for some constant $C>0$ we have for all $t\leq t_{0}$

[TABLE]

Thus the margin assumption is satisfied as long as

•

$\tau m=\mathcal{O}(q^{-\gamma\alpha})$ ;

•

$2\lceil\frac{\gamma}{2}\rceil\alpha\leq d$ .

Similarly, one can check that the margin assumption is satisfied for $k=\beta+1$ .

Bound on the KL-divergence: we are now in a position to upper-bound the KL-divergence between any two hypotheses. Fix some $w,w^{\prime}\in W$ , then using the upper bound on $\varphi(\cdot)$ we can write for some $C>0$

[TABLE]

How many hypotheses to take: let us recall the following result, which is a version of the Varshamov-Gilbert bound (Gilbert, 1952; Varshamov, 1957).

Lemma C.1.

Let $\delta(w,w^{\prime})$ denote the Hamming distance between $w,w^{\prime}\in W$ given by

[TABLE]

There exists $\mathcal{W}\subset W$ such that for all $w\neq w^{\prime}\in\mathcal{W}$ we have

[TABLE]

and $\log\left|\mathcal{W}\right|\geq\frac{m}{8}$ .
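The existence statement can be explored numerically for small $m$ . The sketch below greedily extracts a subset of $\{-1,1\}^{m}$ whose elements are pairwise separated in Hamming distance and then compares its log-cardinality with $m/8$ . The separation level `min_dist` is an illustrative parameter chosen for the example, and the greedy procedure is not the argument used to prove the lemma; the function names are hypothetical.

```python
import math
from itertools import combinations, product


def hamming(w1, w2):
    """Number of coordinates in which the two sign vectors differ."""
    return sum(1 for a, b in zip(w1, w2) if a != b)


def greedy_packing(m, min_dist):
    """Scan {-1, 1}^m in lexicographic order and keep every vector that is
    at Hamming distance at least min_dist from all vectors kept so far."""
    chosen = []
    for w in product((-1, 1), repeat=m):
        if all(hamming(w, c) >= min_dist for c in chosen):
            chosen.append(w)
    return chosen


if __name__ == "__main__":
    m = 8
    min_dist = 2  # illustrative separation level (assumption of this sketch)
    family = greedy_packing(m, min_dist)
    smallest = min(hamming(a, b) for a, b in combinations(family, 2))
    print("size of the family:", len(family))
    print("smallest pairwise distance:", smallest)
    print("log-cardinality:", math.log(len(family)), "vs m/8 =", m / 8)
```

The brute-force scan visits all $2^{m}$ sign vectors, so it is only meant for small $m$ ; the lemma itself guarantees existence for the values of $m$ needed in the proof without any enumeration.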

Denote by $\mathcal{W}\subset W$ the set provided by Lemma C.1 and by $\mathcal{P}_{\mathcal{W}}$ the set of distributions $\mathbb{P}_{w}$ with $w\in\mathcal{W}$ . Taking into account all the above, we conclude that $\mathcal{P}_{\mathcal{W}}$ satisfies the assumptions of our result.

Lower bound on the Hamming risk (applying Birgé’s Lemma A.1): finally, we are in a position to lower-bound the Hamming risk. Recall that we are interested in the following quantity

[TABLE]

The rest of the proof follows standard arguments, which again using the de Finetti notation read as

[TABLE]

Denote by $\hat{w}$ the following minimizer

[TABLE]

thus if $w\neq\hat{w}$ we can write using the definition of $\hat{w}$ and the triangle inequality

[TABLE]

These arguments and Birgé’s Lemma A.1 imply that

[TABLE]

Since the marginal distribution of the vector $X\in\mathbb{R}^{d}$ is shared among the hypotheses, using the upper-bound on the $\operatorname{KL}$ -divergence and the conditions on $\mathcal{W}$ we get for some $C>0$

[TABLE]

Finally, setting $q=\lfloor\bar{C}n^{1/(2\gamma+d)}\rfloor$ , $\tau=\lfloor C^{\prime}q^{-d}\rfloor$ and $m=\lfloor C^{\prime\prime}q^{d-\alpha\gamma}\rfloor$ for some $\bar{C},C^{\prime},C^{\prime\prime}>0$ small enough, we get for some $C>0$ and $c<1$

[TABLE]

One can easily verify that this choice of the parameters $\tau,m,q$ is possible as long as $2\lceil\frac{\gamma}{2}\rceil\alpha\leq d$ , and clearly with our choice we have $\tau m=\mathcal{O}(q^{-\alpha\gamma})$ . As already mentioned, the lower bound for the excess risk and the discrepancy follows from Propositions 3.1 and 3.2.
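The order of $\tau m$ under this choice can be spelled out as follows. The computation ignores the integer parts and the constants $\bar{C},C^{\prime},C^{\prime\prime}$ , so it is a sketch of the scaling only.

```latex
% Scaling of the parameters, up to constants and integer parts:
%   q \asymp n^{1/(2\gamma+d)}, \quad \tau \asymp q^{-d}, \quad m \asymp q^{d-\alpha\gamma}.
\begin{align*}
\tau m
  &\asymp q^{-d}\, q^{d-\alpha\gamma}
   = q^{-\alpha\gamma}
   \asymp \bigl(n^{1/(2\gamma+d)}\bigr)^{-\alpha\gamma}
   = n^{-\alpha\gamma/(2\gamma+d)} ,\\
m
  &\asymp q^{d-\alpha\gamma} \leq q^{d}
   \qquad\text{since } \alpha\gamma\geq 0 .
\end{align*}
```

The first line recovers the condition $\tau m=\mathcal{O}(q^{-\alpha\gamma})$ required by the margin assumption and shows that $\tau m$ is of order $n^{-\alpha\gamma/(2\gamma+d)}$ , while the second line is consistent with the requirement $m\leq q^{d}$ imposed at the start of the construction.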

Appendix D Inconsistency of top- $\beta$ approach

In this section we prove Proposition 3.3. The proof relies on an explicit construction of a distribution $\mathbb{P}$ whose $\beta$ -Oracle satisfies $\lvert\Gamma^{*}_{\beta}(x)\rvert>\beta$ for all $x$ in some $A\subset\mathbb{R}^{d}$ with $\mathbb{P}_{X}(A)>0$ . Clearly, if such a distribution exists then there is no estimator in $\hat{\Upsilon}_{\beta}$ that would consistently estimate this $\beta$ -Oracle. Let $\beta\in[0,\ldots,\lfloor K/2\rfloor-1]$ be a fixed integer and $K\geq 3$ . For the proof of the proposition we shall construct one distribution $\mathbb{P}$ for which none of the estimators in $\hat{\Upsilon}_{\beta}$ can perform well. We start the construction by specifying the density $\mu$ of the marginal distribution $\mathbb{P}_{X}$ of $X\in\mathbb{R}^{d}$ . For positive $r\leq r^{\prime}$ define a disk in $\mathbb{R}^{d}$ as $\mathcal{D}(r,r^{\prime})=\left\{x\in\mathbb{R}^{d}\,:\,r\leq\left\lVert x\right\rVert\leq r^{\prime}\right\}$ . First of all, fix some parameters $r_{1}<r_{2}<2r_{2}<r_{3}<2r_{3}$ which are independent of $n,N$ . The density $\mu$ is supported on $\mathcal{B}(0,r_{1})\cup\mathcal{D}(r_{2},2r_{2})\cup\mathcal{D}(r_{3},2r_{3})$ .

Moreover,

•

$\mu(x)=\frac{\tfrac{\beta}{\beta+1}-\operatorname{Leb}(\mathcal{B}(0,r_{1}))}{\operatorname{Leb}\left(\mathcal{D}(r_{2},2r_{2})\right)}$ for all $x\in\mathcal{D}(r_{2},2r_{2})$ ,

•

$\mu(x)=\frac{1}{(\beta+1)\operatorname{Leb}\left(\mathcal{D}(r_{3},2r_{3})\right)}$ for all $x\in\mathcal{D}(r_{3},2r_{3})$ ,

•

$\mu(x)=1$ , for all $x\in\mathcal{B}(0,r_{1})$ ,

•

$\mu(x)=0$ otherwise,

where $r_{1}>0$ is chosen small enough to ensure that $\tfrac{\beta}{\beta+1}-\operatorname{Leb}(\mathcal{B}(0,r_{1}))>0$ . The regression function $p(\cdot)=(p_{1}(\cdot),\ldots,p_{K}(\cdot))^{\top}$ is defined as

[TABLE]

where the constant $C_{L}$ is chosen small enough to ensure that these functions are $(\gamma,L)$ -Hölder and have sufficiently small variation. Consider an arbitrary infinitely many times differentiable function $v:\mathbb{R}\mapsto[0,1]$ which satisfies $v(x)=0$ for all $x\leq 0$ and $v(x)=1$ for all $x\geq 1$ . Then, the functions $g(\cdot)$ and $\xi(\cdot)$ are defined as $g(x)=\frac{1}{2}v\left(\frac{\left\lVert x\right\rVert-r_{1}}{r_{2}-r_{1}}\right)$ , $\xi(x)=\frac{3}{4}v\left(\frac{\left\lVert x\right\rVert-2r_{2}}{r_{3}-2r_{2}}\right)$ . The above construction defines a distribution $\mathbb{P}$ for which we have

[TABLE]

Indeed, let us evaluate the following quantity under the assumption that $\beta\leq\lfloor K/2\rfloor-1$

[TABLE]

Thus, using this distribution we can write for any classifier $\hat{\Gamma}\in\hat{\Upsilon}_{\beta}$ with fixed cardinality

[TABLE]

where the first inequality follows from the observation that for $x\in\mathcal{D}(r_{2},2r_{2})$ there is always at least one label $k$ such that $k\in\hat{\Gamma}(x)\triangle\Gamma^{*}(x)$ . Thus, since the constant $C_{L}$ is chosen to satisfy $2C_{L}/(\beta+1)\leq 1/4(\beta+1)$ we have for any $\hat{\Gamma}\in\hat{\Upsilon}_{\beta}$

[TABLE]

If $r_{1}$ is such that $\operatorname{Leb}(\mathcal{B}(0,r_{1}))\leq\tfrac{\beta}{2(\beta+1)}$ we get

[TABLE]

By construction, the regression vector is $(\gamma,L)$ -Hölder and the density is lower- and upper-bounded by some positive constants. Hence, it remains to check that the constructed distribution satisfies the $\alpha$ -margin assumption. This can be achieved by an appropriate choice of $r_{1}$ . Indeed, on the sets $\mathcal{D}(r_{2},2r_{2})\cup\mathcal{D}(r_{3},2r_{3})$ there is a “corridor” of constant size between the regression functions and the threshold $G^{-1}(\beta)$ . The threshold $G^{-1}(\beta)$ is only approached by the regression function on the set $\mathcal{B}(0,r_{1})$ . As all the parameters in our construction are independent of $n,N\in\mathbb{N}$ , we can find a value of $r_{1}$ small enough so that the $\alpha$ -margin assumption is verified for a fixed $\alpha>0$ .
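As a sanity check on the marginal density $\mu$ defined at the beginning of this appendix, the following sketch computes the mass that $\mu$ assigns to $\mathcal{B}(0,r_{1})$ , $\mathcal{D}(r_{2},2r_{2})$ and $\mathcal{D}(r_{3},2r_{3})$ and verifies that the three masses sum to one. The dimension, the radii and the value of $\beta$ used below are arbitrary illustrative choices, and the function names are hypothetical.

```python
import math


def ball_volume(radius, d):
    """Lebesgue measure of a Euclidean ball of the given radius in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d


def disk_volume(r_in, r_out, d):
    """Lebesgue measure of D(r_in, r_out) = {x : r_in <= ||x|| <= r_out}."""
    return ball_volume(r_out, d) - ball_volume(r_in, d)


def region_masses(beta, d, r1, r2, r3):
    """Mass of mu on each of the three pieces of its support."""
    if not (0 < r1 < r2 and 2 * r2 < r3):
        raise ValueError("need 0 < r1 < r2 < 2 r2 < r3")
    leb_ball = ball_volume(r1, d)
    if leb_ball >= beta / (beta + 1):
        raise ValueError("r1 is too large for the density to be positive")
    # Constant density values on the two disks; the density is 1 on the ball.
    mu_mid = (beta / (beta + 1) - leb_ball) / disk_volume(r2, 2 * r2, d)
    mu_out = 1.0 / ((beta + 1) * disk_volume(r3, 2 * r3, d))
    return {
        "ball": 1.0 * leb_ball,
        "middle": mu_mid * disk_volume(r2, 2 * r2, d),
        "outer": mu_out * disk_volume(r3, 2 * r3, d),
    }


if __name__ == "__main__":
    masses = region_masses(beta=1, d=2, r1=0.1, r2=1.0, r3=3.0)
    print(masses, "total:", sum(masses.values()))
```

The mass of $\mathcal{D}(r_{2},2r_{2})$ equals $\tfrac{\beta}{\beta+1}-\operatorname{Leb}(\mathcal{B}(0,r_{1}))$ , so it is at least $\tfrac{\beta}{2(\beta+1)}$ whenever $\operatorname{Leb}(\mathcal{B}(0,r_{1}))\leq\tfrac{\beta}{2(\beta+1)}$ , which is the condition on $r_{1}$ used in the last step of the proof.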

Bibliography


1. Anbar, D. (1977). A Modified Robbins-Monro Procedure Approximating the Zero of a Regression Function from Below. Ann. Statist. 5 229–234.
2. Audibert, J.-Y. and Tsybakov, A. (2007). Fast learning rates for plug-in classifiers. Ann. Statist. 35 608–633.
3. Bartlett, P. and Wegkamp, M. (2008). Classification with a reject option using a hinge loss. J. Mach. Learn. Res. 9 1823–1840.
4. Bellec, P. C., Dalalyan, A. S., Grappin, E. and Paris, Q. (2018). On the prediction loss of the lasso in the partially labeled setting. Electron. J. Statist. 12 3443–3472.
5. Birgé, L. (2005). A new lower bound for multiple hypothesis testing. IEEE Trans. Inform. Theory 51.
6. Bobkov, S. and Ledoux, M. (2016). One-dimensional empirical measures, order statistics and Kantorovich transport distances. To appear in the Memoirs of the Amer. Math. Soc.
7. Brown, L. and Low, M. (1996). A constrained risk inequality with applications to nonparametric functional estimation. Ann. Statist. 24 2524–2535.
8. Chow, C.-K. (1957). An optimum character recognition system using decision functions. IRE Transactions on Electronic Computers 4 247–254.
