Optimal rates for F-score binary classification

Evgenii Chzhen (LAMA)

arXiv:1905.04039·math.ST·May 13, 2019

TL;DR

This paper establishes optimal minimax rates for binary classification using F-score under smoothness and margin assumptions, proposing a semi-supervised method that efficiently estimates the classifier with proven optimality.

Contribution

It introduces a semi-supervised classification procedure for F-score maximization that achieves minimax optimal rates under smoothness and margin conditions.

Findings

01

Achieves the rate $O(n^{-(1+\alpha)\beta/(2\beta+d)})$ for excess F-score.

02

Establishes the optimality of the proposed rates in a minimax sense.

03

Shows that unlabeled data size does not affect convergence rates.

Abstract

We study the minimax settings of binary classification with F-score under the $\beta$ -smoothness assumptions on the regression function $\eta(x)=\mathbb{P}(Y=1|X=x)$ for $x\in\mathbb{R}^{d}$ . We propose a classification procedure which under the $\alpha$ -margin assumption achieves the rate $\mathcal{O}(n^{-(1+\alpha)\beta/(2\beta+d)})$ for the excess F-score. In this context, the Bayes optimal classifier for the F-score can be obtained by thresholding the aforementioned regression function $\eta$ on some level $\theta^{*}$ to be estimated. The proposed procedure is performed in a semi-supervised manner, that is, for the estimation of the regression function we use a labeled dataset of size $n\in\mathbb{N}$ and for the estimation of the optimal threshold $\theta^{*}$ we use an unlabeled dataset of size $N\in\mathbb{N}$ . Interestingly, the value of $N\in\mathbb{N}$ does not affect the rate…

Equations

$$F_{b}(g)\coloneqq\frac{\mathbb{P}(Y=1,\,g(X)=1)}{b^{2}\,\mathbb{P}(Y=1)+\mathbb{P}(g(X)=1)}.$$

$$g^{*}\in\mathop{\mathrm{arg\,max}}_{g\in\mathcal{G}}F_{b}(g).$$

$$\theta\mapsto\theta\,\mathbb{P}(Y=1)-\mathbb{E}(\eta(X)-\theta)_{+}.$$

$$g^{*}(x)=\mathds{1}_{\left\{\eta(x)>\theta^{*}\right\}},$$

$$b^{2}\theta^{*}\,\mathbb{P}(Y=1)=\mathbb{E}(\eta(X)-\theta^{*})_{+}.$$

$$\mathcal{E}_{b}(g)\coloneqq F_{b}(g^{*})-F_{b}(g)\qquad\text{(excess score)}.$$

$$\mathcal{E}_{b}(g)=\frac{\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq g(X)\right\}}}{b^{2}\,\mathbb{P}(Y=1)+\mathbb{P}(g(X)=1)}.$$

$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta\right)\leq C_{0}\delta^{\alpha}.$$

$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta_{0}\right)=0,$$

$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta\right)\leq c_{0}\delta^{\alpha},$$

$$\mathbb{P}^{\otimes n}\left(\left\lvert\hat{\eta}(x)-\eta(x)\right\rvert\geq t\right)\leq C_{1}\exp\left(-C_{2}a_{n}t^{2}\right)\quad\text{a.s. }\mathbb{P}_{X},$$

$$\hat{g}(x)=\mathds{1}_{\left\{\hat{\eta}(x)>\hat{\theta}\right\}},$$

$$\theta\,\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\hat{\eta}(X_{i})=\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}(\hat{\eta}(X_{i})-\theta)_{+}.$$

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\left\lvert\theta^{*}-\hat{\theta}\right\rvert\leq C\left(a_{n}^{-1/2}+N^{-1/2}\right).$$

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathcal{E}_{1}(\hat{g})\leq C\left(a_{n}^{-\frac{1+\alpha}{2}}+N^{-\frac{1+\alpha}{2}}\right),$$

$$\left\lvert\hat{\theta}-\theta^{*}\right\rvert\mathbb{P}(Y=1)\leq\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\eta(X)\leq t)-\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\mathds{1}_{\left\{\hat{\eta}(X_{i})\leq t\right\}}\right\rvert dt.$$

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\frac{\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq\hat{g}(X)\right\}}}{\mathbb{P}(Y=1)+\mathbb{P}(\hat{g}(X)=1)}\leq\frac{1}{p}\,\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq\hat{g}(X)\right\}}.$$

$$\left\lvert\eta(x)-\theta^{*}\right\rvert\leq\left\lvert\eta(x)-\hat{\eta}(x)\right\rvert+\left\lvert\theta^{*}-\hat{\theta}\right\rvert,$$

$$\left\lvert\eta(x)-\theta^{*}\right\rvert\leq 2\left\lvert\eta(x)-\hat{\eta}(x)\right\rvert\quad\text{or}\quad\left\lvert\eta(x)-\theta^{*}\right\rvert\leq 2\left\lvert\theta^{*}-\hat{\theta}\right\rvert.$$

$$\mathcal{E}_{1}(\hat{g})\leq\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 2\left\lvert\eta(X)-\hat{\eta}(X)\right\rvert\right\}}}_{T_{1}}+\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 2\left\lvert\theta^{*}-\hat{\theta}\right\rvert\right\}}}_{T_{2}}.$$

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}T_{1}\leq C^{\prime}a_{n}^{-\frac{1+\alpha}{2}}.$$

$$T_{2}\leq\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{E\right\}},$$

$$E^{1},\qquad E^{2}$$

$$T_{2}\leq\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{E^{1}\right\}}}_{T_{2}^{1}}+\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{E^{2}\right\}}}_{T_{2}^{2}},$$

$$\sup_{t\in[0,1]}\left\lvert\mathbb{P}_{X}(\hat{\eta}(X)\leq t)-\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\mathds{1}_{\left\{\hat{\eta}(X_{i})\leq t\right\}}\right\rvert,$$

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}T_{2}^{1}\leq C^{\prime\prime}N^{-\frac{1+\alpha}{2}}.$$

$$T_{2}^{2}\leq\frac{4}{p^{2}}\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\hat{\eta}(X)\leq t)-\mathbb{P}_{X}(\eta(X)\leq t)\right\rvert dt\;\mathbb{P}(E^{2}),$$

$$T_{2}^{2}\leq\frac{C_{0}4^{1+\alpha}}{p^{2+\alpha}}\left(\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\hat{\eta}(X)\leq t)-\mathbb{P}_{X}(\eta(X)\leq t)\right\rvert dt\right)^{1+\alpha},$$

$$T_{2}^{2}\leq\frac{C_{0}4^{1+\alpha}}{p^{2+\alpha}}\left(\mathbb{E}_{\mathbb{P}_{X}}\left\lvert\eta(X)-\hat{\eta}(X)\right\rvert\right)^{1+\alpha}.$$

Taxonomy

Topics: Statistical Methods and Inference · Advanced Statistical Methods and Models · Machine Learning and Algorithms

Full text

Optimal rates for F-score binary classification

Evgenii Chzhen

LAMA, Université Paris-Est, Cité Descartes, 5 boulevard Descartes, 77454 Marne-la-Vallée cedex 2

Abstract

We study the minimax settings of binary classification with F-score under the $\beta$ -smoothness assumptions on the regression function $\eta(x)=\mathbb{P}(Y=1|X=x)$ for $x\in\mathbb{R}^{d}$ . We propose a classification procedure which under the $\alpha$ -margin assumption achieves the rate $\mathcal{O}(n^{-(1+\alpha)\beta/(2\beta+d)})$ for the excess F-score. In this context, the Bayes optimal classifier for the F-score can be obtained by thresholding the aforementioned regression function $\eta$ on some level $\theta^{*}$ to be estimated. The proposed procedure is performed in a semi-supervised manner, that is, for the estimation of the regression function we use a labeled dataset of size $n\in\mathbb{N}$ and for the estimation of the optimal threshold $\theta^{*}$ we use an unlabeled dataset of size $N\in\mathbb{N}$ . Interestingly, the value of $N\in\mathbb{N}$ does not affect the rate of convergence, which indicates that it is “harder” to estimate the regression function $\eta$ than the optimal threshold $\theta^{*}$ . This further implies that the binary classification with F-score behaves similarly to the standard settings of binary classification. Finally, we show that the rates achieved by the proposed procedure are optimal in the minimax sense up to a constant factor.

1 Introduction

The problem of binary classification is among the most basic and well-studied problems in statistics and machine learning Vapnik98 ; Yang99 ; Bartlett_Mendelson02 ; Audibert04 ; Massart_Nedelec06 ; Audibert_Tsybakov07 . Until very recently, theoretical guarantees were almost exclusively formulated in terms of the probability of misclassification (equivalently, one minus the accuracy) as the measure of the risk. This choice of the risk is practically suitable in the case of “well-balanced” distributions and datasets, that is, when both classes are observed with similar probabilities.

Once this assumption fails, classifiers based on the accuracy might perform poorly in practice. One possible approach to such a situation is to modify the measure to be optimized in an appropriate way. A popular choice of such a measure is the F-score, whose roots can be traced back to the information retrieval literature Rijsbergen74 ; Lewis95 . From the statistical point of view there are two alternative approaches Ye_Chai_Lee_Chieu12 ; Dembczynski_Kotlowski_Koyejo_Natarajan17 to the theoretical treatment of the F-score: Population Utility (PU) and Expected Test Utility (ETU). In this work we follow the PU approach which, as noted in Dembczynski_Kotlowski_Koyejo_Natarajan17 , has stronger roots in classical statistics. Our goal is to provide a minimax analysis of binary classification with F-score under non-parametric assumptions.

2 The problem formulation

We first introduce some notation that is used throughout this work. For any two real numbers $a,b\in\mathbb{R}$ we denote by $a\wedge b$ (resp. $a\vee b$ ) the minimum (resp. the maximum) between $a$ and $b$ . The standard Euclidean norm in $\mathbb{R}^{d}$ is denoted by $\left\lVert\cdot\right\rVert_{2}$ and a ball centered at $x\in\mathbb{R}^{d}$ of radius $r$ is denoted by $\mathcal{B}(x,r)$ . For positive real valued sequences $a_{n},b_{n}\colon\mathbb{N}\mapsto\mathbb{R}_{+}$ we say that $a_{n}=\mathcal{O}(b_{n})$ if there exists some positive constant $M>0$ such that for all $n\in\mathbb{N}$ it holds that $a_{n}/b_{n}\leq M$ . We consider a random couple $(X,Y)$ taking values in $\mathbb{R}^{d}\times\{0,1\}$ with joint distribution $\mathbb{P}$ . The vector $X\in\mathbb{R}^{d}$ is the feature vector and the binary variable $Y\in\{0,1\}$ is the label; in what follows we assume that $\mathbb{P}(Y=1)\neq 0$ . We denote by $\mathbb{P}_{X}$ the marginal distribution of the feature vector $X\in\mathbb{R}^{d}$ and by $\eta(X)\coloneqq\mathbb{P}(Y=1|X)$ the regression function. A classifier is any measurable function $g\colon\mathbb{R}^{d}\mapsto\{0,1\}$ and the set of all such functions is denoted by $\mathcal{G}$ .

We assume that we have access to two datasets: the first dataset $\mathcal{D}_{n}=\{(X_{i},Y_{i})\}_{i=1}^{n}$ consists of $n\in\mathbb{N}$ i.i.d. copies of $(X,Y)\sim\mathbb{P}$ ; and the second dataset $\mathcal{D}_{N}=\{X_{i}\}_{i=n+1}^{n+N}$ consists of $N\in\mathbb{N}$ independent copies of $X\sim\mathbb{P}_{X}$ . Denote by $\mathbb{P}^{\otimes n}$ and $\mathbb{P}_{X}^{\otimes N}$ the distributions of $\mathcal{D}_{n}$ and $\mathcal{D}_{N}$ respectively. Moreover, we denote by $\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}$ the expectation with respect to the distribution of $(\mathcal{D}_{n},\mathcal{D}_{N})$ , that is, with respect to $\mathbb{P}^{\otimes n}\otimes\mathbb{P}_{X}^{\otimes N}$ on the space $\left(\mathbb{R}^{d}\times\{0,1\}\right)^{n}\times\left(\mathbb{R}^{d}\right)^{N}$ . We additionally assume that the size of the unlabeled dataset is not smaller than the size of the labeled dataset, that is, $N\geq n$ . (Note that one can always satisfy this assumption by augmenting $\mathcal{D}_{N}$ using a portion of $\mathcal{D}_{n}$ and erasing labels. Typically, in practice it is easier to gather unlabeled data than labeled data, which is why this assumption is rather a formality.) For a given classifier $g\colon\mathbb{R}^{d}\mapsto\{0,1\}$ we define its $\text{F}_{b}$ -score for any $b>0$ by

$$F_{b}(g)\coloneqq\frac{\mathbb{P}(Y=1,\,g(X)=1)}{b^{2}\,\mathbb{P}(Y=1)+\mathbb{P}(g(X)=1)}.$$

(We divide the classical definition of the $\text{F}_{b}$ -score by the factor $1+b^{2}$ to simplify the notation; thus, it is sufficient to multiply the obtained results by $1+b^{2}$ to recover the results for the classical definition of the $\text{F}_{b}$ -score.)
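As a concrete illustration, the $\text{F}_{b}$ -score in the paper's normalization can be evaluated on a finite sample by replacing each probability with an empirical frequency. The sketch below is not part of the paper: the function name `f_b_score` and the convention of returning 0 when the denominator vanishes are assumptions made for this example.

```python
from typing import Sequence


def f_b_score(y_true: Sequence[int], y_pred: Sequence[int], b: float = 1.0) -> float:
    """Empirical F_b-score in the paper's normalization.

    This is the classical F_b-score divided by (1 + b^2), so its maximal
    value is 1 / (1 + b^2) rather than 1.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    # Empirical counterparts of P(Y=1, g(X)=1), P(Y=1) and P(g(X)=1).
    joint = sum(1 for y, g in zip(y_true, y_pred) if y == 1 and g == 1) / n
    pos_rate = sum(y_true) / n
    pred_rate = sum(y_pred) / n
    denom = b ** 2 * pos_rate + pred_rate
    # Assumed convention for this sketch: score 0 if nothing is positive.
    return joint / denom if denom > 0 else 0.0


labels = [1, 1, 0, 0]
predictions = [1, 0, 1, 0]
score = f_b_score(labels, predictions)   # normalized F_1
classical_f1 = (1 + 1.0 ** 2) * score    # rescaled to the classical F_1
```

With one true positive, one false positive and one false negative, the classical $\text{F}_{1}$ -score is $2/(2+1+1)=1/2$ , which matches the rescaled value.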

A Bayes-optimal classifier $g^{*}\colon\mathbb{R}^{d}\mapsto\{0,1\}$ is any classifier that maximizes the F-score over all classifiers $\mathcal{G}$ , that is,

$$g^{*}\in\mathop{\mathrm{arg\,max}}_{g\in\mathcal{G}}F_{b}(g).$$

It was established by Zhao_Edakunni_Pocock_Brown13 that a maximizer of the $\text{F}_{1}$ -score can be obtained by comparing the regression function $\eta(X)$ with a threshold $\theta^{*}\in[0,1]$ . Importantly, this threshold depends explicitly on the distribution $\mathbb{P}$ and can be obtained as the unique root of

$$\theta\mapsto\theta\,\mathbb{P}(Y=1)-\mathbb{E}(\eta(X)-\theta)_{+}.$$

One of the contributions of this work is an extension of the result of (Zhao_Edakunni_Pocock_Brown13, Section 6) to an arbitrary value of $b>0$ .

Theorem 2.1

A Bayes-optimal classifier $g^{*}$ can be obtained point-wise for all $x\in\mathbb{R}^{d}$ as

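$$g^{*}(x)=\mathds{1}_{\left\{\eta(x)>\theta^{*}\right\}},$$

where $\theta^{*}\in[0,1]$ is a threshold which satisfies

$$b^{2}\theta^{*}\,\mathbb{P}(Y=1)=\mathbb{E}(\eta(X)-\theta^{*})_{+}.$$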

Moreover, the classifier $g^{*}$ satisfies $F_{b}(g^{*})=\theta^{*}$ .

The proof can be found in Appendix A. Notice that if the optimal threshold $\theta^{*}\in[0,1]$ is known a priori, the problem of binary classification with the F-score is no harder than the standard setting of binary classification with the accuracy as the measure of performance. As the threshold $\theta^{*}\in[0,1]$ depends on the distribution $\mathbb{P}$ , it has to be estimated from data. Theorem 2.1 allows us to obtain a trivial upper bound on the threshold $\theta^{*}$ : since $\theta^{*}=F_{b}(g^{*})$ and for any classifier $g\in\mathcal{G}$ the $\text{F}_{b}$ -score is upper bounded by $1/(1+b^{2})$ , we have $\theta^{*}\in[0,1/(1+b^{2})]$ .

For any classifier $g\colon\mathbb{R}^{d}\mapsto\{0,1\}$ we define its excess score as

$$\mathcal{E}_{b}(g)\coloneqq F_{b}(g^{*})-F_{b}(g)\qquad\text{(excess score)}.$$

The excess score is the central object of our analysis and one of our goals is to provide an estimator whose excess score is as small as possible. Using Theorem 2.1 we can show that the excess score of any classifier $g\colon\mathbb{R}^{d}\mapsto\{0,1\}$ can be written in a simple form.

Lemma 1

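Let $g\colon\mathbb{R}^{d}\mapsto\{0,1\}$ be any classifier and assume that $\mathbb{P}(Y=1)\neq 0$ , then

$$\mathcal{E}_{b}(g)=\frac{\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq g(X)\right\}}}{b^{2}\,\mathbb{P}(Y=1)+\mathbb{P}(g(X)=1)}.$$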
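Theorem 2.1 and Lemma 1 can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions, not part of the paper: it assumes a hypothetical feature taking four values with weights `w` and regression values `eta` (made-up numbers), finds $\theta^{*}$ by bisection, enumerates all $2^{4}$ classifiers, and compares both sides of the identity in Lemma 1.

```python
from itertools import product

# Hypothetical toy distribution: X takes four values with probabilities w,
# and eta[i] = P(Y = 1 | X = x_i). All numbers are illustrative.
w = [0.1, 0.2, 0.3, 0.4]
eta = [0.9, 0.6, 0.3, 0.05]
b = 1.0
p1 = sum(wi * ei for wi, ei in zip(w, eta))  # P(Y = 1)


def f_b(g):
    """F_b-score of a classifier g given as a tuple of 0/1 decisions."""
    num = sum(wi * ei * gi for wi, ei, gi in zip(w, eta, g))
    den = b ** 2 * p1 + sum(wi * gi for wi, gi in zip(w, g))
    return num / den


def optimal_threshold(iters=200):
    """Root of theta -> b^2 theta P(Y=1) - E(eta(X) - theta)_+ on [0, 1]."""
    def phi(t):
        return b ** 2 * t * p1 - sum(wi * max(ei - t, 0.0) for wi, ei in zip(w, eta))
    lo, hi = 0.0, 1.0  # phi(0) < 0 < phi(1) and phi is increasing
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def lemma1_rhs(g, theta, g_opt):
    """Right-hand side of Lemma 1 for the classifier g."""
    num = sum(wi * abs(ei - theta) for wi, ei, gi, go in zip(w, eta, g, g_opt) if gi != go)
    den = b ** 2 * p1 + sum(wi * gi for wi, gi in zip(w, g))
    return num / den


theta_star = optimal_threshold()
g_star = tuple(int(e > theta_star) for e in eta)
classifiers = list(product((0, 1), repeat=len(w)))
best_score = max(f_b(g) for g in classifiers)
max_gap = max(abs((theta_star - f_b(g)) - lemma1_rhs(g, theta_star, g_star))
              for g in classifiers)
```

For these numbers the thresholded rule `g_star` attains the maximal score, that score equals `theta_star`, and `max_gap` is zero up to floating-point error, in line with Theorem 2.1 and Lemma 1.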

In general the Bayes optimal rule is not unique; Theorem 2.1 only states that one of the optimal classifiers has the form described by its statement. Even though the function $\theta\mapsto b^{2}\theta\mathbb{P}(Y=1)-\mathbb{E}(\eta(X)-\theta)_{+}$ has a unique root (see Appendix C for the proof), other thresholds may result in the same Bayes rule. Indeed, consider a simple example with $\eta(x)\equiv 1/2$ , $b=1$ ; then it is easy to see that the solution $\theta^{*}$ of $\theta/2=(1/2-\theta)_{+}$ is exactly $1/3$ , and every Bayes optimal classifier predicts one almost surely. Clearly, any threshold $\theta\in[0,1/2)$ of the regression function $\eta$ results in the same classifier. Importantly, Lemma 1 and the equality $\max_{g\in\mathcal{G}}F_{1}(g)=\theta^{*}$ are valid only for the threshold $\theta^{*}=1/3$ . In this work, we shall always refer to $\theta^{*}$ as the solution of $b^{2}\theta\mathbb{P}(Y=1)=\mathbb{E}(\eta(X)-\theta)_{+}$ and we call this threshold the optimal threshold.

Remark 1

For the rest of the paper, we focus our attention only on the value $b=1$ to simplify the presentation. It will be clear from our arguments that the generalization of the theoretical results of the paper to an arbitrary value $b>0$ follows straightforwardly from our analysis.

Interestingly, the results above demonstrate that the problem of binary classification with F-score has a lot in common with the standard setting. Indeed, in both cases the Bayes optimal classifier is obtained via thresholding of the regression function and the expression for the excess risk is also similar. Consequently, in this work we address the following questions:

Q1. Is the problem of binary classification with F-score harder than its better-known counterpart? In particular, can the minimax analysis of Audibert_Tsybakov07 be extended to this setting and what is an optimal algorithm?

Q2. In view of recent results of Chzhen_Denis_Hebiri19 , can the introduction of an unlabeled dataset improve classification algorithms in the context of F-score?

Let us point out that Lemma 1 is crucial for our analysis, as it allows us to adapt the scheme provided by Audibert_Tsybakov07 for the standard setting of binary classification. Nevertheless, as the threshold $\theta^{*}\in[0,1]$ is unknown beforehand, this machinery cannot be applied in a straightforward way and some effort is required. In this work, we pose assumptions on the distribution $\mathbb{P}$ similar to the ones used in Audibert_Tsybakov07 .

Assumption 1 ( $\alpha$ -margin assumption)

We say that the distribution $\mathbb{P}$ of the pair $(X,Y)\in\mathbb{R}^{d}\times\{0,1\}$ satisfies the $\alpha$ -margin assumption if there exist constants $C_{0}>0$ , $\delta_{0}\in(0,1/12]$ and $\alpha>0$ such that for every positive $\delta\leq\delta_{0}$ we have

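$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta\right)\leq C_{0}\delta^{\alpha}.$$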

The case of “ $\alpha=\infty$ ” is understood in the following manner Massart_Nedelec06 : there exists a constant $\delta_{0}\in(0,1]$ such that

$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta_{0}\right)=0,$$

typically this is the most advantageous situation for binary classification, as the regression function $\eta$ is separated from the optimal threshold $\theta^{*}$ . Assumption 1 specifies the concentration rate of the regression function $\eta$ around the optimal threshold $\theta^{*}$ . Notice that if Assumption 1 is satisfied, it holds that for all $\delta>0$

$$\mathbb{P}_{X}\left(0<\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\delta\right)\leq c_{0}\delta^{\alpha},$$

where $c_{0}=C_{0}\vee\delta_{0}^{-\alpha}$ . This condition is tightly related to the rate of convergence in the case of the binary classification Audibert_Tsybakov07 ; Massart_Nedelec06 . The classification algorithm that is proposed in this work is based on a direct estimation of the regression function $\eta$ and the optimal threshold $\theta^{*}$ .

In the sequel, we consider the case of non-parametric estimation, that is, we assume that the regression function $\eta\colon\mathbb{R}^{d}\mapsto[0,1]$ lies in some class of $\beta$ -smooth functions and the marginal distribution $\mathbb{P}_{X}$ of $X\in\mathbb{R}^{d}$ admits a density w.r.t. the Lebesgue measure which is supported on a well-behaved compact set and uniformly lower- and upper-bounded. The exact formal description of these assumptions is given in Section 4.2, where we prove optimality of our rates. For now, it is sufficient to assume that there exists a good estimator $\hat{\eta}$ of the regression function $\eta$ based on the labeled set $\mathcal{D}_{n}$ .

Assumption 2 (Existence of estimator)

There exists an estimator $\hat{\eta}$ based on $\mathcal{D}_{n}$ which satisfies for all $t>0$

$$\mathbb{P}^{\otimes n}\left(\left\lvert\hat{\eta}(x)-\eta(x)\right\rvert\geq t\right)\leq C_{1}\exp\left(-C_{2}a_{n}t^{2}\right)\quad\text{a.s. }\mathbb{P}_{X},$$

for some universal constants $C_{1},C_{2}>0$ and an increasing sequence $a_{n}\colon\mathbb{N}\mapsto\mathbb{R}_{+}$ .

For instance, in the case of a $\beta$ -smooth regression function $\eta\colon\mathbb{R}^{d}\mapsto[0,1]$ (typically, one also needs to assume that the marginal distribution $\mathbb{P}_{X}$ is well behaved, see Section 4.2), a typical non-parametric rate is $a_{n}=n^{{2\beta}/{(2\beta+d)}}$ and it can be achieved by the local polynomial estimator, see (Audibert_Tsybakov07, Theorem 3.2). Finally, in this work we assume that the probability $\mathbb{P}(Y=1)$ is lower bounded by some constant which can be arbitrarily small but fixed.

Assumption 3 (Lower bounded $\mathbb{P}(Y=1)$ )

We assume that there exists a positive constant $p$ such that $p\leq\mathbb{P}(Y=1)$ .

It is assumed that the constants $C_{0},C_{1},C_{2},p$ are independent of both $n,N\in\mathbb{N}$ , however these constants can depend on the dimension of the problem $d$ , on the value of $\alpha>0$ as well as on each other. The values of the constants $C_{0},C_{1},C_{2},p$ are not going to impact the rates of convergence, though they might and will enter as numerical constants in front of the rate. In contrast, the value of $\alpha$ in the margin assumption will explicitly appear in the obtained rates.

3 Related works and contributions

The literature on binary classification with F-score is rather broad; it spans both applied and theoretical studies of the problem. It should be noted that our work falls into the Population Utility (PU) approach Dembczynski_Kotlowski_Koyejo_Natarajan17 , that is, the expectation is taken in the numerator and the denominator of the F-score simultaneously. This approach should not be confused with the Expected Test Utility (ETU) approach, for which the non-asymptotic behavior can differ significantly. We refer the reader to Dembczynski_Kotlowski_Koyejo_Natarajan17 ; Ye_Chai_Lee_Chieu12 where the PU vs. ETU tale is discussed in depth and their asymptotic equivalence is established. Let us mention that the asymptotic statistical theory of binary classification with F-score has been studied in the prior literature Koyejo_Natarajan_Ravikumar_Dhillon14 ; Narasimhan_Vaish_Agarwal14 ; Menon_Narasimhan_Agarwal_Chawla13 ; Ye_Chai_Lee_Chieu12 . Below, we summarize our contributions and highlight the improvements with respect to the previous results on the non-asymptotic analysis of binary classification with F-score.

• We propose a two-step estimator, which first estimates the regression function $\eta$ and then the optimal threshold $\theta^{*}$ . Two-step estimators of this type, which involve explicit threshold tuning, are well-known in the literature and demonstrate promising empirical performance Koyejo_Natarajan_Ravikumar_Dhillon14 ; Keerthi_Sindhwani_Chapelle07 . An important novelty introduced here is the semi-supervised nature of the procedure, which can exploit the unlabeled data. It is already a well established fact that semi-supervised methods might (Singh_Nowak_Zhu09) or might not (Rigollet07) improve supervised estimation from the statistical point of view. However, let us point out that from the practical point of view, typically the most expensive part of the data gathering process is the correct labeling. Thus, one may assume that the unlabeled dataset $\mathcal{D}_{N}$ is always available in reality and that $N\gg n$ holds. Our analysis implies that in the setting of binary classification with F-score the semi-supervised techniques are not superior to the supervised ones. In contrast, in Chzhen_Denis_Hebiri19 the authors showed that in the context of confidence set classification semi-supervised classifiers might outperform their supervised counterparts.

• From the theoretical point of view, the most relevant reference is a recent work of Yan_Koyejo_Zhong_Ravikumar18 , where the authors studied a rather broad class of performance measures for the problem of binary classification, namely Karmic measures. This class includes the F-score, considered in the present manuscript. Under similar, though stronger, assumptions on the distribution of the pair $(X,Y)\in\mathbb{R}^{d}\times\{0,1\}$ (the authors additionally require that the random variable $\eta(X)$ on $[0,1]$ admits a bounded density) they proposed an algorithm whose rate of convergence is at most $\mathcal{O}(a_{n}^{-{(1+1\wedge\alpha)}/{2}})$ . This rate is rather counterintuitive, since it suggests that increasing the constant $\alpha$ in the margin assumption beyond $1$ does not improve the rate of convergence. In contrast, here we show that for the proposed algorithm the rate of convergence is of order $\mathcal{O}(a_{n}^{-{(1+\alpha)}/{2}})$ . That is, it strictly improves upon the results in Yan_Koyejo_Zhong_Ravikumar18 whenever $\alpha>1$ . However, it should be noted that the authors of Yan_Koyejo_Zhong_Ravikumar18 study a much more general family of score functions and the sub-optimal rate can result from such generality.

• We show that the constructed estimator is optimal in the minimax sense over the class of Hölder smooth regression functions. Let us mention that the optimality of the bound is expected, as in the classical work of Audibert_Tsybakov07 the authors showed that the minimax risk in the standard binary classification setting is of order $a_{n}^{-{(1+\alpha)}/{2}}$ , and it is achieved by a plug-in rule classifier. Clearly, it is hard to expect that the rate in a more difficult situation can be improved. Nevertheless, to the best of our knowledge, minimax optimality in the context of binary classification with F-score has not been considered before.

The paper is organized as follows: in Section 4 we present the semi-supervised classification algorithm; in Section 4.1 we establish an upper bound on the excess F-score under the margin assumption; in Section 4.2 we introduce the class of distributions considered in this work and establish a minimax lower bound on the excess F-score.

4 Main results

In this section we describe the proposed procedure $\hat{g}$ to estimate the Bayes optimal classifier $g^{*}$ ; this procedure is performed in two steps. In the first step we estimate the regression function $\eta\colon\mathbb{R}^{d}\mapsto[0,1]$ using the labeled data $\mathcal{D}_{n}$ , and in the second step we estimate the optimal threshold $\theta^{*}$ based on the unlabeled data $\mathcal{D}_{N}$ and the estimator $\hat{\eta}$ provided by the first step. This procedure falls into the category of plug-in type classifiers, that is, we formally replace all the unknown quantities in the Bayes rule by their estimates. The classifier $\hat{g}$ is defined as

$$\hat{g}(x)=\mathds{1}_{\left\{\hat{\eta}(x)>\hat{\theta}\right\}},$$

where $\hat{\eta}$ is any estimator satisfying Assumption 2 and $\hat{\theta}$ is the unique solution of

$$\theta\,\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\hat{\eta}(X_{i})=\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}(\hat{\eta}(X_{i})-\theta)_{+}.\tag{2}$$

In practice one can use a simple bisection algorithm (Conte_Boor80, Algorithm 3.1) or its more sophisticated modifications (regula falsi or the secant method) to approximate $\hat{\theta}$ with any given precision. For our theoretical analysis we assume that Equation (2) is solved exactly. However, a simple modification of our arguments can handle the situation when the threshold $\hat{\theta}$ is known up to an additive factor $\epsilon_{n}=\mathcal{O}(a_{n}^{-1/2})$ .
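The second step can be sketched in a few lines. The code below is a minimal illustration, not the author's implementation: the names `estimate_threshold` and `plug_in_classifier` are hypothetical, `eta_hat` stands for any already fitted estimator of the regression function with values in $[0,1]$ , and plain bisection is used to solve Equation (2) for $b=1$ .

```python
from typing import Callable, Sequence


def estimate_threshold(eta_hat_values: Sequence[float], tol: float = 1e-10) -> float:
    """Approximate the root of theta * mean(v) - mean((v - theta)_+) on [0, 1].

    eta_hat_values are the values of the fitted regression estimator on the
    unlabeled sample; they are assumed to lie in [0, 1] and not to be all zero
    (otherwise every theta solves the equation).
    """
    n = len(eta_hat_values)
    mean_eta = sum(eta_hat_values) / n

    def phi(theta: float) -> float:
        # Increasing in theta, with phi(0) <= 0 <= phi(1).
        return theta * mean_eta - sum(max(v - theta, 0.0) for v in eta_hat_values) / n

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def plug_in_classifier(eta_hat: Callable[[Sequence[float]], float],
                       unlabeled: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], int]:
    """Return x -> 1{eta_hat(x) > theta_hat}, with theta_hat fitted on unlabeled data."""
    theta_hat = estimate_threshold([eta_hat(x) for x in unlabeled])
    return lambda x: int(eta_hat(x) > theta_hat)


# Usage with a stand-in estimator (an assumption made for the example only).
toy_eta_hat = lambda x: x[0]
toy_unlabeled = [[0.9], [0.6], [0.3], [0.05]]
g_hat = plug_in_classifier(toy_eta_hat, toy_unlabeled)
```

Bisection is chosen here because the left-hand side minus the right-hand side of Equation (2) is monotone in $\theta$ ; only the values $\hat{\eta}(X_{i})$ are needed, so no labels enter this step.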

4.1 Upper bound

The main result of this subsection is an upper bound on the excess score of the proposed procedure. Here we provide two theorems: the first gives an upper bound on the expected difference between the optimal threshold $\theta^{*}$ and its estimate $\hat{\theta}$ , and the second gives an upper bound on the excess F-score.

Theorem 4.1

If there exists an estimator $\hat{\eta}$ of the regression function $\eta$ which satisfies Assumption 2, then there exists a constant $C>0$ which depends on $C_{0},C_{1},C_{2},p$ such that, the threshold $\hat{\theta}$ defined in Eq. (2) satisfies

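$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\left\lvert\theta^{*}-\hat{\theta}\right\rvert\leq C\left(a_{n}^{-1/2}+N^{-1/2}\right).$$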

Theorem 4.2

If the distribution $\mathbb{P}$ of $(X,Y)$ satisfies the $\alpha$ -margin assumption for some $C_{0}>0$ and $\alpha\geq 0$ and there exists an estimator $\hat{\eta}$ of the regression function $\eta$ which satisfies Assumption 2, then there exists a constant $C>0$ which depends on $\alpha,C_{0},C_{1},C_{2},p$ such that

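$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathcal{E}_{1}(\hat{g})\leq C\left(a_{n}^{-\frac{1+\alpha}{2}}+N^{-\frac{1+\alpha}{2}}\right),$$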

where $\hat{g}(x)=\mathds{1}_{\left\{\hat{\eta}(x)>\hat{\theta}\right\}}$ with the threshold $\hat{\theta}$ defined in Equation (2).

Before proceeding to the proofs let us discuss the implications of these results. First of all, there are two regimes in the bound of Theorem 4.2. The first one is $N\geq a_{n}$ ; in this regime, the dominant term is $a_{n}^{-{(1+\alpha)}/{2}}$ , which is the classical rate of convergence in the standard setting of binary classification with the $\alpha$ -margin assumption. The second regime is $N<a_{n}$ ; then the dominating term of the bound is $N^{-{(1+\alpha)}/{2}}$ . However, let us recall that one can always augment the second unlabeled dataset $\mathcal{D}_{N}$ by dividing $\mathcal{D}_{n}$ into two independent parts. It implies that the second regime never occurs in our theoretical analysis of the excess score and the upper bound is actually independent of $N$ . Similar reasoning holds for the case of the optimal threshold estimation in Theorem 4.1. Once it is clear that the obtained upper bounds are actually independent of the size of the unlabeled dataset $\mathcal{D}_{N}$ , it is interesting to notice that the dependence on $n$ is the same as in the standard case of binary classification Audibert_Tsybakov07 . That is, similarly to the standard setting, binary classification with F-score can achieve fast (faster than $1/\sqrt{n}$ ) and even super-fast (faster than $1/n$ ) rates depending on the interplay of $\alpha,\beta,d$ .
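To make the interplay of $\alpha,\beta,d$ concrete, the exponents can be computed directly from $a_{n}=n^{2\beta/(2\beta+d)}$ and the rate $a_{n}^{-(1+\alpha)/2}$ . The helper names and the regime labels in the sketch below are assumptions made for illustration; whether a given triple $(\alpha,\beta,d)$ is compatible with the distributional assumptions of Section 4.2 is a separate question that this snippet does not address.

```python
def estimation_exponent(beta: float, d: int) -> float:
    """Exponent e such that a_n = n^e for a beta-smooth regression function."""
    return 2.0 * beta / (2.0 * beta + d)


def excess_score_exponent(alpha: float, beta: float, d: int) -> float:
    """Exponent r such that the excess-score bound a_n^{-(1+alpha)/2} equals n^{-r}."""
    return (1.0 + alpha) * beta / (2.0 * beta + d)


def regime(alpha: float, beta: float, d: int) -> str:
    """Ad hoc labels: compare n^{-r} with the benchmarks 1/sqrt(n) and 1/n."""
    r = excess_score_exponent(alpha, beta, d)
    if r > 1.0:
        return "super-fast"   # faster than 1/n, i.e. (alpha - 1) * beta > d
    if r > 0.5:
        return "fast"         # faster than 1/sqrt(n), i.e. 2 * alpha * beta > d
    return "slow"             # not faster than 1/sqrt(n)


examples = {
    (1.0, 1.0, 1): regime(1.0, 1.0, 1),   # r = 2/3
    (3.0, 2.0, 1): regime(3.0, 2.0, 1),   # r = 8/5
    (0.1, 1.0, 4): regime(0.1, 1.0, 4),   # r = 1.1/6
}
```

The two conditions in the comments follow by rearranging $(1+\alpha)\beta/(2\beta+d)>1$ and $(1+\alpha)\beta/(2\beta+d)>1/2$ respectively.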

The proofs of both theorems rely on the following lemma, provided in Appendix B, which relates the difference of the thresholds to the difference between the cumulative distribution function (CDF) of $\eta$ and the empirical CDF of $\hat{\eta}$ .

Lemma 2

Let $\hat{\theta}\in[0,1]$ be the threshold which satisfies Equation (2), then

$$\left\lvert\hat{\theta}-\theta^{*}\right\rvert\mathbb{P}(Y=1)\leq\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\eta(X)\leq t)-\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\mathds{1}_{\left\{\hat{\eta}(X_{i})\leq t\right\}}\right\rvert dt.$$

This result is the main reason why our conclusions on semi-supervised estimation are different from the ones in Chzhen_Denis_Hebiri19 ; Singh_Nowak_Zhu09 . For instance, in Chzhen_Denis_Hebiri19 the authors also obtain a final decision rule by thresholding on some estimated level. However, in the present work the difference between $\theta^{*}$ and $\hat{\theta}$ is controlled via the $\ell_{1}$ -norm of the difference of CDFs, whereas in Chzhen_Denis_Hebiri19 they control a similar quantity through the Wasserstein infinity distance.

The complete proof of Theorems 4.1 and 4.2 can be found in Appendix C, we only sketch the steps which are different from the analysis of Audibert_Tsybakov07 . Recall, that due to Lemma 1 we have the following bound for the excess score $\mathcal{E}_{1}$

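$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\frac{\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq\hat{g}(X)\right\}}}{\mathbb{P}(Y=1)+\mathbb{P}(\hat{g}(X)=1)}\leq\frac{1}{p}\,\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{g^{*}(X)\neq\hat{g}(X)\right\}}.$$

First of all, notice that if for some $x\in\mathbb{R}^{d}$ the event $g^{*}(x)\neq\hat{g}(x)$ occurs, then we have

$$\left\lvert\eta(x)-\theta^{*}\right\rvert\leq\left\lvert\eta(x)-\hat{\eta}(x)\right\rvert+\left\lvert\theta^{*}-\hat{\theta}\right\rvert,$$

which further implies that at least one of the following inequalities holds for this $x\in\mathbb{R}^{d}$

$$\left\lvert\eta(x)-\theta^{*}\right\rvert\leq 2\left\lvert\eta(x)-\hat{\eta}(x)\right\rvert\quad\text{or}\quad\left\lvert\eta(x)-\theta^{*}\right\rvert\leq 2\left\lvert\theta^{*}-\hat{\theta}\right\rvert.$$

Thus, we can upper bound the excess risk as

$$\mathcal{E}_{1}(\hat{g})\leq\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 2\left\lvert\eta(X)-\hat{\eta}(X)\right\rvert\right\}}}_{T_{1}}+\underbrace{\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 2\left\lvert\theta^{*}-\hat{\theta}\right\rvert\right\}}}_{T_{2}}.$$

The first term on the right hand side ( $T_{1}$ ) of the inequality can be handled by the peeling technique used in (Audibert_Tsybakov07, Lemma 3.1), which implies that there exists a constant $C^{\prime}=C^{\prime}(p,\alpha,C_{0},C_{1},C_{2})>0$ such that

$$\mathbb{E}_{(\mathcal{D}_{n},\mathcal{D}_{N})}T_{1}\leq C^{\prime}a_{n}^{-\frac{1+\alpha}{2}}.$$

Hence, it remains to upper bound the second term on the right hand side ( $T_{2}$ ) of the inequality. Using Lemma 2 we can upper bound $T_{2}$ as

$$T_{2}\leq\frac{1}{p}\,\mathbb{E}\left\lvert\eta(X)-\theta^{*}\right\rvert\mathds{1}_{\left\{E\right\}},$$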

with $E=\left\{p\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 2\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\eta(X)\leq t)-\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\mathds{1}_{\left\{\hat{\eta}(X_{i})\leq t\right\}}\right\rvert dt\right\}$ . Finally, we upper bound the indicator $\mathds{1}_{\left\{E\right\}}$ by the indicators of two events $E^{1}$ and $E^{2}$ which are defined as

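$$E^{1}=\left\{p\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 4\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\hat{\eta}(X)\leq t)-\frac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\mathds{1}_{\left\{\hat{\eta}(X_{i})\leq t\right\}}\right\rvert dt\right\},\qquad E^{2}=\left\{p\left\lvert\eta(X)-\theta^{*}\right\rvert\leq 4\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\hat{\eta}(X)\leq t)-\mathbb{P}_{X}(\eta(X)\leq t)\right\rvert dt\right\}.$$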

Thus, we have the following upper bound on $T_{2}$

[TABLE]

Notice that thanks to the Dvoretzky-Kiefer-Wolfowitz inequality Dvoretzky_Kiefer_Wolfowitz56 ; Massart90 the term

[TABLE]

conditionally on $\mathcal{D}_{n}$ admits an exponential concentration with the rate $N^{-1/2}$ . Hence, using the margin assumption, one can show that there exists a constant $C^{\prime\prime}=C^{\prime\prime}(p,\alpha,C_{0})>0$ such that

[TABLE]

For the second term $T_{2}^{2}$ we proceed as follows

[TABLE]

thus, using the $\alpha$ -margin assumption we get

[TABLE]

The integral on the right hand side of the bound corresponds to the $1$ -Wasserstein distance on the real line, see for instance (Bobkov_Ledoux16, Theorem 2.9) or Vallender74 for the proof, and can be further upper bounded by the $L_{1}$ norm of the difference between $\hat{\eta}$ and $\eta$ , that is

[TABLE]

Since the estimator $\hat{\eta}$ satisfies Assumption 2, one can show that there exists a constant $C^{\prime\prime\prime}=C^{\prime\prime\prime}(p,\alpha,C_{0},C_{1},C_{2})>0$ such that

[TABLE]

Combination of all the inequalities yields the result of Theorem 4.2. Notice that the same reasoning starting from Lemma 2 implies the upper bound on the threshold estimation, that is, Theorem 4.1.
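The step that bounds the integral of the CDF difference by the $L_{1}$ norm can be illustrated numerically. The following Python sketch is not part of the original analysis: the helper name `cdf_l1_distance` and the synthetic scores are illustrative assumptions. For two samples it computes $\int\lvert F_{a}(t)-F_{b}(t)\rvert dt$ exactly from the empirical CDFs and compares it with the average pointwise gap between the paired values.

```python
import numpy as np

def cdf_l1_distance(a, b):
    """Exact integral of |F_a(t) - F_b(t)| dt for two empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.sort(np.concatenate([a, b]))
    left = grid[:-1]            # both CDFs are constant on [left, next point)
    widths = np.diff(grid)
    cdf_a = np.searchsorted(a, left, side="right") / a.size
    cdf_b = np.searchsorted(b, left, side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * widths))

# Synthetic stand-ins (assumption): "true" scores and a noisy estimate of them.
rng = np.random.default_rng(0)
eta = rng.uniform(size=500)
eta_hat = np.clip(eta + 0.05 * rng.normal(size=500), 0.0, 1.0)

w1 = cdf_l1_distance(eta, eta_hat)                  # 1-Wasserstein distance on the line
l1 = float(np.mean(np.abs(eta_hat - eta)))          # empirical L1 distance
sorted_gap = float(np.mean(np.abs(np.sort(eta) - np.sort(eta_hat))))
print(w1, sorted_gap, l1)
```

On equal-size samples the CDF integral coincides with the mean gap between order statistics, and it never exceeds the mean pointwise gap, which is the inequality used in the sketch of the proof.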

4.2 Lower bound

At the beginning of this section we state the class of distributions $\mathcal{P}_{\Sigma}$ of the random pair $(X,Y)\in\mathbb{R}^{d}\times\{0,1\}$ considered in this work. The first assumption is made on the smoothness of the regression function $\eta\colon\mathbb{R}^{d}\to[0,1]$ .

Definition 1 (Hölder smoothness)

Let $L>0$ and $\beta>0$ . The class of functions $\Sigma(\beta,L,\mathbb{R}^{d})$ consists of all functions $h\colon\mathbb{R}^{d}\to[0,1]$ such that for all $x,x^{\prime}\in\mathbb{R}^{d}$ , we have

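$$\left\lvert h(x^{\prime})-h_{x}(x^{\prime})\right\rvert\leq L\left\lVert x-x^{\prime}\right\rVert_{2}^{\beta},$$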

where $h_{x}(\cdot)$ is the Taylor polynomial of $h$ at point $x$ of degree $\lfloor\beta\rfloor$ .

Assumption 4 ( $(\beta,L)$ -Hölder regression function)

The distribution $\mathbb{P}$ of the pair $(X,Y)\in\mathbb{R}^{d}\times\{0,1\}$ is such that $\eta\in\Sigma(\beta,L,\mathbb{R}^{d})$ for some positive $\beta,L$ .

Assumption 4 is usually not sufficient to guarantee the existence of an estimator $\hat{\eta}$ satisfying Assumption 2: extra assumptions are required on the marginal distribution $\mathbb{P}_{X}$ of the vector $X\in\mathbb{R}^{d}$ .

Definition 2

A Lebesgue measurable set $A\subset\mathbb{R}^{d}$ is said to be $(c_{0},r_{0})$ -regular for some constants $c_{0}>0,r_{0}>0$ if for every $x\in A$ and every $r\in(0,r_{0}]$ we have

$$\lambda\left(A\cap\mathcal{B}(x,r)\right)\geq c_{0}\lambda\left(\mathcal{B}(x,r)\right),$$

where $\lambda$ is the Lebesgue measure and $\mathcal{B}(x,r)$ is the Euclidean ball of radius $r$ centered at $x$ .
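To make the regularity condition concrete, the following Python sketch estimates the ratio $\lambda(A\cap\mathcal{B}(x,r))/\lambda(\mathcal{B}(x,r))$ by Monte Carlo for the unit cube $A=[0,1]^{d}$ . The choice of the set $A$ , the function name `ball_overlap_ratio` and the sample size are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def ball_overlap_ratio(x, r, n_samples=100_000, seed=0):
    """Monte Carlo estimate of lambda(A ∩ B(x, r)) / lambda(B(x, r)) for A = [0, 1]^d."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.size
    # Uniform sampling in the ball: uniform direction times radius r * U^(1/d).
    directions = rng.normal(size=(n_samples, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = r * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
    points = x + radii * directions
    inside = np.all((points >= 0.0) & (points <= 1.0), axis=1)
    return float(inside.mean())

corner = ball_overlap_ratio([0.0, 0.0], r=0.5)   # a quarter of the disc lies in the square
center = ball_overlap_ratio([0.5, 0.5], r=0.5)   # the whole disc lies in the square
print(corner, center)
```

For $d=2$ the corner is the least favourable point shown here, with a ratio close to $1/4$ , which suggests that the unit square is regular with $c_{0}=1/4$ for radii up to $1$ .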

Assumption 5 (Strong density assumption)

We say that the marginal distribution $\mathbb{P}_{X}$ of the vector $X\in\mathbb{R}^{d}$ satisfies the strong density assumption if

• $\mathbb{P}_{X}$ is supported on a compact $(c_{0},r_{0})$ -regular set $A\subset\mathbb{R}^{d}$ ,

• $\mathbb{P}_{X}$ admits a density $\mu$ w.r.t. the Lebesgue measure uniformly lower- and upper-bounded by ${\mu_{\min}}>0$ and $\mu_{\max}>0$ respectively.

If the regression function $\eta\colon\mathbb{R}^{d}\to[0,1]$ is $(\beta,L)$ -Hölder and the marginal distribution satisfies the strong density assumption, one can state the following result due to Audibert_Tsybakov07 .

Theorem 4.3 (Audibert_Tsybakov07 )

Let $\mathcal{P}$ be a class of distributions on $\mathbb{R}^{d}\times\{0,1\}$ such that the regression function $\eta\in\Sigma(\beta,L,\mathbb{R}^{d})$ and the marginal distribution $\mathbb{P}_{X}$ satisfies the strong density assumption. Then, there exists an estimator $\hat{\eta}$ of the regression function satisfying

[TABLE]

for some constants $C_{1},C_{2}$ depending on $\beta,d,L,c_{0},r_{0}$ .

Consider a class of distributions $\mathcal{P}_{\Sigma}$ for which Assumptions 1, 3, 4, 5 are satisfied. Then Theorem 4.3 and Theorems 4.1, 4.2 imply the following corollary.

Corollary 1

There exist constants $C,B>0$ which depend only on $\alpha,p,d,C_{0},C_{1},C_{2}$ such that for any $n>1,N>1$ we have

[TABLE]

where the infima are taken over all estimators $\hat{g}$ and $\hat{\theta}$ respectively.

The next theorem states that the upper bounds of the previous corollary are optimal up to a constant multiplicative factor.

Theorem 4.4

If $\alpha\beta\leq d$ , there exists a constant $c>0$ such that for any $n>1,N>1$ we have the following lower bound on the minimax risk

[TABLE]

where the infimum is taken over all estimators $\hat{g}$ .

The proof of the lower bound can be found in Appendix D; it follows standard information-theoretic arguments using a reduction of the minimax risk to a Bayes risk. The construction of the distributions is inspired by both Rigollet_Vert09 and Audibert_Tsybakov07 , and the actual proof relies on (Audibert04, Lemma 5.1), which is based on Assouad’s lemma, see for instance (Tsybakov09, Lemma 2.12).
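For orientation, the exponent of the rate $n^{-(1+\alpha)\beta/(2\beta+d)}$ stated in the abstract can be evaluated for concrete parameters. The short Python sketch below is only an illustration; the function names are hypothetical, and the comparison with the exponent $1/2$ is the usual convention for calling a rate fast rather than a statement taken from this section.

```python
def rate_exponent(alpha, beta, d):
    """Exponent r such that the excess F-score rate is n^(-r)."""
    return (1.0 + alpha) * beta / (2.0 * beta + d)

def is_faster_than_root_n(alpha, beta, d):
    """True when n^(-r) decays faster than n^(-1/2), i.e. when 2 * alpha * beta > d."""
    return rate_exponent(alpha, beta, d) > 0.5

for alpha, beta, d in [(0.0, 1.0, 1), (1.0, 1.0, 1), (1.0, 2.0, 5)]:
    print(alpha, beta, d, rate_exponent(alpha, beta, d), is_faster_than_root_n(alpha, beta, d))
```

Under the condition $\alpha\beta\leq d$ of Theorem 4.4 the exponent is at most $(\beta+d)/(2\beta+d)$ and therefore stays below $1$ .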

5 Conclusion

In this work we proposed a semi-supervised plug-in type algorithm for the problem of binary classification with F-score. The proposed algorithm can leverage an unlabeled dataset for the estimation of the optimal threshold. Under the margin assumption it is shown that the proposed algorithm is optimal in the minimax sense and can achieve fast rates of convergence. Future work on binary classification with F-score will be devoted to empirical risk minimization rules.

Acknowledgements.

This work was partially supported by “Labex Bézout” of Université Paris-Est. Besides, we would like to thank Joseph Salmon and Mohamed Hebiri for their thoughtful remarks.

Appendix A Bayes classifier and Lemma 1

For the rest of this section the parameter $b>0$ is assumed to be fixed and known. Let us first recall the definition of the $\text{F}_{b}$ -score

$$F_{b}(g)\coloneqq\frac{\mathbb{P}(Y=1,g(X)=1)}{b^{2}\mathbb{P}(Y=1)+\mathbb{P}(g(X)=1)},$$

and an optimal classifier is defined as

$$g^{*}\in\operatorname*{arg\,max}_{g\in\mathcal{G}}F_{b}(g).$$

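In this section we would like to show that the classifier defined for all $x\in\mathbb{R}^{d}$ as

$$g_{*}(x)=\mathds{1}_{\left\{\eta(x)\geq\theta^{*}\right\}},$$

with $\theta^{*}$ being a root of

$$\theta\mapsto b^{2}\mathbb{P}(Y=1)\theta-\mathbb{E}(\eta(X)-\theta)_{+},$$

is an optimal classifier.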

Let us first show that $\theta^{*}$ is well-defined, that is, it exists and is unique for every distribution with $\mathbb{P}(Y=1)\neq 0$ . To this end, we study the solutions of the following equation

$$b^{2}\mathbb{P}(Y=1)\theta=\mathbb{E}(\eta(X)-\theta)_{+}.$$

Clearly, the mapping $\theta\mapsto b^{2}\mathbb{P}(Y=1)\theta$ is continuous and strictly increasing on $[0,1]$ and the mapping $\theta\mapsto\mathbb{E}(\eta(X)-\theta)_{+}$ is non-increasing on $[0,1]$ . Thus, it is sufficient to demonstrate that the mapping $\theta\mapsto\mathbb{E}(\eta(X)-\theta)_{+}$ is continuous. Indeed, let $\theta,\theta^{\prime}\in[0,1]$ ; then, due to the Lipschitz continuity of $(\cdot)_{+}$ we can write

$$\left\lvert\mathbb{E}(\eta(X)-\theta)_{+}-\mathbb{E}(\eta(X)-\theta^{\prime})_{+}\right\rvert\leq\left\lvert\theta-\theta^{\prime}\right\rvert.$$

This implies that the mapping $\theta\mapsto\mathbb{E}(\eta(X)-\theta)_{+}$ is $1$ -Lipschitz and thus continuous. Moreover, at $\theta=0$ we have $b^{2}\mathbb{P}(Y=1)\theta=0\leq\mathbb{E}\eta(X)$ , while at $\theta=1$ we have $b^{2}\mathbb{P}(Y=1)\theta=b^{2}\mathbb{P}(Y=1)>0=\mathbb{E}(\eta(X)-1)_{+}$ , so the two mappings cross on $[0,1]$ . Hence, the threshold $\theta^{*}$ is well-defined, that is, it exists and is unique. Consequently, the classifier $x\mapsto\mathds{1}_{\left\{\eta(x)\geq\theta^{*}\right\}}$ is well-defined.
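The monotonicity argument above also suggests a simple way to compute the threshold in practice. The following Python sketch solves $b^{2}\hat{p}\,\theta=\frac{1}{N}\sum_{i}(s_{i}-\theta)_{+}$ by bisection on a sample of scores $s_{i}$ . It is a minimal illustration under stated assumptions: the function name `estimate_threshold` is hypothetical, and $\mathbb{P}(Y=1)$ is replaced by the average score $\hat{p}$ , which may differ from the exact form of Equation 2.

```python
import numpy as np

def estimate_threshold(scores, b=1.0, tol=1e-10):
    """Bisection for the root of theta -> b^2 * p_hat * theta - mean((scores - theta)_+)."""
    scores = np.asarray(scores, dtype=float)
    p_hat = scores.mean()            # plug-in for P(Y = 1) = E[eta(X)] (assumption)
    if p_hat <= 0.0:
        raise ValueError("the average score must be positive")

    def gap(theta):
        # Strictly increasing in theta: negative at 0, positive at 1.
        return b**2 * p_hat * theta - np.maximum(scores - theta, 0.0).mean()

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

theta_hat = estimate_threshold([0.2, 0.8])
print(theta_hat)
```

For the two scores $0.2$ and $0.8$ with $b=1$ the equation reads $0.5\,\theta=(0.8-\theta)/2$ on $[0.2,0.8]$ , whose solution is $\theta=0.4$ .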

Now, we are interested in the value $F_{b}(g_{*})$ . We can write

[TABLE]

using the definition of $\theta^{*}$ we continue as

[TABLE]

To conclude the optimality of $g_{*}$ we prove Lemma 1.

Proof

Fix an arbitrary measurable function $g\colon\mathbb{R}^{d}\to\{0,1\}$ , then by the definition of the excess score we have

[TABLE]

Using Theorem 2.1 we know that $\theta^{*}=F_{b}(g^{*})$ and therefore

[TABLE]

We conclude by solving the previous equality for $\mathcal{E}(g)$ . Thus, $g_{*}$ is a Bayes optimal classifier and hence can be denoted by $g^{*}$ .
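The identity $\theta^{*}=F_{b}(g^{*})$ and the optimality of thresholding can be checked by brute force on a small discrete example. The Python sketch below is only an illustration: the four-point support, its weights and the helper names are assumptions made for the example, and the score uses the definition of $F_{b}$ recalled at the beginning of this appendix.

```python
from itertools import product
import numpy as np

eta = np.array([0.1, 0.3, 0.6, 0.9])       # values of eta on a 4-point support (assumption)
weights = np.array([0.4, 0.3, 0.2, 0.1])   # marginal probabilities of the 4 points
b = 1.0
p = float(weights @ eta)                   # P(Y = 1) = E[eta(X)]

def f_score(g):
    """F_b(g) = P(Y=1, g(X)=1) / (b^2 P(Y=1) + P(g(X)=1))."""
    g = np.asarray(g, dtype=float)
    return float(weights @ (eta * g)) / (b**2 * p + float(weights @ g))

def solve_threshold(tol=1e-12):
    """Bisection for b^2 * p * theta = E(eta(X) - theta)_+."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        gap = b**2 * p * mid - float(weights @ np.maximum(eta - mid, 0.0))
        lo, hi = (mid, hi) if gap < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

theta_star = solve_threshold()
g_star = (eta >= theta_star).astype(int)
# Enumerate all 2^4 classifiers on the support and keep the best score.
best_brute_force = max(f_score(g) for g in product([0, 1], repeat=eta.size))
print(theta_star, f_score(g_star), best_brute_force)
```

In this example the three printed values agree up to the bisection tolerance, in line with the statement that the thresholding rule is Bayes optimal and that its score equals the threshold.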

Appendix B Proof of Lemma 2

Proof

To prove this lemma, it is convenient to rewrite Equation 2 in terms of CDFs. Let $\mu$ be an arbitrary probability measure on $\mathbb{R}^{d}$ and $p\colon\mathbb{R}^{d}\to[0,1]$ be any measurable function; then using Fubini’s theorem we can write

[TABLE]

and for any $\theta\in[0,1]$ , since $(p(x)-\theta)_{+}\in[0,1]$ we have

[TABLE]

Let us denote by $\mathbb{P}_{X,N}=\tfrac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\delta_{X_{i}}$ the empirical measure of the unlabeled dataset $\mathcal{D}_{N}$ . Using these equalities, the thresholds $\theta^{*},\hat{\theta}\in[0,1]$ satisfy

[TABLE]

Now, we are in position to bound the difference $|\hat{\theta}-\theta^{*}|$ . First assume that $\theta^{*}\geq\hat{\theta}$ , then

[TABLE]

Further, if $\hat{\theta}>\theta^{*}$ we can write

[TABLE]

where the last inequality follows the same lines as for the case $\hat{\theta}\leq\theta^{*}$ .
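The rewriting in terms of CDFs rests on the layer-cake identity $\int(p(x)-\theta)_{+}d\mu(x)=\int_{\theta}^{1}\mu(p(X)>t)dt$ , which is presumably what the Fubini step above yields. The Python sketch below checks this identity on an empirical measure; the function name `tail_integral`, the grid size and the synthetic scores are illustrative assumptions.

```python
import numpy as np

def tail_integral(scores, theta, grid_size=20_000):
    """Midpoint-rule approximation of the integral over [theta, 1] of P_hat(score > t) dt."""
    scores = np.asarray(scores, dtype=float)
    edges = np.linspace(theta, 1.0, grid_size + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    sorted_scores = np.sort(scores)
    # Empirical survival function: 1 - (number of scores <= t) / N.
    survival = 1.0 - np.searchsorted(sorted_scores, mids, side="right") / scores.size
    return float(np.sum(survival * np.diff(edges)))

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)          # synthetic values of p(X_i) in [0, 1]
theta = 0.3
direct = float(np.mean(np.maximum(scores - theta, 0.0)))
via_cdf = tail_integral(scores, theta)
print(direct, via_cdf)
```

The two quantities agree up to the discretization error of the grid, which is at most half the grid step.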

Appendix C Proof of the upper bound

Let $\hat{\eta}$ be an estimator of the regression function based on the labeled dataset $\mathcal{D}_{n}$ which satisfies Assumption 2. Recall that the estimator $\hat{g}$ is defined for every $x\in\mathbb{R}^{d}$ as

[TABLE]

with $\hat{\theta}$ being the unique solution of Eq. (2). Unless stated otherwise, we work conditionally on $(\mathcal{D}_{n},\mathcal{D}_{N})$ . Using Lemma 1 we can express the excess score of $\hat{g}$ as

[TABLE]

Clearly, on the event $\left\{g^{*}(X)\neq\hat{g}(X)\right\}$ it holds that $\left\{\left\lvert\eta(X)-\theta^{*}\right\rvert\leq\left\lvert\hat{\eta}(X)-\eta(X)\right\rvert+\left\lvert\hat{\theta}-\theta^{*}\right\rvert\right\}$ , thus

[TABLE]

Using Lemma 2, the excess risk can be further upper bounded as

[TABLE]

Notice that the quantity $\int_{0}^{1}\left\lvert\mathbb{P}_{X}(\eta(X)\leq t)-\mathbb{P}_{X}(\hat{\eta}(X)\leq t)\right\rvert dt=\left\lVert F_{\eta}-F_{\hat{\eta}}\right\rVert_{1}$ , with $F_{\eta},F_{\hat{\eta}}$ being the cumulative distribution functions of $\eta(X),\hat{\eta}(X)$ respectively, corresponds to the 1-Wasserstein distance between the distributions of $\eta(X)$ and $\hat{\eta}(X)$ , see Bobkov_Ledoux16 for an in-depth discussion. Therefore, we have

[TABLE]

and introducing notation $\hat{\mathbb{P}}_{X}\vcentcolon=\tfrac{1}{N}\sum_{X_{i}\in\mathcal{D}_{N}}\delta_{X_{i}}$ for the empirical measure of the feature vector $X$ we can write

[TABLE]

Finally, using the margin Assumption 1 we can write

[TABLE]

Taking expectations on both sides with respect to the distribution of $(\mathcal{D}_{n},\mathcal{D}_{N})$ we follow (Audibert_Tsybakov07, Lemma 3.1) to bound the first term on the right hand side. This peeling argument has become classical in the literature and thus is omitted here. Moreover, using Assumption 2 the second term can be bounded with the same rate as the first term. These arguments imply that there exists $C\geq 0$ such that for all $n,N\geq 1$ it holds that

[TABLE]

It remains to upper bound the second term in the bound above. To this end we recall the classical Dvoretzky-Kiefer-Wolfowitz inequality Massart90 .

Lemma 3 (Dvoretzky-Kiefer-Wolfowitz inequality)

Given $N\geq 1$ , let $Z_{1},\ldots,Z_{N}$ be i.i.d. real-valued random variables with cumulative distribution function $F_{Z}$ . Denote by $\hat{F}_{Z}$ the cumulative distribution function with respect to the empirical measure, that is, with respect to $\frac{1}{N}\sum_{i=1}^{N}\delta_{Z_{i}}$ . Then for every $t>0$ we have

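$$\mathbb{P}\left(\sup_{x\in\mathbb{R}}\left\lvert\hat{F}_{Z}(x)-F_{Z}(x)\right\rvert\geq t\right)\leq 2\exp\left(-2Nt^{2}\right).$$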

Let us apply this lemma to $Z_{i}\vcentcolon=\hat{\eta}(X_{i})$ . Conditionally on $\mathcal{D}_{n}$ these random variables are i.i.d. and real-valued, thus for all $t>0$

[TABLE]

Finally, to conclude the upper bound we apply this exponential concentration to upper bound the expectation as

[TABLE]

where we used the shortcut $\Delta_{(\mathcal{D}_{N},\mathcal{D}_{n})}$ for the desired empirical process. Combining all the bounds we conclude.
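The Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant, $\mathbb{P}(\sup_{x}\lvert\hat{F}_{Z}(x)-F_{Z}(x)\rvert\geq t)\leq 2\exp(-2Nt^{2})$ , can be illustrated by simulation. The Python sketch below is not part of the proof: it uses uniform random variables as a stand-in for $\hat{\eta}(X_{i})$ , and the function names and simulation sizes are assumptions.

```python
import numpy as np

def sup_cdf_deviation(sample):
    """sup_x |F_hat(x) - x| for a sample from the Uniform(0, 1) distribution."""
    z = np.sort(np.asarray(sample, dtype=float))
    n = z.size
    upper = np.arange(1, n + 1) / n - z    # deviation just after each jump of F_hat
    lower = z - np.arange(0, n) / n        # deviation just before each jump of F_hat
    return float(max(upper.max(), lower.max()))

def dkw_check(n_unlabeled=200, t=0.1, n_rep=2000, seed=0):
    """Return the simulated exceedance frequency and the DKW bound 2 exp(-2 N t^2)."""
    rng = np.random.default_rng(seed)
    deviations = np.array(
        [sup_cdf_deviation(rng.uniform(size=n_unlabeled)) for _ in range(n_rep)]
    )
    frequency = float(np.mean(deviations >= t))
    bound = float(2.0 * np.exp(-2.0 * n_unlabeled * t**2))
    return frequency, bound

frequency, bound = dkw_check()
print(frequency, bound)
```

The bound depends on the unlabeled sample size $N$ only, which is why the resulting term in the proof is of order $N^{-1/2}$ after integration over $t$ .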

Appendix D Proof of the lower bound

Proof

The proof is similar to the one used in Audibert_Tsybakov07 and in Rigollet_Vert09 and is based on Assouad’s lemma. Similarly, we define the regular grid on $\mathbb{R}^{d}$ as

[TABLE]

and denote by $n_{q}(x)\in G_{q}$ the closest point of the grid $G_{q}$ to the point $x\in\mathbb{R}^{d}$ . Such a grid defines a partition of the unit cube $[0,1]^{d}\subset\mathbb{R}^{d}$ denoted by $\mathcal{X}^{\prime}_{1},\ldots,\mathcal{X}^{\prime}_{q^{d}}$ . Besides, denote $\mathcal{X}^{\prime}_{-j}\coloneqq\{x\in\mathbb{R}^{d}\,\colon\,-x\in\mathcal{X}_{j}^{\prime}\}$ for all $j=1,\ldots,q^{d}$ . For a fixed integer $m\leq q^{d}$ and for any $i\in\{1,\ldots,m\}$ define $\mathcal{X}_{i}\coloneqq\mathcal{X}_{i}^{\prime}$ , $\mathcal{X}_{-i}\coloneqq\mathcal{X}_{-i}^{\prime}$ . For every $\sigma\in\{-1,1\}^{m}$ we define a regression function $\eta_{\sigma}$ as

[TABLE]

where $\rho,\varphi,\xi,\tau$ are to be specified and $\mathcal{B}(0,\sqrt{d}+\rho),\mathcal{B}(0,\sqrt{d})$ are Euclidean balls of radius $\sqrt{d}+\rho$ and $\sqrt{d}$ respectively. The definition of the function $\varphi$ is exactly the same as in Audibert_Tsybakov07 . That is, $\varphi(x)\coloneqq C_{\varphi}q^{-\beta}u(q\left\lVert x-n_{q}(x)\right\rVert_{2})$ , where $u$ is some non-increasing infinitely differentiable function such that $u(x)=1$ for $x\in[0,1/4]$ and $u(x)=0$ for $x\geq 1/2$ . The function $\xi$ is defined as $\xi(x)=(\tau-1/4)v([\left\lVert x\right\rVert_{2}-\sqrt{d}]/\rho)+1/4$ , where $v$ is a non-decreasing infinitely differentiable function such that $v(x)=0$ for $x\leq 0$ and $v(x)=1$ for $x\geq 1$ . The constant $\rho$ is chosen large enough to ensure that $|\xi(x^{\prime})-\xi_{x}(x^{\prime})|\leq L\left\lVert x-x^{\prime}\right\rVert_{2}^{\beta}$ for any $x,x^{\prime}\in\mathbb{R}^{d}$ .

For any $\sigma\in\{-1,1\}^{m}$ we construct a marginal distribution $P_{X}$ which is independent of $\sigma$ and has a density $\mu$ w.r.t. the Lebesgue measure on $\mathbb{R}^{d}$ . Fix some $0<w\leq m^{-1}$ and let $A_{0}$ be a Euclidean ball in $\mathbb{R}^{d}$ that has an empty intersection with $\mathcal{B}(0,\sqrt{d}+\rho)$ and whose Lebesgue measure is $\lambda(A_{0})=1-mq^{-d}$ . The density $\mu$ is constructed as

• $\mu(x)=\frac{w}{\lambda(\mathcal{B}(0,(4q)^{-1}))}$ for every $z\in G_{q}$ and every $x\in\mathcal{B}(z,(4q)^{-1})$ or $x\in\mathcal{B}(-z,(4q)^{-1})$ ,

• $\mu(x)=\frac{1-2mw}{\lambda(A_{0})}$ for every $x\in A_{0}$ ,

• $\mu(x)=0$ for every other $x\in\mathbb{R}^{d}$ .
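As a quick sanity check of this construction, the total mass of $\mu$ can be computed directly. The computation below assumes that the first item ranges over the $m$ grid points associated with $\mathcal{X}_{1},\ldots,\mathcal{X}_{m}$ together with their reflections, so that there are $2m$ disjoint balls of radius $(4q)^{-1}$ ; this reading is an assumption, since the item is stated for every $z\in G_{q}$ .

```latex
\int_{\mathbb{R}^{d}}\mu(x)\,dx
  = \underbrace{2m\cdot\frac{w}{\lambda(\mathcal{B}(0,(4q)^{-1}))}\cdot\lambda(\mathcal{B}(0,(4q)^{-1}))}_{\text{the } 2m \text{ small balls}}
  + \underbrace{\frac{1-2mw}{\lambda(A_{0})}\cdot\lambda(A_{0})}_{\text{the ball } A_{0}}
  = 2mw + (1-2mw) = 1 .
```

Under this reading $\mu$ is a probability density as soon as $2mw\leq 1$ , which matches the condition $mw\leq 1/2$ used later in the proof.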

To complete the construction it remains to specify the value of $\tau\in[0,1]$ . The idea here is to force the optimal threshold $\theta^{*}$ to be equal to some predefined constant using the additional degree of freedom provided by the parameter $\tau$ . Importantly, this optimal threshold should not depend on the binary vector $\sigma\in\{-1,1\}^{m}$ . To achieve this, we set $\theta^{*}=1/4$ and show that there exists an appropriate choice of $\tau$ . First, recall that the optimal threshold $\theta^{*}$ satisfies

[TABLE]

Define $b^{\prime}=\int_{\mathcal{X}_{1}}\varphi(x)\mu(x)dx/\int_{\mathcal{X}_{1}}\mu(x)dx$ and put $\theta^{*}=1/4$ . Notice that the left hand side of the last equality for every $\sigma\in\{-1,1\}^{m}$ is given by

[TABLE]

For the right hand side $\mathbb{E}_{\mu}(\eta_{\sigma}(X)-1/4)_{+}$ , there are two cases: $\tau>1/4$ and $0<\tau\leq 1/4$ . One can easily show that as long as $b^{\prime}\leq 1/8$ there is no value of $\tau$ in the second case which allows to fix $\theta^{*}=1/4$ . Therefore, $\tau>1/4$ and we can write for every $\sigma\in\{-1,1\}^{m}$

[TABLE]

Finally, the parameter $\tau$ must satisfy the following equality

[TABLE]

solving for $\tau$ we get

[TABLE]

If $mw\leq 1/2$ we can ensure that $\tau\leq 1$ , that is, it is a valid choice for the regression function. Let us demonstrate that the margin Assumption 1 holds for an appropriate choice of $m$ and $w$ . Define $x_{0}=(\tfrac{1}{2q},\ldots,\tfrac{1}{2q})^{\top}$ , then for every $\sigma\in\{-1,1\}^{m}$ we have

[TABLE]

as long as $b^{\prime}\leq 3/24$ we can continue as

[TABLE]

Therefore, if $mw$ is of order $q^{-\alpha\beta}$ the margin assumption is satisfied with $\delta_{0}=1/12$ . The strong density assumption can be checked similarly to Audibert_Tsybakov07 . To finish the proof, for every $\sigma\in\{-1,1\}^{m}$ we denote by $P^{\sigma}$ the distribution of $(X,Y)$ with the marginal $P_{X}$ and the regression function $\eta_{\sigma}$ . Thus, one can write for any $\hat{g}$

[TABLE]

where $\mathbb{E}^{\sigma}_{(\mathcal{D}_{n},\mathcal{D}_{N})}$ is the expectation taken w.r.t. the i.i.d. realizations of $\mathcal{D}_{n}$ and $\mathcal{D}_{N}$ from $P^{\sigma}$ and $P_{X}$ respectively, and $\operatorname{sign}(i)=1$ if $i>0$ and $\operatorname{sign}(i)=-1$ if $i<0$ . The rest of the proof is obtained following the proof of (Audibert04, Lemma 5.1) and in particular the chain of inequalities in (Audibert04, Eq. (6.26)). That is, we get for some $C>0$ independent of $N,n$

[TABLE]

Finally, we conclude by setting the parameters $m,w,q$ as

[TABLE]

Note that thanks to the condition $\alpha\beta\leq d$ such a choice is always valid for appropriately chosen constants $\bar{C},C^{\prime},C^{\prime\prime}$ .

Bibliography

(1) Audibert, J.Y.: Aggregated estimators and empirical complexity for least square regression. Ann. Inst. H. Poincaré Probab. Statist. 40 (6), 685–736 (2004)
(2) Audibert, J.Y., Tsybakov, A.B.: Fast learning rates for plug-in classifiers. Ann. Statist. 35 (2), 608–633 (2007)
(3) Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3 (Spec. Issue Comput. Learn. Theory), 463–482 (2002)
(4) Bobkov, S., Ledoux, M.: One-dimensional empirical measures, order statistics and Kantorovich transport distances (2016). To appear in the Memoirs of the Amer. Math. Soc.
(5) Chzhen, E., Denis, C., Hebiri, M.: Minimax semi-supervised confidence sets for multi-class classification (2019). Preprint, https://arxiv.org/abs/1904.12527
(6) Conte, S., Boor, C.: Elementary Numerical Analysis: An Algorithmic Approach, 3rd edn. McGraw-Hill Higher Education (1980)
(7) Dembczynski, K., Kotłowski, W., Koyejo, O., Natarajan, N.: Consistency analysis for binary classification revisited. In: ICML, pp. 961–969. JMLR.org (2017)
(8) Dvoretzky, A., Kiefer, J., Wolfowitz, J.: Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Statist. 27 (3), 642–669 (1956)
