Rademacher Complexity and Generalization Performance of Multi-category Margin Classifiers

Khadija Musayeva (ABC); Fabien Lauer (ABC); Yann Guermeur (ABC)

arXiv:1812.00584·math.ST·December 4, 2018·Neurocomputing

TL;DR

This paper derives a new risk bound for multi-category margin classifiers, improving the dependency on the number of categories by using Rademacher complexity and a novel combinatorial metric entropy bound.

Contribution

It introduces a new combinatorial metric entropy bound that enhances the theoretical understanding of generalization in multi-category margin classifiers.

Findings

01 Improved risk bounds with better dependency on the number of categories

02 Linking Rademacher complexity to metric entropy via chaining

03 Enhanced theoretical guarantees under minimal assumptions

Tables

Table 1: Assumptions made on the sample size $m$ and the number of categories $C$ with respect to the growth rate $d_{\mathcal{G}}$ of the fat-shattering dimensions in Hypothesis 1.

Growth rate	Assumptions
$d_{\mathcal{G}} \leqslant 2$	$m > C > 4$
$d_{\mathcal{G}} > 2$	$m \geqslant C^{1.2}$, $C > 4$

Equations

$$\forall(x,k)\in\mathcal{Z},\quad f_{g}(x,k)=\frac{1}{2}\left(g_{k}(x)-\max_{l\neq k}g_{l}(x)\right).$$

$$\forall t\in\mathbb{R},\quad \phi(t)=\mathds{1}_{\{t\leqslant 0\}}.$$

$$L(g)=\mathbb{E}_{Z}\left[\phi\left(f_{g}(Z)\right)\right]=P\left(dr_{g}(X)\neq Y\right).$$

$$\forall t\in\mathbb{R},\quad \phi_{\gamma}(t)=\mathds{1}_{\{t\leqslant 0\}}+\left(1-\frac{t}{\gamma}\right)\mathds{1}_{\{t\in(0,\gamma]\}}.$$

$$\forall(x,k)\in\mathcal{Z},\quad f_{g,\gamma}(x,k)=\max\left(0,\min\left(\gamma,f_{g}(x,k)\right)\right),$$

$$L_{\gamma}(g)=\mathbb{E}_{Z}\left[\phi_{\gamma}\left(f_{g,\gamma}(Z)\right)\right].$$

$$L_{\gamma,m}(g)=\frac{1}{m}\sum_{i=1}^{m}\phi_{\gamma}\left(f_{g,\gamma}(Z_{i})\right).$$

$$\hat{R}_{n}\left(\mathcal{F}\right)=\mathbb{E}_{\boldsymbol{\sigma}_{n}}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(T_{i})\;\middle|\;\mathbf{T}_{n}\right]$$

$$d_{p,\mathbf{t}_{n}}(f,f^{\prime})=\begin{cases}\left(\frac{1}{n}\sum_{i=1}^{n}\left|f(t_{i})-f^{\prime}(t_{i})\right|^{p}\right)^{\frac{1}{p}}, & \text{if } p\in[1,+\infty)\\ \max_{1\leqslant i\leqslant n}\left|f(t_{i})-f^{\prime}(t_{i})\right|, & \text{if } p=+\infty.\end{cases}$$

$$\mathcal{N}_{p}(\epsilon,\mathcal{F},n)=\sup_{\mathbf{t}_{n}\in\mathcal{T}^{n}}\mathcal{N}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right)$$

$$\mathcal{M}_{p}(\epsilon,\mathcal{F},n)=\sup_{\mathbf{t}_{n}\in\mathcal{T}^{n}}\mathcal{M}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right),$$

$$\forall i\in\left[\left[\,1,n\,\right]\right],\quad s_{i}\left(f_{\mathbf{s}_{n}}(t_{i})-v(t_{i})\right)\geqslant\gamma.$$

$$\forall\epsilon\in\left(0,M_{\mathcal{G}}\right],\quad \max_{1\leqslant k\leqslant C}\epsilon\text{-dim}\left(\mathcal{G}_{k}\right)\leqslant K_{\mathcal{G}}\epsilon^{-d_{\mathcal{G}}}.$$

$$\forall g\in\mathcal{G},\quad L(g)\leqslant L_{\gamma,m}(g)+\frac{2}{\gamma}R_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)+\sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2m}}.$$

$$\hat{R}_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)\leqslant$$

$$\forall p\in[1,+\infty],\quad \ln\mathcal{N}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},d_{p,\mathbf{z}_{m}}\right)\leqslant\sum_{k=1}^{C}\ln\mathcal{N}\left(\frac{\epsilon}{C^{1/p}},\mathcal{G}_{k},d_{p,\mathbf{x}_{m}}\right),$$

$$\ln\mathcal{N}_{p}(\epsilon,\mathcal{F},n)\leqslant 2d\left(\frac{\epsilon}{15p}\right)\ln\left(\frac{15epnM_{\mathcal{F}}}{d\left(\frac{\epsilon}{15p}\right)\epsilon}\right);$$

$$\ln\mathcal{N}_{p}(\epsilon,\mathcal{F},n)\leqslant 10p\,d\left(\frac{\epsilon}{36p}\right)\ln\left(\frac{7p^{\frac{1}{7}}M_{\mathcal{F}}}{\epsilon}\right).$$

$$\ln\mathcal{N}_{p}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},m\right)\leqslant 2C\,d\left(\frac{\epsilon}{30\log_{2}(2C)}\right)\ln\left(\frac{30em\log_{2}(2C)M_{\mathcal{G}}}{\epsilon}\right),$$

$$\ln\mathcal{N}_{p}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},m\right)\leqslant 10C\log_{2}(2C)\,d\left(\frac{\epsilon}{72\log_{2}(2C)}\right)\ln\left(\frac{14\log_{2}^{\frac{1}{7}}(2C)M_{\mathcal{G}}}{\epsilon}\right).$$

$$R_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)\leqslant$$

$$\times\begin{cases}\left(\ln(C)\right)^{\frac{d_{\mathcal{G}}}{2}+\frac{1}{2}}, & \text{if } 0<d_{\mathcal{G}}<2,\\ \ln(C)\ln\left(\frac{m}{C}\right)\ln^{\frac{1}{2}}\left(\frac{m\ln^{\frac{2}{3}}(C)}{C^{\frac{1}{3}}}\right), & \text{if } d_{\mathcal{G}}=2,\\ m^{\frac{1}{2}-\frac{1}{d_{\mathcal{G}}}}\left(\ln(C)\right)^{2-\frac{d_{\mathcal{G}}}{2}}\ln^{\frac{1}{2}}\left(\frac{m^{1+\frac{1}{d_{\mathcal{G}}}}}{\ln(C)}\right), & \text{if } d_{\mathcal{G}}>2\text{ and }m\geqslant C^{1.2}.\end{cases}$$

$$\forall r>q>0,\quad \mathcal{N}\left(\epsilon,\mathcal{F},d_{q,\mathbf{t}_{n}}\right)\leqslant\mathcal{N}\left(\epsilon,\mathcal{F},d_{r,\mathbf{t}_{n}}\right)$$

$$\forall f,f^{\prime}\in\mathcal{F},\quad d_{q,\mathbf{t}_{n}}(f,f^{\prime})\leqslant d_{r,\mathbf{t}_{n}}(f,f^{\prime}).$$

$$\hat{R}_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)\leqslant$$

$$\times\left[d\left(\frac{\gamma 2^{-\alpha(d_{\mathcal{G}})j}}{72\log_{2}(2C)}\right)\ln\left(\frac{14M_{\mathcal{G}}\log_{2}^{\frac{1}{7}}(2C)}{\gamma 2^{-\alpha(d_{\mathcal{G}})j}}\right)\right]^{1/2}$$

$$\leqslant$$

$$\times\sum_{j=1}^{N}2^{-\alpha(d_{\mathcal{G}})\left(1-\frac{d_{\mathcal{G}}}{2}\right)j}\ln^{\frac{1}{2}}\left(\frac{14M_{\mathcal{G}}\log_{2}^{\frac{1}{7}}(2C)}{\gamma 2^{-\alpha(d_{\mathcal{G}})j}}\right).$$

$$\hat{R}_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)\leqslant$$

$$\times\sum_{j=1}^{N}2^{-j}\ln^{\frac{1}{2}}\left(\frac{14M_{\mathcal{G}}\log_{2}^{\frac{1}{7}}(2C)}{\gamma 2^{-\frac{2}{2-d_{\mathcal{G}}}j}}\right)$$

$$=$$

$$\times\sum_{j=1}^{N}\left(2^{-j}-2^{-j-1}\right)\ln^{\frac{1}{2}}\left(\frac{14M_{\mathcal{G}}\log_{2}^{\frac{1}{7}}(2C)}{\gamma 2^{-\frac{2}{2-d_{\mathcal{G}}}j}}\right).$$

$$\hat{R}_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)\leqslant$$

$$\times\int_{0}^{1/2}\ln^{\frac{1}{2}}\left(\frac{14M_{\mathcal{G}}\log_{2}^{\frac{1}{7}}(2C)}{\gamma\epsilon^{\frac{2}{2-d_{\mathcal{G}}}}}\right)d\epsilon.$$

Full text

Abstract

One of the main open problems in the theory of multi-category margin classification is the form of the optimal dependency of a guaranteed risk on the number $C$ of categories, the sample size $m$ and the margin parameter $\gamma$ . From a practical point of view, the theoretical analysis of generalization performance contributes to the development of new learning algorithms. In this paper, we focus only on the theoretical aspect of the question posed. More precisely, under minimal learnability assumptions, we derive a new risk bound for multi-category margin classifiers. We improve the dependency on $C$ over the state of the art when the margin loss function considered satisfies the Lipschitz condition. We start with the basic supremum inequality that involves a Rademacher complexity as a capacity measure. This capacity measure is then linked to the metric entropy through the chaining method. In this context, our improvement is based on the introduction of a new combinatorial metric entropy bound.

1 Introduction

Although the theory of binary pattern classification is well established [1, 2], the theory of multi-category classification is far from complete. Research in this area addresses problems such as the sample-complexity analysis of empirical risk minimization algorithms [3], or the consistency analysis of multi-class loss functions and of specific families of classifiers [4]. Another open question is the optimal dependency of guaranteed risks of multi-category classifiers on the number $C$ of categories and the sample size $m$. This question is all the more pressing for problems that involve a large number of classes. When the classifiers considered are margin classifiers that take their decisions based on a score per category, the dependency on the margin parameter $\gamma$ also becomes relevant to the characterization of their generalization performance. While this question has mainly been studied for specific families of classifiers, such as $k$-nearest neighbors [5], kernel methods [6, 7] and decision trees [8], tackling it under minimal learnability assumptions remains a challenging task. This paper focuses on obtaining guaranteed risks under such assumptions.

The first step in the derivation of risk bounds is the choice of the margin loss function. Two families of margin loss functions can be distinguished: indicator margin loss functions and those that satisfy the Lipschitz condition. Deriving guaranteed risks with the optimal dependency on the parameters of interest is relatively straightforward in the first case [9]. The family of Lipschitz continuous loss functions, on the other hand, offers a richer setting to this task. In this case, one can obtain a guaranteed risk whose control term involves a Rademacher complexity [10]. Then a sequence of transitions between capacity measures is performed. More precisely, using the chaining method one can control the Rademacher complexity of a function class through the sum of its metric entropies [11]. A combinatorial bound is then used to estimate the metric entropy of the class in terms of its combinatorial dimension. In this sequence of transitions, one can choose the capacity measure at the level of which to reduce the multi-class problem to an ensemble of bi-class ones, that is, to perform a decomposition. Performing a decomposition for Rademacher complexity, a linear dependency on $C$ was obtained in [8]. This dependency has been improved to a sublinear one in [9] by postponing the decomposition to the level of metric entropy.

In this paper, we follow exactly the pathway of [9]. Our contribution is based on the following line of reasoning. Theorem 7 of [9] provides a sublinear (but still close to linear) dependency on $C$ using a decomposition result for metric entropies (Lemma 1 of [9]) in $L_{p}$-norm with $p=2$ and the combinatorial metric entropy bound of [12]. On the other hand, using the decomposition result with $p=\infty$ and the $L_{\infty}$-norm metric entropy bound of [13], one can obtain a radical (square-root) dependency on $C$, but at the expense of a degraded dependency on $m$. Hence, we consider the values of $p$ between these two extremes, and extend the $L_{2}$-norm bound of [12] to $L_{p}$-norms with integer $p>2$. When applied in the chaining, this extension yields an improved dependency on $C$ over that of Theorem 7 of [9]. Specifically, we obtain a radical dependency on $C$ (up to logarithmic factors) without worsening the dependencies on $m$ and $\gamma$.

The paper is organized as follows. In the next section, we introduce the theoretical framework and describe the transitions between the capacity measures. Then, Section 3 gives the new combinatorial metric entropy bound, whose proof can be found in Appendix A. In Section 4, we demonstrate how this result can be applied in the chaining to derive an improved upper bound on the Rademacher complexity. Conclusions and ongoing research are highlighted in Section 5. All intermediate results used in the proofs are collected in Appendix B.

Notation

We denote the set of strictly positive reals by $\mathbb{R}_{+}$, and let $\mathbb{N}^{*}=\mathbb{N}\setminus\{0\}$. $\left[\left[\,i,j\,\right]\right]$ stands for the set of integers from $i$ to $j$. $\mathds{1}_{A}$ stands for the indicator function of the event $A$, such that $\mathds{1}_{A}=1$ if $A$ occurs, and $\mathds{1}_{A}=0$ otherwise. $\lfloor x\rfloor$ is the greatest integer less than or equal to $x$, and $\lceil x\rceil$ is the smallest integer greater than or equal to $x$.

2 Theoretical Framework

We consider $C$-category pattern classification problems with $C\geqslant 3$. Each object is represented by its description $x\in\mathcal{X}$ and the categories $y$ belong to $\mathcal{Y}=\left[\left[\,1,C\,\right]\right]$. We assume that $\left(\mathcal{X},\mathcal{A}_{\mathcal{X}}\right)$ and $\left(\mathcal{Y},\mathcal{A}_{\mathcal{Y}}\right)$ are measurable spaces. Denote by $\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}}$ the product sigma-algebra on $\mathcal{X}\times\mathcal{Y}$. We assume that the link between descriptions and categories can be characterized by an unknown probability measure $P$ on the measurable space $\left(\mathcal{X}\times\mathcal{Y},\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}}\right)$. Let $Z=\left(X,Y\right)$ be a random pair with values in $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}$, distributed according to $P$. The available information on $P$ is limited to an $m$-sample $\mathbf{Z}_{m}=\left(Z_{i}\right)_{1\leqslant i\leqslant m}=\left(\left(X_{i},Y_{i}\right)\right)_{1\leqslant i\leqslant m}$ distributed according to $P^{m}$. In the following, we distinguish the sample size $m$ from the generic notation $n$, which stands for the number of points in a set that need not be a realization of a random sample.

We consider multi-category margin classifiers that take their decisions based on a score per category and focus on those that implement classes of functions with values in a hypercube of $\mathbb{R}^{C}$ (thus, in contrast to [7], no correlation assumption is made on the component functions). Most well-known classifiers, such as neural networks [14], support vector machines [4], and nearest neighbors [5] are margin classifiers.

Definition 1 (Multi-category margin classifiers).

Let $\mathcal{G}=\prod_{k=1}^{C}\mathcal{G}_{k}$ be a class of functions from $\mathcal{X}$ into $\left[-M_{\mathcal{G}},M_{\mathcal{G}}\right]^{C}$ with $M_{\mathcal{G}}\in\left[1,+\infty\right)$ . For each $g=\left(g_{k}\right)_{1\leqslant k\leqslant C}\in\mathcal{G}$ , $dr_{g}$ is a multi-category margin classifier such that for all $x\in\mathcal{X}$ , $dr_{g}(x)=\operatornamewithlimits{argmax}_{1\leqslant k\leqslant C}g_{k}(x)$ , breaking ties with a dummy category $*$ .

To sidestep the complications that might arise from the measurability of a supremum of an uncountable set, we assume that the classes $\mathcal{G}_{k}$ , and in general, all sets of functions considered in the sequel satisfy the “image admissibility Suslin” condition [15, page 101].

The classification performance of margin classifiers can be characterized based on the following functions.

Definition 2 (Class $\mathcal{F}_{\mathcal{G}}$ of margin functions).

Let $\mathcal{G}$ be as in Definition 1. For any $g\in\mathcal{G}$ , the margin function $f_{g}:\mathcal{Z}\rightarrow[-M_{\mathcal{G}},M_{\mathcal{G}}]$ is

$$\forall(x,k)\in\mathcal{Z},\quad f_{g}(x,k)=\frac{1}{2}\left(g_{k}(x)-\max_{l\neq k}g_{l}(x)\right).$$

Then, we define $\mathcal{F}_{\mathcal{G}}=\left\{f_{g}:g\in\mathcal{G}\right\}.$

Given $g\in\mathcal{G}$ , $dr_{g}$ misclassifies $\left(x,y\right)$ if $dr_{g}(x)\neq y$ , or equivalently, if $f_{g}\left(x,y\right)\leqslant 0$ . The goal of the learning process is to minimize the probability of error or risk over $\mathcal{G}$ .
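As a minimal sketch of Definition 2 and of the misclassification criterion, the margin function can be evaluated on an array of scores. The sketch assumes the scores $g_{k}(x)$ are stored row-wise in a NumPy array and that labels are 0-based (the paper indexes categories from 1); the name `margin_function` and the sample scores are hypothetical.

```python
import numpy as np

def margin_function(scores, labels):
    """Return f_g(x, k) = 0.5 * (g_k(x) - max_{l != k} g_l(x)) for each row."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(scores.shape[0])
    own = scores[rows, labels]
    masked = scores.copy()
    masked[rows, labels] = -np.inf        # exclude the labelled category
    best_other = masked.max(axis=1)
    return 0.5 * (own - best_other)

# Hypothetical scores for 3 examples and C = 3 categories (0-based labels).
scores = np.array([[0.9, 0.1, -0.3],
                   [0.2, 0.6, 0.6],
                   [-0.5, 0.4, 0.0]])
labels = np.array([0, 1, 1])
margins = margin_function(scores, labels)
errors = margins <= 0                     # a tie counts as an error
```

The second example is a tie between two categories, so its margin is zero and it is counted as misclassified, in line with the dummy-category convention of Definition 1.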

Definition 3 (Risk $L$ ).

Let $\mathcal{G}$ be as in Definition 1. Let $\phi$ be the standard indicator loss function defined as

$$\forall t\in\mathbb{R},\quad \phi(t)=\mathds{1}_{\{t\leqslant 0\}}.$$

For any $g\in\mathcal{G}$ , its risk $L(g)$ is

$$L(g)=\mathbb{E}_{Z}\left[\phi\left(f_{g}(Z)\right)\right]=P\left(dr_{g}(X)\neq Y\right).$$

To make use of the values of functions $f_{g}$ (and not just of their signs) in the assessment of the classification performance, we appeal to the following margin loss function.

Definition 4 (Parameterized truncated hinge loss function $\phi_{\gamma}$ ).

For any $\gamma\in(0,1]$ , the parameterized truncated hinge loss function $\phi_{\gamma}$ is defined as

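$$\forall t\in\mathbb{R},\quad \phi_{\gamma}(t)=\mathds{1}_{\{t\leqslant 0\}}+\left(1-\frac{t}{\gamma}\right)\mathds{1}_{\{t\in(0,\gamma]\}}.$$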

It is clear from the definition that $\phi_{\gamma}$ dominates the standard indicator loss function given in Definition 3 and that it is Lipschitz continuous (with constant $1/\gamma$). Observe that when this loss function is applied to $f_{g}$, the values of the latter strictly above $\gamma$ and below zero become irrelevant to the estimation of the classification accuracy. Taking advantage of this fact, we introduce functions $f_{g,\gamma}$ by restricting the codomain of $f_{g}$ to $[0,\gamma]$ for all $g\in\mathcal{G}$. In [8], a partial restriction is the main source of improvement upon the result of [10] in terms of the dependency on $C$. The use of the set of functions $f_{g,\gamma}$ leads to an even finer bound, this time in terms of the diameter of the function class, as we switch from $2M_{\mathcal{G}}$ to $\gamma$.

Definition 5 (Class $\mathcal{F}_{\mathcal{G},\gamma}$ of truncated margin functions).

Let $\mathcal{F}_{\mathcal{G}}$ be a class of functions satisfying Definition 2. Fix $\gamma\in(0,1]$ . For any $f_{g}\in\mathcal{F}_{\mathcal{G}}$ , we define $f_{g,\gamma}:\mathcal{Z}\rightarrow[0,\gamma]$ as

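$$\forall(x,k)\in\mathcal{Z},\quad f_{g,\gamma}(x,k)=\max\left(0,\min\left(\gamma,f_{g}(x,k)\right)\right),$$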

and $\mathcal{F}_{\mathcal{G},\gamma}=\{f_{g,\gamma}:g\in\mathcal{G}\}$ .

For any $g\in\mathcal{G}$, its risk $L(g)$ can be upper bounded by the margin risk $L_{\gamma}(g)$ obtained on the basis of the loss function $\phi_{\gamma}$. It is the estimate of $L_{\gamma}$ based on the $m$-sample $\mathbf{Z}_{m}$ that appears in our guaranteed risk.

Definition 6 (Margin risk $L_{\gamma}$ and empirical margin risk $L_{\gamma,m}$ ).

Let $\mathcal{G}$ be a class of functions satisfying Definition 1. Let $\phi_{\gamma}$ be as in Definition 4. Then, for $\gamma\in(0,1]$ , the margin risk $L_{\gamma}$ associated with any $g\in\mathcal{G}$ is

$$L_{\gamma}(g)=\mathbb{E}_{Z}\left[\phi_{\gamma}\left(f_{g,\gamma}(Z)\right)\right].$$

Its $m$ -sample $\mathbf{Z}_{m}$ based estimate is the empirical margin risk defined as

$$L_{\gamma,m}(g)=\frac{1}{m}\sum_{i=1}^{m}\phi_{\gamma}\left(f_{g,\gamma}(Z_{i})\right).$$
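The truncation of Definition 5, the loss of Definition 4 and the empirical margin risk of Definition 6 can be combined in a few lines. This is only an illustrative sketch: the function names and the sample margins are hypothetical, and the margins are assumed to be precomputed values of $f_{g}$ on the sample.

```python
import numpy as np

def truncate_margin(f, gamma):
    """f_{g,gamma} = max(0, min(gamma, f_g)): clip margins to [0, gamma]."""
    return np.clip(np.asarray(f, dtype=float), 0.0, gamma)

def phi_gamma(t, gamma):
    """Truncated hinge loss: 1 if t <= 0, 1 - t/gamma on (0, gamma], 0 above."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= 0, 1.0, np.where(t <= gamma, 1.0 - t / gamma, 0.0))

def empirical_margin_risk(margins, gamma):
    """Average of phi_gamma over the truncated margins of the sample."""
    return float(np.mean(phi_gamma(truncate_margin(margins, gamma), gamma)))

# Hypothetical margin values f_g(Z_i) for a sample of size m = 4.
margins = np.array([0.4, 0.0, 0.2, -0.3])
risk = empirical_margin_risk(margins, gamma=0.4)
zero_one_error = float(np.mean(margins <= 0))
```

On this sample the empirical margin risk is at least the empirical error rate, which illustrates that $\phi_{\gamma}$ dominates the indicator loss.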

In what follows, we give the definitions of the capacity measures we use and outline the transitions between them, which are at the basis of the derivation of our result. We use $\mathcal{F}$ to denote a uniformly bounded class of functions on a generic measurable space $\left(\mathcal{T},\mathcal{A}_{\mathcal{T}}\right)$ . First, we recall the definition of the Rademacher complexity.

Definition 7 (Rademacher complexity).

Let $P_{\mathcal{T}}$ be a probability measure on $\left(\mathcal{T},\mathcal{A}_{\mathcal{T}}\right)$ and $\mathbf{T}_{n}=(T_{i})_{1\leqslant i\leqslant n}$ a sequence of independent random variables with values in $\mathcal{T}$, distributed according to $P_{\mathcal{T}}$. Let $\boldsymbol{\sigma}_{n}=\left(\sigma_{i}\right)_{1\leqslant i\leqslant n}$ be a Rademacher sequence, i.e., a sequence of independent random variables uniformly distributed in $\{-1,+1\}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ given $\mathbf{T}_{n}$ is defined as

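$$\hat{R}_{n}\left(\mathcal{F}\right)=\mathbb{E}_{\boldsymbol{\sigma}_{n}}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(T_{i})\;\middle|\;\mathbf{T}_{n}\right]$$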

and its Rademacher complexity is $R_{n}\left(\mathcal{F}\right)=\mathbb{E}_{\mathbf{T}_{n}}\left[\hat{R}_{n}\left(\mathcal{F}\right)\right].$
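For a finite class evaluated on a fixed sample, the empirical Rademacher complexity can be approximated by Monte Carlo sampling of the Rademacher sequence. The sketch below assumes the class is given as a matrix of function values; the name `empirical_rademacher`, the number of draws and the toy class are illustrative choices, not part of the paper.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(t_i) ].

    values[j, i] holds f_j(t_i) for a finite class {f_j} on points t_1..t_n.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # correlations[d, j] = (1/n) * sum_i sigma[d, i] * f_j(t_i)
    correlations = sigma @ values.T / n
    return float(correlations.max(axis=1).mean())

# Toy class: the constant functions +1 and -1 on n = 4 points.
values = np.array([[1.0, 1.0, 1.0, 1.0],
                   [-1.0, -1.0, -1.0, -1.0]])
r_pair = empirical_rademacher(values)
r_single = empirical_rademacher(values[:1])
```

With the same seed the two calls use the same sign draws, so enlarging the class can only increase the estimate, mirroring the monotonicity of the supremum.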

The capacity measures central in the derivation of our result are covering/packing numbers. Their definitions require the introduction of the following empirical pseudo-metrics: for any $f,f^{\prime}\in\mathcal{F}$ and $\mathbf{t}_{n}=(t_{i})_{1\leqslant i\leqslant n}\in\mathcal{T}^{n}$ ,

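$$d_{p,\mathbf{t}_{n}}(f,f^{\prime})=\begin{cases}\left(\frac{1}{n}\sum_{i=1}^{n}\left|f(t_{i})-f^{\prime}(t_{i})\right|^{p}\right)^{\frac{1}{p}}, & \text{if } p\in[1,+\infty)\\ \max_{1\leqslant i\leqslant n}\left|f(t_{i})-f^{\prime}(t_{i})\right|, & \text{if } p=+\infty.\end{cases}$$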

Definition 8 (Covering numbers, metric entropy, packing numbers).

The $L_{p}$ -norm $\epsilon$ -covering number of $\mathcal{F}$ , $\mathcal{N}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right)$ , is the smallest cardinality of the $\epsilon$ -nets of $\mathcal{F}$ , i.e., subsets $\bar{\mathcal{F}}\subseteq\mathcal{F}$ such that $\forall f\in\mathcal{F}$ there exists $\bar{f}\in\bar{\mathcal{F}}$ such that $d_{p,\mathbf{t}_{n}}(f,\bar{f})<\epsilon$ . The logarithm of $\mathcal{N}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right)$ is the metric entropy of $\mathcal{F}$ . A subset $\bar{\mathcal{F}}$ of $\mathcal{F}$ is $\epsilon$ -separated with respect to $d_{p,\mathbf{t}_{n}}$ if, for any two distinct elements $f,f^{\prime}\in\bar{\mathcal{F}}$ , $d_{p,\mathbf{t}_{n}}(f,f^{\prime})\geqslant\epsilon$ . The $\epsilon$ -packing number of $\mathcal{F}$ , $\mathcal{M}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right)$ , is the maximal cardinality of its $\epsilon$ -separated subsets. The uniform covering and packing numbers are

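$$\mathcal{N}_{p}(\epsilon,\mathcal{F},n)=\sup_{\mathbf{t}_{n}\in\mathcal{T}^{n}}\mathcal{N}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right)$$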

and

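$$\mathcal{M}_{p}(\epsilon,\mathcal{F},n)=\sup_{\mathbf{t}_{n}\in\mathcal{T}^{n}}\mathcal{M}\left(\epsilon,\mathcal{F},d_{p,\mathbf{t}_{n}}\right),$$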

respectively.
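To make Definition 8 concrete, the following sketch computes the empirical pseudo-metric and greedily extracts an $\epsilon$-separated subset of a finite class. A subset obtained this way cannot be extended, so it is also an $\epsilon$-net; its size therefore lies between the covering number and the packing number. The function names and the toy data are hypothetical.

```python
import numpy as np

def d_p(u, v, p):
    """Empirical pseudo-metric between two vectors of function values."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float(np.mean(diff ** p) ** (1.0 / p))

def greedy_separated_subset(values, eps, p):
    """Indices of rows that are pairwise at d_p-distance >= eps (greedy)."""
    chosen = []
    for idx, row in enumerate(values):
        if all(d_p(row, values[j], p) >= eps for j in chosen):
            chosen.append(idx)
    return chosen

# Toy class of three functions evaluated on n = 2 points.
values = np.array([[0.0, 0.0],
                   [0.1, 0.1],
                   [1.0, 1.0]])
separated = greedy_separated_subset(values, eps=0.5, p=2)

# The pseudo-metrics are ordered in p on a fixed pair of functions.
d1 = d_p([0.0, 0.0], [1.0, 0.0], 1)
d2 = d_p([0.0, 0.0], [1.0, 0.0], 2)
dinf = d_p([0.0, 0.0], [1.0, 0.0], float("inf"))
```

The greedy procedure gives no guarantee of reaching the packing number itself; it only produces one maximal separated subset.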

The capacity measures appearing last in our bounds are combinatorial dimensions. They provide useful information about whether the class of interest uniformly satisfies the classical limit theorems [16].

Definition 9 (Fat-shattering dimension [17], strong dimension [13]).

For $\gamma\in\mathbb{R}_{+}$ , a subset $S=\left\{t_{i}:1\leqslant i\leqslant n\right\}$ of $\mathcal{T}$ is said to be ${\gamma}$ -shattered by $\mathcal{F}$ if there is a function $v:S\rightarrow\mathbb{R}$ such that, for every vector $\mathbf{s}_{n}=\left(s_{i}\right)_{1\leqslant i\leqslant n}\in\left\{-1,1\right\}^{n}$ , there is a function $f_{\mathbf{s}_{n}}\in\mathcal{F}$ satisfying

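$$\forall i\in\left[\left[\,1,n\,\right]\right],\quad s_{i}\left(f_{\mathbf{s}_{n}}(t_{i})-v(t_{i})\right)\geqslant\gamma.$$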

The fat-shattering dimension of $\mathcal{F}$ at scale $\gamma$ , $\gamma\mbox{-dim}\left(\mathcal{F}\right)$ , is the maximal cardinality of a subset of $\mathcal{T}$ ${\gamma}$ -shattered by $\mathcal{F}$ , if such a maximum exists. Otherwise, $\gamma\mbox{-dim}\left(\mathcal{F}\right)=\infty$ . For a class $\mathcal{F}$ of integer valued functions, the notion of strong dimension, $S\mbox{-dim}\left(\mathcal{F}\right)$ , is obtained from the definition of the fat-shattering dimension by setting $\gamma=1$ and restricting the co-domain of $v$ to $\mathbb{Z}$ .
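For a finite class and a small set of points, the shattering condition of Definition 9 can be checked by brute force over all sign vectors. The sketch below tests the condition for one given witness function $v$ only, whereas the definition asks for the existence of some witness; the name `is_gamma_shattered` and the toy class are hypothetical.

```python
import itertools
import numpy as np

def is_gamma_shattered(values, witness, gamma):
    """Check gamma-shattering of n points for a fixed witness.

    values[j, i] = f_j(t_i) for a finite class; witness[i] = v(t_i).
    """
    values = np.asarray(values, dtype=float)
    witness = np.asarray(witness, dtype=float)
    n = values.shape[1]
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        s = np.array(signs)
        # Does some f_j satisfy s_i * (f_j(t_i) - v(t_i)) >= gamma for all i?
        realised = np.all(s * (values - witness) >= gamma, axis=1)
        if not realised.any():
            return False
    return True

# Toy class realising every sign pattern at height 0.5 on n = 2 points.
values = np.array([[0.5, 0.5], [0.5, -0.5], [-0.5, 0.5], [-0.5, -0.5]])
shattered_at_half = is_gamma_shattered(values, witness=[0.0, 0.0], gamma=0.5)
shattered_above = is_gamma_shattered(values, witness=[0.0, 0.0], gamma=0.6)
```

The cost grows as $2^{n}$ times the class size, so this check is only practical for very small examples.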

As in [9, 18, 19], we make the hypothesis that the fat-shattering dimensions of the classes $\mathcal{G}_{k}$ , $\gamma\mbox{-dim}\left(\mathcal{G}_{k}\right)$ , grow no faster than polynomially with $\gamma^{-1}$ .

Hypothesis 1.

Let $\mathcal{G}$ be a class of functions satisfying Definition 1. We assume that there exists a pair $\left(K_{\mathcal{G}},d_{\mathcal{G}}\right)\in\mathbb{R}^{2}_{+}$ such that

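$$\forall\epsilon\in\left(0,M_{\mathcal{G}}\right],\quad \max_{1\leqslant k\leqslant C}\epsilon\text{-dim}\left(\mathcal{G}_{k}\right)\leqslant K_{\mathcal{G}}\epsilon^{-d_{\mathcal{G}}}.$$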

Among the well-known examples of classifiers that satisfy such an assumption are support vector machines with $d_{\mathcal{G}}=2$ (Theorem 4.6 in [20]) and feedforward neural networks with $d_{\mathcal{G}}=2l$ for $l$ layers (Corollary 27 in [2]). It should be noted that Lipschitz classifiers, such as nearest neighbours also satisfy this assumption as demonstrated by Corollary 4 in [21]. Depending on the growth rate $d_{\mathcal{G}}$ , our assumptions regarding the data are summarized in Table 1.

Our starting point is the following basic supremum inequality that bounds the risk by the empirical margin risk plus a control term based on a Rademacher complexity.

Theorem 1 (Theorem 5 in [9]).

Let $\mathcal{G}$ be a class of functions satisfying Definition 1. For $\gamma\in\left(0,1\right]$, let $\mathcal{F}_{\mathcal{G},\gamma}$ be the class of functions deduced from $\mathcal{G}$ according to Definition 5. For fixed $\gamma\in(0,1]$ and $\delta\in(0,1)$, with $P^{m}$-probability at least $1-\delta$,

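$$\forall g\in\mathcal{G},\quad L(g)\leqslant L_{\gamma,m}(g)+\frac{2}{\gamma}R_{m}\left(\mathcal{F}_{\mathcal{G},\gamma}\right)+\sqrt{\frac{\ln\left(\frac{1}{\delta}\right)}{2m}}.$$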
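The guaranteed risk of Theorem 1 is the sum of the empirical margin risk, the Rademacher complexity scaled by $2/\gamma$, and a confidence term. The sketch below assembles these three terms, assuming the confidence term has the square-root form $\sqrt{\ln(1/\delta)/(2m)}$ that is standard for bounded losses; the function name and the numerical inputs are hypothetical, and the Rademacher complexity is taken as a given number.

```python
import math

def guaranteed_risk(emp_margin_risk, rademacher, gamma, delta, m):
    """Empirical margin risk + (2/gamma) * R_m + sqrt(ln(1/delta) / (2m)).

    The square-root confidence term is an assumption of this sketch.
    """
    if not (0.0 < gamma <= 1.0 and 0.0 < delta < 1.0 and m > 0):
        raise ValueError("need gamma in (0, 1], delta in (0, 1), m > 0")
    confidence = math.sqrt(math.log(1.0 / delta) / (2.0 * m))
    return emp_margin_risk + 2.0 * rademacher / gamma + confidence

# Hypothetical values: 10% empirical margin risk, R_m = 0.05, m = 10000.
bound = guaranteed_risk(0.10, 0.05, gamma=0.5, delta=0.05, m=10_000)
```

With these inputs the complexity term dominates, which is why the dependency of the Rademacher complexity on $C$, $m$ and $\gamma$ is the focus of the analysis.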

We perform the following sequence of transitions between the capacity measures to derive our result. First, we relate the empirical Rademacher complexity of $\mathcal{F}_{\mathcal{G},\gamma}$ to its metric entropy through the chaining method (see [11]). More precisely, we use the following formulation of the chaining bound due to [9]:

[TABLE]

where $N\in\mathbb{N}^{*}$ and $h:\mathbb{N}\rightarrow\mathbb{R}_{+}$ is a decreasing function satisfying $h(0)\geqslant\gamma$. Next, using Lemma 1 in [9], we decompose the metric entropy of $\mathcal{F}_{\mathcal{G},\gamma}$ in terms of those of the classes $\mathcal{G}_{k}$:

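$$\forall p\in[1,+\infty],\quad \ln\mathcal{N}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},d_{p,\mathbf{z}_{m}}\right)\leqslant\sum_{k=1}^{C}\ln\mathcal{N}\left(\frac{\epsilon}{C^{1/p}},\mathcal{G}_{k},d_{p,\mathbf{x}_{m}}\right),\qquad(2)$$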

where $\mathbf{x}_{m}=(x_{i})_{1\leqslant i\leqslant m}\in\mathcal{X}^{m}$ . Finally, our combinatorial bound derived below gives an estimate on the metric entropies of the classes $\mathcal{G}_{k}$ in terms of their fat-shattering dimensions.

3 $L_{p}$ -norm Combinatorial Metric Entropy Bound

We extend the $L_{2}$ -norm metric entropy bound of [12] to $L_{p}$ -norms with $p\in\mathbb{N}^{*}\setminus\{1,2\}$ . The bound of [12] does not depend on the sample size thanks to the use of the probabilistic extraction principle. In our extension we derive two bounds. In one of them, we keep the dependency on the sample size, and in the other, we remove it using the $L_{p}$ -norm generalization of the aforementioned principle. Under Hypothesis 1, depending on the value of $d_{\mathcal{G}}$ , the application of one or the other bound in the chaining allows us to optimize the dependency on $C$ while not degrading the ones on $m$ and $\gamma$ , as will be seen in Section 4.

Specifically, we have the following $L_{p}$-norm metric entropy bounds, whose proofs are given in Appendix A.

Theorem 2.

Let $\mathcal{F}$ be a class of functions from $\mathcal{T}$ into $\left[-M_{\mathcal{F}},M_{\mathcal{F}}\right]$ with $M_{\mathcal{F}}\in[1,+\infty)$ . For $\epsilon\in\left(0,M_{\mathcal{F}}\right]$ , let $d\left(\epsilon\right)=\epsilon\text{-dim}\left(\mathcal{F}\right)$ . For all values of $p\in\mathbb{N}^{*}\setminus\{1,2\}$ and $\epsilon\in\left(0,M_{\mathcal{F}}\right]$ ,

(a) if $n\geqslant d\left(\frac{\epsilon}{15p}\right)$ , then

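$$\ln\mathcal{N}_{p}(\epsilon,\mathcal{F},n)\leqslant 2d\left(\frac{\epsilon}{15p}\right)\ln\left(\frac{15epnM_{\mathcal{F}}}{d\left(\frac{\epsilon}{15p}\right)\epsilon}\right);$$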

(b) if $n\geqslant d\left(\frac{\epsilon}{37p}\right)$ , then

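$$\ln\mathcal{N}_{p}(\epsilon,\mathcal{F},n)\leqslant 10p\,d\left(\frac{\epsilon}{36p}\right)\ln\left(\frac{7p^{\frac{1}{7}}M_{\mathcal{F}}}{\epsilon}\right).$$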

From (2) one can see that, based on $\displaystyle{C^{\frac{1}{p}}=2^{\left(\frac{1}{p}\log_{2}(C)\right)}}$ , the dependency on $C$ in the scale of covering numbers can be eliminated for all $p\geqslant\log_{2}(C)$ . The combination of the decomposition formula (2) with Theorem 2 using $p=\lceil\log_{2}(C)\rceil$ for $C>4$ yields the following result.

Corollary 1.

Let $\mathcal{G}$ be a class of functions as in Definition 1. For $\gamma\in\left(0,1\right]$ , let $\mathcal{F}_{\mathcal{G},\gamma}$ be the class of functions deduced from $\mathcal{G}$ according to Definition 5. For $\epsilon\in\left(0,M_{\mathcal{G}}\right]$ , let $d\left(\epsilon\right)=\max_{1\leqslant k\leqslant C}\epsilon\text{-dim}\left(\mathcal{G}_{k}\right)$ . Then, for $\epsilon\in\left(0,\gamma\right]$ and $C>4$ ,

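$$\ln\mathcal{N}_{p}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},m\right)\leqslant 2C\,d\left(\frac{\epsilon}{30\log_{2}(2C)}\right)\ln\left(\frac{30em\log_{2}(2C)M_{\mathcal{G}}}{\epsilon}\right),\qquad(3)$$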

and

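$$\ln\mathcal{N}_{p}\left(\epsilon,\mathcal{F}_{\mathcal{G},\gamma},m\right)\leqslant 10C\log_{2}(2C)\,d\left(\frac{\epsilon}{72\log_{2}(2C)}\right)\ln\left(\frac{14\log_{2}^{\frac{1}{7}}(2C)M_{\mathcal{G}}}{\epsilon}\right).\qquad(4)$$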

Proof.

Inequality (3) follows from the application of (2) and part (a) of Theorem 2 (where we drop $d(\epsilon)$ from the denominator inside the logarithm, as it is no smaller than one), along with the facts that $C^{1/\lceil\log_{2}(C)\rceil}\leqslant 2$ (with equality when $C$ is a power of two) and $\lceil\log_{2}(C)\rceil<\log_{2}(2C)$. We obtain Inequality (4) in a similar way using part (b) of Theorem 2 instead. ∎
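The two elementary facts used in this proof can be checked numerically for the choice $p=\lceil\log_{2}(C)\rceil$. The sketch below uses integer arithmetic for the first fact, since $C^{1/p}$ equals $2$ exactly when $C$ is a power of two and a floating-point comparison would be fragile there; the helper names and the list of tested values of $C$ are hypothetical.

```python
import math

def ceil_log2(C):
    """Integer ceiling of log2(C) for C >= 2, without floating-point error."""
    return (C - 1).bit_length()

def scale_factor_checks(C):
    """Return p = ceil(log2 C) and the two facts used for Corollary 1."""
    p = ceil_log2(C)
    scale_ok = C <= 2 ** p             # equivalent to C**(1/p) <= 2
    order_ok = p < math.log2(2 * C)    # ceil(log2 C) < 1 + log2 C
    return p, scale_ok, order_ok

results = {C: scale_factor_checks(C) for C in (5, 8, 10, 100, 1000)}
```

For $C=8$ the chosen value is $p=3$ and $C^{1/p}=2$, so the scale $\epsilon/C^{1/p}$ in the decomposition is never smaller than $\epsilon/2$, whatever the number of categories.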

4 Bound on the Rademacher complexity

As noted in [18], under Hypothesis 1, the growth rate of the fat-shattering dimension has a dramatic effect on the behavior of the Rademacher complexity of the function class. The availability of two kinds of metric entropy bounds allows us to adapt to this effect in the chaining so as to optimize the dependency on $C$ without worsening those on $m$ and $\gamma$. Under the aforementioned hypothesis, two cases can be distinguished. For $d_{\mathcal{G}}\in(0,2)$, the formula (1) can be upper bounded by an integral and the use of the dimension-free bound (4) leads to the optimized result. For $d_{\mathcal{G}}\geqslant 2$, such a result is obtained from the application of (3) in (1). The second case is also characterized by the freedom to choose the number $N$ of steps used to construct the chaining. To optimize this construction when $d_{\mathcal{G}}>2$, we make the non-restrictive assumption that $m$ is greater than a small power of $C$.

Theorem 3.

Let $\mathcal{G}$ be a class of functions as in Definition 1. For $\gamma\in\left(0,1\right]$, let $\mathcal{F}_{\mathcal{G},\gamma}$ be the class of functions deduced from $\mathcal{G}$ according to Definition 5. Then, under Hypothesis 1, there is a function $K\left(\gamma,d_{\mathcal{G}},K_{\mathcal{G}}\right)$ such that for all $C>4$,

[TABLE]

Compared to Theorem 7 of [9], one can see that in all three cases, the dependency on $C$ is improved: the powers of $C$ are replaced by powers of $\ln(C)$ without degrading the dependencies on $m$ and $\gamma$. It is interesting to note that, in the third case, when $d_{\mathcal{G}}\geqslant 4$, which is true for instance for feedforward neural networks (see Corollary 27 in [2]), the dependency on $C$ is slightly better than radical. This is, however, at the cost of the constant factor $d^{d_{\mathcal{G}}}_{\mathcal{G}}$.

Proof of Theorem 3.

For all $j\in\mathbb{N}$, we set $h(j)=\gamma 2^{-\alpha(d_{\mathcal{G}})j}$ with $\alpha(d_{\mathcal{G}})>0$ for all $d_{\mathcal{G}}\in\mathbb{R}_{+}$ in (1). In the following, we use the relation

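$$\forall r>q>0,\quad \mathcal{N}\left(\epsilon,\mathcal{F},d_{q,\mathbf{t}_{n}}\right)\leqslant\mathcal{N}\left(\epsilon,\mathcal{F},d_{r,\mathbf{t}_{n}}\right)\qquad(5)$$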

which follows directly from the fact that

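$$\forall f,f^{\prime}\in\mathcal{F},\quad d_{q,\mathbf{t}_{n}}(f,f^{\prime})\leqslant d_{r,\mathbf{t}_{n}}(f,f^{\prime}).$$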
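The ordering of the empirical pseudo-metrics, and the resulting ordering of the covering numbers, can be justified with Jensen's inequality. The following worked step is a sketch; the shorthand $a_{i}$ is introduced here for convenience and is not part of the original notation.

```latex
% Shorthand (not in the original): a_i = |f(t_i) - f'(t_i)| for 1 <= i <= n.
% For 0 < q < r < +\infty, the map u -> u^{r/q} is convex on [0, +\infty),
% so Jensen's inequality applied to the uniform average gives
\left(\frac{1}{n}\sum_{i=1}^{n}a_{i}^{q}\right)^{\frac{r}{q}}
\leqslant\frac{1}{n}\sum_{i=1}^{n}a_{i}^{r}
\quad\Longrightarrow\quad
d_{q,\mathbf{t}_{n}}(f,f^{\prime})\leqslant d_{r,\mathbf{t}_{n}}(f,f^{\prime}).
% For r = +\infty, each a_i is at most \max_i a_i, which gives the same conclusion.
% Hence every \epsilon-net of \mathcal{F} for d_{r,\mathbf{t}_n} is also an
% \epsilon-net for d_{q,\mathbf{t}_n}, and the covering numbers are ordered accordingly.
```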

First case: $d_{\mathcal{G}}\in(0,2)$

This is the only case where Pollard’s entropy condition [16] is satisfied. For this case we could directly use Dudley’s integral formula (Formula 33 in [9]); however, to optimize the constants, we start from (1) and upper bound it by an integral as follows.

Apply (5) and (4) in sequence to the right-hand side of (1) and use Hypothesis 1 to get

[TABLE]

Letting $\displaystyle{\alpha(d_{\mathcal{G}})=\frac{2}{2-d_{\mathcal{G}}}}$ , we obtain

[TABLE]

Taking $N\rightarrow\infty$ , we can upper bound the last expression as

[TABLE]

Denote $K=14M_{\mathcal{G}}\log^{\frac{1}{7}}_{2}\left(2C\right)/\gamma$ and let us now compute the integral

[TABLE]

Set $\epsilon=K^{\frac{2-d_{\mathcal{G}}}{2}}e^{-t^{2}}$ . Then,

[TABLE]

Applying the integration by parts formula, we obtain

[TABLE]

Consequently,

[TABLE]

Second case: $d_{\mathcal{G}}\geqslant 2$

In this case, we apply (5) and (3) to (1) and use Hypothesis 1 to get

[TABLE]

Unlike the first case, we now control the number of steps $N$ in (6) through the parameters of interest, $C$ and $m$ . The aim is to optimize the dependencies with respect to them while making sure that (i) $N$ is a strictly positive integer and (ii) as $m\rightarrow\infty$ , $N\rightarrow\infty$ .

Now, if $d_{\mathcal{G}}=2$ , set $\alpha(d_{\mathcal{G}})=1$ . Thus, from (6), we have

[TABLE]

Setting $\displaystyle{N=\left\lceil\log_{2}\left(\sqrt{\frac{m}{C}}\right)\right\rceil}$ and bounding the series, we obtain

[TABLE]

For the final case, $d_{\mathcal{G}}>2$ , we set $\displaystyle{\alpha(d_{\mathcal{G}})=\frac{2}{d_{\mathcal{G}}-2}}$ in (6) and bound the geometric series:

[TABLE]

Now, let $\displaystyle{N=\left\lceil\frac{d_{\mathcal{G}}-2}{2d_{\mathcal{G}}}\log_{2}\left(\frac{m}{\log^{2d_{\mathcal{G}}}_{2}(2C)^{\frac{1}{d_{\mathcal{G}}}}}\right)\right\rceil}$ . Note that, with the assumption $m\geqslant C^{1.2}$ , $m>\log^{2d_{\mathcal{G}}}_{2}(2C)^{\frac{1}{d_{\mathcal{G}}}}$ for all $d_{\mathcal{G}}>2$ and thus, $N$ is a strictly positive integer. Applying it to (7), we get

[TABLE]

∎

5 Conclusions

We derived a sharper risk bound for multi-category margin classifiers following the pathway of [9]. In this pathway, the first capacity measure that appears in the control term of the guaranteed risk is a Rademacher complexity. It is then related to the metric entropy through the chaining method. Using a decomposition for metric entropy, we transition from the multi-class setting to the bi-class one. Finally, a combinatorial bound gives an estimate on the metric entropy in terms of the combinatorial dimension. The metric entropy bound used in [9] is the $L_{2}$-norm one of [12], which in this paper we generalized to $L_{p}$-norms with integer $p>2$. This generalization resulted in an improved dependency on the number $C$ of categories compared to [9], without worsening the dependency on either the sample size $m$ or the margin parameter $\gamma$.

So far, to get an explicit dependency on $C$ under minimal learnability assumptions, a transition from the multi-class case to the bi-class one has been performed at the level of one of two capacity measures. When it is performed at the level of a Rademacher complexity, a linear dependency on $C$ is obtained [8]. In this paper, as in [9], we showed that, by postponing it to the level of the metric entropy, this dependency can be improved to a sublinear one. The case that remains to be studied is a decomposition at the level of a combinatorial dimension, more precisely, at that of the fat-shattering dimension. The goal is to complete the picture of the impact that performing a decomposition at the level of one of three different capacity measures has on the dependencies on $C$, $m$ and $\gamma$.

Appendix A Proof of Theorem 2

Let $\mathcal{T}_{n}=\{t_{i}:1\leqslant i\leqslant n\}\subset\mathcal{T}$ and $\boldsymbol{t}_{n}=(t_{i})_{1\leqslant i\leqslant n}$. Let $\mathcal{F}_{\epsilon}$ be a subset of $\mathcal{F}$ of maximal cardinality that is $\epsilon$-separated with respect to the pseudo-metric $d_{p,\boldsymbol{t}_{n}}$. By definition, $|\mathcal{F}_{\epsilon}|=\mathcal{M}\left(\epsilon,\mathcal{F},d_{p,\boldsymbol{t}_{n}}\right)=\mathcal{M}\left(\epsilon,\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{n}},d_{p,\boldsymbol{t}_{n}}\right)=|\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{n}}\!|$, where $\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{n}}$ denotes the class $\mathcal{F}_{\epsilon}$ whose domain is restricted to $\mathcal{T}_{n}$. We distinguish three major steps in the proof: i) discretize the functions in the set $\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{n}}$, ii) demonstrate that the set of discretized functions is separated, and iii) upper bound the cardinality of the discretized set. The purpose of discretizing the set of real-valued functions is to reduce the original problem to one that can be addressed by combinatorial means: we upper bound the packing number of the discretized set, which is then related to that of the original set via step ii).
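To make the notion of an $\epsilon$-separated subset concrete, here is a minimal sketch for a finite class represented by its values on $\mathcal{T}_{n}$. It assumes the normalized empirical pseudo-metric $d_{p,\boldsymbol{t}_{n}}(f,f^{\prime})=\left(\frac{1}{n}\sum_{i}|f(t_{i})-f^{\prime}(t_{i})|^{p}\right)^{1/p}$ and takes "separated" to mean pairwise distance at least $\epsilon$; both conventions and all function names are assumptions of this illustration. The greedy procedure returns a subset that is maximal with respect to inclusion, which only lower bounds the packing number; it is not guaranteed to have maximal cardinality.

```python
from itertools import combinations


def d_p(f, g, p):
    """Empirical L_p pseudo-metric between two functions given by their
    values on t_1, ..., t_n (assumed normalization: average over n)."""
    n = len(f)
    return (sum(abs(a - b) ** p for a, b in zip(f, g)) / n) ** (1.0 / p)


def is_separated(funcs, eps, p):
    """True if every pair of distinct elements is at distance >= eps."""
    return all(d_p(f, g, p) >= eps for f, g in combinations(funcs, 2))


def greedy_separated_subset(funcs, eps, p):
    """Greedily build an eps-separated subset that cannot be extended.
    Its size is a lower bound on the packing number M(eps, F, d_p)."""
    chosen = []
    for f in funcs:
        if all(d_p(f, g, p) >= eps for g in chosen):
            chosen.append(f)
    return chosen


if __name__ == "__main__":
    F = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.0, 0.9)]
    sub = greedy_separated_subset(F, eps=0.5, p=3)
    print(sub, is_separated(sub, 0.5, 3))
```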

(a) Let $\epsilon^{\prime}=4(4K_{p})^{1/p}$ , $\displaystyle{\eta=\frac{\epsilon}{\epsilon^{\prime}+2}}$ and $N=\left\lfloor 2M_{\mathcal{F}}/\eta\right\rfloor$ . Define the class $\tilde{\mathcal{F}}^{\eta}$ of functions from $\mathcal{T}_{n}$ into $\left[\left[\hskip 2.15277pt0,N\hskip 2.15277pt\right]\right]$ obtained by the discretization of functions in $\mathcal{F}_{\epsilon}$ in the following way:

[TABLE]

We claim that with such a discretization, for any $\tilde{f}_{1},\tilde{f}_{2}\in\tilde{\mathcal{F}}^{\eta}$ , $d_{p,\boldsymbol{t}_{n}}\left(\tilde{f}_{1},\tilde{f}_{2}\right)\geqslant\epsilon^{\prime}$ . Using $|\lfloor a\rfloor-\lfloor b\rfloor|^{p}\geqslant(\max(0,|a-b|-1))^{p}$ for all $a,b\in\mathbb{R}_{+}$ ,

[TABLE]

where $I$ denotes the set of indices $i$ such that $\displaystyle{\frac{1}{\eta}\left|f_{1}(t_{i})-f_{2}(t_{i})\right|\geqslant 1}$. Next, by the reverse triangle inequality, $d_{p,\boldsymbol{t}_{n}}(f_{1},f_{2})\geqslant d_{p,\boldsymbol{t}_{n}}(f_{1},0)-d_{p,\boldsymbol{t}_{n}}(f_{2},0)$ for all $f_{1},f_{2}\in\mathcal{F}$, so the right-hand side of the above inequality can be bounded as

[TABLE]
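The floor inequality invoked at the start of this step, $|\lfloor a\rfloor-\lfloor b\rfloor|\geqslant\max(0,|a-b|-1)$ for $a,b\in\mathbb{R}_{+}$, follows from $\lfloor a\rfloor>a-1$ and $\lfloor b\rfloor\leqslant b$; raising both non-negative sides to the power $p$ preserves the order. The following sketch merely spot-checks it on random inputs (it is a sanity check, not a proof; the function names, the sampling range and the seed are arbitrary choices made for this illustration).

```python
import math
import random


def floor_gap_holds(a: float, b: float) -> bool:
    """Check |floor(a) - floor(b)| >= max(0, |a - b| - 1) for one pair."""
    lhs = abs(math.floor(a) - math.floor(b))
    rhs = max(0.0, abs(a - b) - 1.0)
    return lhs >= rhs


def check_floor_inequality(trials: int = 10_000, scale: float = 50.0, seed: int = 0) -> bool:
    """Spot-check the inequality on `trials` random pairs in [0, scale]."""
    rng = random.Random(seed)
    return all(
        floor_gap_holds(rng.uniform(0.0, scale), rng.uniform(0.0, scale))
        for _ in range(trials)
    )


if __name__ == "__main__":
    print(check_floor_inequality())
```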

Let $I^{c}$ denote the complement of $I$ . Now, by definition of $\mathcal{F}_{\epsilon}$ ,

[TABLE]

It follows that

[TABLE]

Applying the last inequality to (8) and using $((a-b)+b)\leqslant((a-b)^{1/p}+b^{1/p})^{p}$ with $a,b\in\mathbb{R}_{+}$ and $a\geqslant b$ (where we set $a=\left(\epsilon^{\prime}+2\right)^{p}$ and $b=1$ ), we get

[TABLE]

This proves our claim. Then, it follows that

[TABLE]

The major remaining step towards the claimed bound is to upper bound the right-hand side of (9). To this end, we appeal to Proposition 3. Let $d_{s}$ be the strong dimension of $\tilde{\mathcal{F}}^{\eta}$. By part (1) of Lemma 3.2 in [13],

[TABLE]

By Lemma 1 and the fact that $p\geqslant 3$ , on the other hand, we have

[TABLE]

We can plug this result in the upper bound on $d_{s}$ based on the fact that the fat-shattering dimension decreases with the scale:

[TABLE]

Now, according to Proposition 3,

[TABLE]

Applying Lemma 1 to the right-hand side of (10) and simplifying it we get

[TABLE]

We apply the relation (9) and the following well-known inequality [23]

[TABLE]

in sequence to the left-hand side of (11). Finally, to obtain the claimed result, we take supremum over $\boldsymbol{t}_{n}\in\mathcal{T}^{n}$ of both sides of the obtained bound.

(b) To derive a dimension-free combinatorial bound, we use the $L_{p}$-norm generalization of the probabilistic extraction principle: Lemma 8 of [9]. According to this lemma, there exists a subset $\mathcal{T}_{q}=\{t_{i_{k}}:1\leqslant k\leqslant q\}$ of $\mathcal{T}_{n}$ of cardinality

[TABLE]

such that $\mathcal{F}_{\epsilon}$ is $\epsilon_{1}=\epsilon/2^{\frac{p+1}{p}}$ -separated with respect to $d_{p,\boldsymbol{t}_{q}}$ , with $\boldsymbol{t}_{q}=(t_{i_{k}})_{1\leqslant k\leqslant q}$ . Let $\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{q}}$ denote the class $\mathcal{F}_{\epsilon}$ whose domain is restricted to $\mathcal{T}_{q}$ . We have

[TABLE]

We let $\displaystyle{\eta=\frac{\epsilon_{1}}{\epsilon^{\prime}+2}}$ and discretize the functions in the set $\left.\mathcal{F}_{\epsilon}\right|_{\mathcal{T}_{q}}$ in a similar way as in part (a):

[TABLE]

Applying the same procedure as in the proof of part (a), we obtain that for any $\tilde{f}_{1},\tilde{f}_{2}\in\tilde{\mathcal{F}}^{\eta}$ , $d_{p,\boldsymbol{t}_{q}}\left(\tilde{f}_{1},\tilde{f}_{2}\right)\geqslant\epsilon^{\prime}$ , and hence

[TABLE]

By Proposition 3,

[TABLE]

where $d_{s}$ is the strong dimension of $\tilde{\mathcal{F}}^{\eta}$. Plugging in the value of $N$ and performing computations similar to those in Inequalities (10)-(11) of part (a), we get

[TABLE]

Now, we go back from the discretized set $\tilde{\mathcal{F}}^{\eta}$ to $\mathcal{F}_{\epsilon}$ using relations (14) and (15), which yield $|\mathcal{F}_{\epsilon}|\leqslant|\tilde{\mathcal{F}}^{\eta}|$. Using this and Inequality (13) in (16) gives:

[TABLE]

Now, based on $\displaystyle{\ln(u)<\sqrt{u}}$ and by a straightforward computation,

[TABLE]
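For completeness, the inequality $\ln(u)<\sqrt{u}$ for all $u>0$ can be checked by a short calculus argument (a standard fact, not taken from the source). Let $g(u)=\sqrt{u}-\ln(u)$. Then

$\displaystyle{g^{\prime}(u)=\frac{1}{2\sqrt{u}}-\frac{1}{u}=\frac{\sqrt{u}-2}{2u},}$

so $g$ is decreasing on $(0,4)$ and increasing on $(4,\infty)$, and its minimum is $g(4)=2-\ln(4)\approx 0.61>0$. Hence $\sqrt{u}>\ln(u)$ on the whole of $(0,\infty)$.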

Next, we bound $d_{s}$ using part (1) of Lemma 3.2 in [13] and Lemma 1:

[TABLE]

Plugging this into (17) and applying Lemma 1 to $K_{p}$ , we obtain

[TABLE]

The claim follows from the application of $|\mathcal{F}_{\epsilon}|=\mathcal{M}(\epsilon,\mathcal{F},d_{p,\boldsymbol{t}_{n}})$ , Inequality (12) and taking supremum over $\boldsymbol{t}_{n}\in\mathcal{T}^{n}$ of both sides of the obtained bound.

Appendix B Technical Results

Lemma 1.

For all $p\in\mathbb{N}^{*}\setminus\left\{1,2\right\}$ ,

[TABLE]

Proof.

By Formula (8.5) in [24, page 119],

[TABLE]

where $\psi_{p}(u)=\sum\limits_{j=0}^{p-1}(-1)^{j}\binom{p}{j+1}(u+1)^{j}\psi_{(p-1)-j}(u)$ is an Eulerian polynomial in $u$ of degree $p-1$ with $\psi_{0}(u)=\psi_{1}(u)=1$ (see page 116 in [24] for explicit form of this polynomial for smaller values of $p$ ). Thus for $u=2$ ,

[TABLE]

We now show by induction that for all $p>2$ , $\displaystyle{\psi_{p}(-2)<\frac{p^{p}}{2}}$ . By definition,

[TABLE]

For the base case, $p=3$ , it is easily seen that $\psi_{3}(-2)<3^{3}/2$ . Now, assume for $k>3$ , $\psi_{k}(-2)<k^{k}/2$ . Then,

[TABLE]

We have that

[TABLE]

Applying it in (18), we obtain

[TABLE]

Now, by the binomial theorem, for all $k>1$ ,

[TABLE]

Consequently,

[TABLE]

where we used the convention that $\displaystyle{\forall k>n,\;{n\choose k}=0}$ . ∎

The results established hereafter generalize those in [12]. In the following, we denote $\displaystyle{K_{p}=\sum\limits_{k=1}^{\infty}\frac{k^{p}}{2^{k}}}$ with $p\in\mathbb{N}^{*}\setminus\left\{1,2\right\}$.
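To get a feel for the size of the constant $K_{p}$ and of the separation scale $4(4K_{p})^{1/p}$ used in Appendix A, the series can be evaluated numerically. The sketch below truncates the sum after a fixed number of terms (the tail is negligible in double precision for the small values of $p$ shown); the function names and the truncation level are assumptions of this illustration. The closed forms $K_{3}=26$ and $K_{4}=150$ follow from the classical generating functions of $\sum_{k}k^{3}x^{k}$ and $\sum_{k}k^{4}x^{k}$ evaluated at $x=1/2$.

```python
def K(p: int, terms: int = 400) -> float:
    """Truncated evaluation of K_p = sum_{k >= 1} k^p / 2^k."""
    return sum(k**p / 2**k for k in range(1, terms + 1))


def separation_scale(p: int) -> float:
    """The scale 4 * (4 * K_p)^(1/p) appearing in the separation arguments."""
    return 4.0 * (4.0 * K(p)) ** (1.0 / p)


if __name__ == "__main__":
    for p in range(3, 8):
        print(p, K(p), separation_scale(p))
```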

Lemma 2 (After Lemma 5 of [12]).

Let $X$ be a bounded random variable. Let $M_{p}(X)=\left(\mathbb{E}\left[|X|^{p}\right]\right)^{1/p}$. Then, there exist numbers $a\in\mathbb{R}$ and $\beta\in(0,1/2]$, such that

[TABLE]

or vice versa.

Proof.

The proof closely follows that of Lemma 5 of [12] where the variance of $X$ is replaced by its higher moments.

Divide $\mathbb{R}_{+}$ into the intervals $I_{k}$ of length $cM_{p}(X)$ with

[TABLE]

by setting

[TABLE]

Assume the lemma does not hold and let $(\beta_{i})_{i\geqslant 0}$ be a non-increasing sequence of non-negative numbers such that

[TABLE]

and

[TABLE]

For the conclusion of the lemma to fail it should hold that

[TABLE]

Now, assume that for some $k$ , $\beta_{k+1}>\beta_{k}/2$ and consider intervals

[TABLE]

and $J_{2}=(cM_{p}(X)(k+1),\infty)$ . Then,

[TABLE]

and

[TABLE]

By definition of $(\beta_{i})_{i\geqslant 0}$ and by our assumption, $1/2\geqslant\beta_{0}\geqslant\beta_{k}\geqslant\beta_{k+1}>\beta_{k}/2\geqslant 0$ , which means that $\beta_{k}\in(0,1/2]$ . Now, let $a$ be the middle point between the intervals $J_{1}$ and $J_{2}$ and let $\beta=\beta_{k}$ . We have that

[TABLE]

and

[TABLE]

Thus, the lemma holds. This proves (19). Now, by induction from (19) we get that

[TABLE]

We use it in the computation of $M^{p}_{p}(X)$ . By definition,

[TABLE]

By construction, whenever $t\in I_{k}$ , $\mathbb{P}\{X>t\}\leqslant\mathbb{P}\{X>cM_{p}(X)k\}=\mathbb{P}\{X\in\bigcup\limits_{l\geqslant k}I_{l}\}=\sum\limits_{l\geqslant k}\left(\beta_{l}-\beta_{l+1}\right)=\beta_{k}.$ Thus,

[TABLE]

By a similar procedure, it can be proved that

[TABLE]

This produces a contradiction $M^{p}_{p}(X)<M^{p}_{p}(X)/2+M^{p}_{p}(X)/2=M^{p}_{p}(X)$ proving the lemma. ∎

In the following, $\mathcal{T}=\{t_{i}:1\leqslant i\leqslant n\}$ is a finite set and $\mathbf{t}_{n}=\left(t_{i}\right)_{1\leqslant i\leqslant n}$ .

Lemma 3 (After Lemma 6 of [12]).

Let $\mathcal{F}$ be a finite class of functions from $\mathcal{T}$ into $\left[0,M_{\mathcal{F}}\right]$ with $M_{\mathcal{F}}\in\mathbb{R}_{+}$ and $\left|\mathcal{F}\right|>1$ . Assume that for some $\epsilon\in(0,M_{\mathcal{F}}]$ , $\mathcal{F}$ is $\epsilon$ -separated in the pseudo-metric $d_{p,\mathbf{t}_{n}}$ . Then there exist $i\in\left[\left[\hskip 2.15277pt1,n\hskip 2.15277pt\right]\right]$ , $a\in\mathbb{R}$ and $\beta\in(0,1/2]$ such that

[TABLE]

with $p_{1}\geqslant\frac{\beta}{2}$ and $p_{2}\geqslant 1-\beta$ or vice versa.

Proof.

$\mathcal{F}$ can be viewed as a finite probability space $\left(\mathcal{F},\mathcal{A},P_{\mathcal{F}}\right)$ with a uniform probability measure $P_{\mathcal{F}}\left(A\right)=|A|/|\mathcal{F}|$ for any $A\in\mathcal{A}$ . Then, for any two random elements $f,f^{\prime}\in\mathcal{F}$ selected independently according to $P_{\mathcal{F}}$ ,

[TABLE]

By the Minkowski inequality, for any $i\in\left[\left[\hskip 2.15277pt1,n\hskip 2.15277pt\right]\right]$ ,

[TABLE]

Taking it into account in the formula above, we obtain,

[TABLE]

Now, the event that the realizations of $f$ and $f^{\prime}$ are different elements in $\mathcal{F}$ happens with probability $1-1/|\mathcal{F}|$ . Then, by the separation assumption on $\mathcal{F}$ we have

[TABLE]

Thus,

[TABLE]

It means that there exists $i\in\left[\left[\hskip 2.15277pt1,n\hskip 2.15277pt\right]\right]$ , such that

[TABLE]

Next, we apply Lemma 2 to the random element $f$ and take into account that

[TABLE]

and that

[TABLE]

Then, it follows that

[TABLE]

and, similarly,

[TABLE]

Finally, the claim follows from the definition of $P_{\mathcal{F}}$ . ∎

The results given in the sequel call for the introduction of the definition of the $\epsilon$ -separating tree.

Definition 10.

Let $\mathcal{F}$ be a class of functions on $\mathcal{T}$. A tree $T(\mathcal{F})$ is a finite collection of subsets of $\mathcal{F}$ such that any two of its elements are either disjoint or one of them contains the other. A son of $\bar{\mathcal{F}}\in T(\mathcal{F})$ is a maximal (with respect to inclusion) proper subset of $\bar{\mathcal{F}}$ that belongs to $T(\mathcal{F})$. An element of $T(\mathcal{F})$ with no sons is called a leaf. Let $\epsilon>0$. If every $\bar{\mathcal{F}}\in T(\mathcal{F})$ which is not a leaf has exactly two sons $\bar{\mathcal{F}}_{+},\bar{\mathcal{F}}_{-}$ and

[TABLE]

then $T(\mathcal{F})$ is an $\epsilon$ -separating tree.

Proposition 1 (After Proposition 8 in [12]).

Let $\mathcal{F}$ be a finite class of functions from $\mathcal{T}$ into $\left[0,M_{\mathcal{F}}\right]$ with $M_{\mathcal{F}}\in\mathbb{R}_{+}$ . Assume that for some $\epsilon\in(0,M_{\mathcal{F}}]$ , $\mathcal{F}$ is $\epsilon$ -separated in the pseudo-metric $d_{p,\mathbf{t}_{n}}$ . Then, there is a $\epsilon/4(4K_{p})^{1/p}$ -separating tree of $\mathcal{F}$ with at least $|\mathcal{F}|^{1/2}$ leaves.

Proof.

By Lemma 3, $\mathcal{F}$ has two subsets $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$ such that

[TABLE]

which implies

[TABLE]

The rest of the proof is based on induction on the cardinality of $\mathcal{F}$ and is exactly as in [12], except that the tree is now $\epsilon/4(4K_{p})^{1/p}$ -separated. ∎

Proposition 2 (After Proposition 10 in [12]).

Let $\mathcal{F}$ be a class of functions from $\mathcal{T}$ into a finite set $B$ of integers. Let $S\subseteq{\mathcal{T}}$ and let $v:S\rightarrow B$ . The number of pairs $(S,v)$ strongly shattered by $\mathcal{F}$ is at least the number of leaves in any $1$ -separating tree of $\mathcal{F}$ .
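The quantity counted in Proposition 2 can be computed by brute force for a very small integer-valued class. The sketch below assumes the usual reading of strong shattering: a pair $(S,v)$ is strongly shattered if, for every sign pattern on $S$, some function in the class is at least $v+1$ on the "+" points and at most $v-1$ on the "-" points. It also counts the empty pair for a non-empty class. These conventions, the tuple representation of functions and the function names are assumptions of this illustration, and the enumeration is exponential in $n$.

```python
from itertools import combinations, product


def strongly_shatters(funcs, S, v):
    """True if every sign pattern on the index set S is realized by some f:
    f[i] >= v + 1 on '+' indices and f[i] <= v - 1 on '-' indices."""
    for signs in product((+1, -1), repeat=len(S)):
        realized = any(
            all(
                (f[i] >= w + 1) if s > 0 else (f[i] <= w - 1)
                for i, w, s in zip(S, v, signs)
            )
            for f in funcs
        )
        if not realized:
            return False
    return True


def count_strongly_shattered(funcs, n, values):
    """Count the pairs (S, v), S a subset of {0, ..., n-1} and v: S -> values,
    that are strongly shattered by funcs (tuples of n integers)."""
    count = 0
    for k in range(n + 1):
        for S in combinations(range(n), k):
            for v in product(values, repeat=k):
                if strongly_shatters(funcs, S, v):
                    count += 1
    return count


if __name__ == "__main__":
    F = [(0, 0), (2, 0), (0, 2), (2, 2)]
    print(count_strongly_shattered(F, n=2, values=range(3)))
```

In this toy example the class has four elements and four strongly shattered pairs (the empty pair, the two singletons with witness value 1, and the full set with witness $(1,1)$), which is consistent with the lower bound $|\mathcal{F}|^{1/2}$ of Corollary 2.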

Proof.

The proof follows exactly the one of Proposition 10 in [12], with a few minor technical changes. Let $\bar{\mathcal{F}}$ be a node in a $1$ -separating tree of $\mathcal{F}$ . Let $N(A)$ denote the number of pairs strongly shattered by a set $A$ . For the proof it suffices to show that if $\bar{\mathcal{F}}_{+}$ and $\bar{\mathcal{F}}_{-}$ are two sons of $\bar{\mathcal{F}}$ , then

[TABLE]

By definition of the $1$ -separating tree, there exists $i_{0}\in\left[\left[\hskip 2.15277pt1,n\hskip 2.15277pt\right]\right]$ such that

[TABLE]

It follows that

[TABLE]

If a pair is strongly shattered either by $\bar{\mathcal{F}}_{+}$ or $\bar{\mathcal{F}}_{-}$ , then it is also strongly shattered by $\bar{\mathcal{F}}$ . On the other hand, if a pair $(S,v)$ is strongly shattered both by $\bar{\mathcal{F}}_{+}$ and $\bar{\mathcal{F}}_{-}$ , then $t_{i_{0}}\not\in S$ . Otherwise, there would exist $(f^{\prime}_{+},f^{\prime}_{-})\in\left(\bar{\mathcal{F}}_{+},\bar{\mathcal{F}}_{-}\right)$ satisfying $f^{\prime}_{+}(t_{i_{0}})\leqslant v(t_{i_{0}})-1$ and $f^{\prime}_{-}(t_{i_{0}})\geqslant v(t_{i_{0}})+1$ . Combining it with (21) yields a contradiction:

[TABLE]

Now, consider a pair $\left(S\cup\{t_{i_{0}}\},v^{\prime}\right)$, where $v^{\prime}(t_{i})=v(t_{i})$ for all $t_{i}\in S$ and $v^{\prime}(t_{i_{0}})=b$. This pair is shattered by $\bar{\mathcal{F}}$, but by neither $\bar{\mathcal{F}}_{+}$ nor $\bar{\mathcal{F}}_{-}$. Since $S$ is shattered by both $\bar{\mathcal{F}}_{+}$ and $\bar{\mathcal{F}}_{-}$, it follows from (21) that

[TABLE]

similarly,

[TABLE]

This proves the claim that $\bar{\mathcal{F}}$ shatters the pair $\left(S\cup\{t_{i_{0}}\},v^{\prime}\right)$. Therefore, in both cases we get (20). ∎

The next result is obtained by combining Propositions 1 and 2.

Corollary 2 (After Corollary 11 in [12]).

Let $\mathcal{F}$ be a class of functions from $\mathcal{T}$ into a finite set $B$ of integers. Let $S\subseteq{\mathcal{T}}$ and let $v:S\rightarrow B$ . If $\mathcal{F}$ is $4(4K_{p})^{1/p}$ -separated in the pseudo-metric $d_{p,\mathbf{t}_{n}}$ , then it strongly shatters at least $|\mathcal{F}|^{1/2}$ pairs $(S,v)$ .

Proposition 3 (After Proposition 12 in [12]).

Let $\mathcal{F}$ be a class of functions from $\mathcal{T}$ into $\left[\left[\hskip 2.15277pt0,b\hskip 2.15277pt\right]\right]$ . Let $d_{s}=S\mbox{-dim}(\mathcal{F})$ . Assume $\mathcal{F}$ is $4(4K_{p})^{1/p}$ -separated in the pseudo-metric $d_{p,\mathbf{t}_{n}}$ . Then for any $d\geqslant d_{s}$ ,

[TABLE]

Proof.

By Corollary 2, $\mathcal{F}$ strongly shatters at least $|\mathcal{F}|^{1/2}$ pairs $(S,v)$ . On the other hand, the total number of such pairs for which the cardinality of $S$ is at most $d_{s}$ is bounded above by

[TABLE]

To see this, note that there are at most $\displaystyle{{n\choose k}}$ sets $S$ of size $k$ and, for each such $S$, the number of functions $v$ is bounded above by $b^{k}$. Therefore,

[TABLE]

The proof is completed by bounding the right-hand side of the above inequality in a standard way as follows:

[TABLE]

where we used the convention that for all $k>n$ , $\displaystyle{{n\choose k}=0}$ . ∎
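The counting step in this proof is easy to reproduce numerically. The sketch below evaluates $\sum_{k=0}^{d}{n\choose k}b^{k}$ exactly and compares it with the classical estimate $(ebn/d)^{d}$, which holds for $1\leqslant d\leqslant n$ and $b\geqslant 1$ because $\sum_{k\leqslant d}{n\choose k}\leqslant(en/d)^{d}$. This classical estimate is used here only for illustration and is not claimed to be the exact form of the bound stated in the proposition; the function names are hypothetical.

```python
from math import comb, e


def count_pairs_upper(n: int, b: int, d: int) -> int:
    """Exact value of sum_{k=0}^{d} C(n, k) * b^k.
    math.comb returns 0 for k > n, matching the convention C(n, k) = 0."""
    return sum(comb(n, k) * b**k for k in range(d + 1))


def classical_estimate(n: int, b: int, d: int) -> float:
    """The estimate (e * b * n / d)^d, valid for 1 <= d <= n and b >= 1."""
    return (e * b * n / d) ** d


if __name__ == "__main__":
    for n, b, d in [(20, 3, 4), (50, 2, 5), (100, 10, 3)]:
        print(n, b, d, count_pairs_upper(n, b, d), classical_estimate(n, b, d))
```

Combined with Corollary 2, any such upper bound $U$ on the number of pairs gives $|\mathcal{F}|^{1/2}\leqslant U$, that is, $|\mathcal{F}|\leqslant U^{2}$.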

Bibliography

[1] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, Inc., New York, 1998.
[2] P. Bartlett, The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (2) (1998) 525–536.
[3] A. Daniely, S. Sabato, S. Ben-David, S. Shalev-Shwartz, Multiclass learnability and the ERM principle, in: COLT’11, 2011, pp. 207–232.
[4] Ü. Doğan, T. Glasmachers, C. Igel, A unified view on multi-class support vector classification, Journal of Machine Learning Research 17 (45) (2016) 1–32.
[5] A. Kontorovich, R. Weiss, Maximum margin multiclass nearest neighbors, in: ICML’14, 2014.
[6] T. Zhang, Statistical analysis of some multi-category large margin classification methods, Journal of Machine Learning Research 5 (2004) 1225–1251.
[7] Y. Lei, Ü. Doğan, A. Binder, M. Kloft, Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms, in: NIPS 28, 2015, pp. 2026–2034.
[8] V. Kuznetsov, M. Mohri, U. Syed, Multi-class deep boosting, in: NIPS 27, 2014, pp. 2501–2509.
