Rademacher Complexity and Generalization Performance of Multi-category Margin Classifiers
Khadija Musayeva (ABC), Fabien Lauer (ABC), Yann Guermeur (ABC)

TL;DR
This paper derives a new risk bound for multi-category margin classifiers, improving the dependency on the number of categories by using Rademacher complexity and a novel combinatorial metric entropy bound.
Contribution
It introduces a new combinatorial metric entropy bound that enhances the theoretical understanding of generalization in multi-category margin classifiers.
Findings
Improved risk bounds with better dependency on the number of categories
Linking Rademacher complexity to metric entropy via chaining
Enhanced theoretical guarantees under minimal assumptions
Abstract
One of the main open problems in the theory of multi-category margin classification is the form of the optimal dependency of a guaranteed risk on the number $C$ of categories, the sample size $m$ and the margin parameter $\gamma$. From a practical point of view, the theoretical analysis of generalization performance contributes to the development of new learning algorithms. In this paper, we focus only on the theoretical aspect of the question posed. More precisely, under minimal learnability assumptions, we derive a new risk bound for multi-category margin classifiers. We improve the dependency on $C$ over the state of the art when the margin loss function considered satisfies the Lipschitz condition. We start with the basic supremum inequality that involves a Rademacher complexity as a capacity measure. This capacity measure is then linked to the metric entropy through the chaining method. In this context, our improvement is based on the introduction of a new combinatorial metric entropy bound.
1 Introduction
Although the theory of binary pattern classification is well established [1, 2], the theory of multi-category classification is far from complete. Research in this area addresses problems such as the sample-complexity analysis of empirical risk minimization algorithms [3], or the consistency analysis of multi-class loss functions and of specific families of classifiers [4]. Another open question is the optimal dependency of guaranteed risks of multi-category classifiers on the number $C$ of categories and the sample size $m$. This is all the more true for problems that involve a large number of classes. When the classifiers considered are margin classifiers, which take their decisions based on a score per category, the dependency on the margin parameter $\gamma$ also becomes relevant to the characterization of their generalization performance. While this question has mainly been studied for specific families of classifiers, such as nearest neighbors [5], kernel methods [6, 7] and decision trees [8], tackling it under minimal learnability assumptions remains a challenging task. This paper focuses on obtaining guaranteed risks under such assumptions.
The first step in the derivation of risk bounds is the choice of the margin loss function. Two families of margin loss functions can be distinguished: indicator margin loss functions and those that satisfy the Lipschitz condition. Deriving guaranteed risks with the optimal dependency on the parameters of interest is relatively straightforward in the first case [9]. The family of Lipschitz continuous loss functions, on the other hand, offers a richer setting for this task. In this case, one can obtain a guaranteed risk whose control term involves a Rademacher complexity [10]. Then a sequence of transitions between capacity measures is performed. More precisely, using the chaining method one can control the Rademacher complexity of a function class through the sum of its metric entropies [11]. A combinatorial bound is then used to estimate the metric entropy of the class in terms of its combinatorial dimension. In this sequence of transitions, one can choose the capacity measure at the level of which to reduce the multi-class problem to an ensemble of bi-class ones, that is, to perform a decomposition. By performing the decomposition at the level of the Rademacher complexity, a linear dependency on $C$ was obtained in [8]. This dependency has been improved to a sublinear one in [9] by postponing the decomposition to the level of the metric entropy.
In this paper, we follow exactly the pathway of [9]. Our contribution is based on the following line of reasoning. Theorem 7 of [9] provides a sublinear (but still close to linear) dependency on $C$ using a decomposition result for metric entropies (Lemma 1 of [9]) in $L_p$-norm with $p = 2$ and the combinatorial metric entropy bound of [12]. On the other hand, using the decomposition result with $p = \infty$ and the $L_\infty$-norm metric entropy bound of [13], one can obtain a radical dependency on $C$, although at the expense of a degraded dependency on $m$. Hence, we consider the values of $p$ in between these two extreme ones, and extend the $L_2$-norm bound of [12] to $L_p$-norms with integer $p > 2$. When applied in the chaining, this extension results in an improved dependency on $C$ over that of Theorem 7 of [9]. Specifically, we obtain a radical dependency on $C$ (up to logarithmic factors) without worsening the dependencies on $m$ and $\gamma$.
The organization of the paper is as follows. In the next section, we introduce the theoretical framework and describe the transitions between the capacity measures. Then, Section 3 gives the new combinatorial metric entropy bound, whose proof can be found in Appendix A. In Section 4, we demonstrate how this result can be applied in the chaining to derive an improved upper bound on the Rademacher complexity. Conclusions and ongoing research are highlighted in Section 5. All intermediate results used in the proofs are collected in Appendix B.
Notation
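We denote the set of strictly positive reals by $\mathbb{R}_+^*$, and let $\mathbb{N}^* = \mathbb{N} \setminus \{0\}$. $[\![ n_1; n_2 ]\!]$ stands for the set of integers from $n_1$ to $n_2$. $\mathbb{1}_{\{E\}}$ stands for the indicator function of the event $E$, such that $\mathbb{1}_{\{E\}} = 1$ if $E$ occurs, and $0$ otherwise. $\lfloor u \rfloor$ is the greatest integer less than or equal to $u$, and $\lceil u \rceil$ is the smallest integer greater than or equal to $u$.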
2 Theoretical Framework
We consider -category pattern classification problems with . Each object is represented by its description and the categories belong to . We assume that and are measurable spaces. Denote by the product sigma-algebra on . We assume that the link between descriptions and categories can be characterized by an unknown probability measure on the measurable space . Let be a random pair with values in , distributed according to . The available information on is limited to an -sample distributed according to . In the following, we distinguish the sample size from the generic notation which stands for a number of points in a set that needs not be a realization of a random sample.
We consider multi-category margin classifiers that take their decisions based on a score per category and focus on those that implement classes of functions with values in a hypercube of $\mathbb{R}^C$ (thus, in contrast to [7], no correlation assumption is made on the component functions). Most well-known classifiers, such as neural networks [14], support vector machines [4], and nearest neighbors [5], are margin classifiers.
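Definition 1 (Multi-category margin classifiers).
Let $\mathcal{G}$ be a class of functions $g = (g_k)_{1 \le k \le C}$ from $\mathcal{X}$ into $[-M_{\mathcal{G}}, M_{\mathcal{G}}]^C$ with $M_{\mathcal{G}} \in [1, +\infty)$. For each $g \in \mathcal{G}$, $d_g$ is a multi-category margin classifier such that for all $x \in \mathcal{X}$, $d_g(x) = \operatorname{argmax}_{1 \le k \le C} g_k(x)$, breaking ties with a dummy category $*$.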
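The decision rule of Definition 1 can be sketched in a few lines. This is only an illustration: the names `decide` and `DUMMY`, the 0-based category indices, and the encoding of the dummy category as `-1` are assumptions made for the example, not notation from the paper.

```python
import numpy as np

DUMMY = -1  # hypothetical encoding of the dummy category returned on ties


def decide(scores):
    """Return the index of the strictly largest score, or DUMMY on a tie.

    `scores` holds one score per category, i.e. the vector (g_k(x))_k.
    """
    scores = np.asarray(scores, dtype=float)
    winners = np.flatnonzero(scores == scores.max())
    # A unique maximiser is the predicted category; otherwise the tie is
    # broken with the dummy category, which never matches a true label.
    return int(winners[0]) if winners.size == 1 else DUMMY


print(decide([0.2, 0.9, -0.4]))  # category with the largest score
print(decide([0.5, 0.5, 0.1]))   # tie between the two top scores
```

Returning a value that matches no true label is what makes a tie count as an error.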
To sidestep the complications that might arise from the measurability of a supremum of an uncountable set, we assume that the classes , and in general, all sets of functions considered in the sequel satisfy the “image admissibility Suslin” condition [15, page 101].
The classification performance of margin classifiers can be characterized based on the following functions.
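Definition 2 (Class of margin functions).
Let $\mathcal{G}$ be as in Definition 1. For any $g \in \mathcal{G}$, the margin function $\rho_g$ is
$$\forall (x, k) \in \mathcal{X} \times \{1, \ldots, C\}, \quad \rho_g(x, k) = \frac{1}{2} \left( g_k(x) - \max_{l \neq k} g_l(x) \right).$$
Then, we define $\rho_{\mathcal{G}} = \{ \rho_g : g \in \mathcal{G} \}$.
Given a pair $(x, y)$, $d_g$ misclassifies $x$ if $d_g(x) \neq y$, or equivalently, if $\rho_g(x, y) \leq 0$. The goal of the learning process is to minimize the probability of error, or risk, over $\mathcal{G}$.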
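As a small numerical illustration, the margin function can be computed directly from a score vector. The sketch below assumes the form $\rho_g(x, k) = \frac{1}{2}(g_k(x) - \max_{l \neq k} g_l(x))$; the factor $\frac{1}{2}$ is an assumption of this example, and the function name `margin` and the 0-based indices are hypothetical. The sign of the margin, which is what decides misclassification, does not depend on that factor.

```python
import numpy as np


def margin(scores, k):
    """Margin of category k: half the gap between its score and the best other score.

    Assumes at least two categories. The 1/2 normalisation is an assumption
    of this sketch; it keeps the margin in the same range as the scores.
    """
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, k)  # scores of all categories except k
    return 0.5 * (scores[k] - others.max())


scores = [0.2, 0.9, -0.4]
print(margin(scores, 1))  # positive: category 1 is predicted
print(margin(scores, 0))  # negative: an example labelled 0 is misclassified
print(margin([0.5, 0.5, 0.1], 0))  # zero on a tie, counted as an error
```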
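Definition 3 (Risk $L$).
Let $\mathcal{G}$ be as in Definition 1. Let $\phi$ be the standard indicator loss function defined as
$$\forall t \in \mathbb{R}, \quad \phi(t) = \mathbb{1}_{\{t \leq 0\}}.$$
For any $g \in \mathcal{G}$, its risk is
$$L(g) = \mathbb{E}_{(X, Y)} \left[ \phi\left( \rho_g(X, Y) \right) \right] = \mathbb{P}\left( d_g(X) \neq Y \right).$$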
To make use of the values of the functions $\rho_g$ (and not just of their signs) in the assessment of the classification performance, we appeal to the following margin loss function.
Definition 4 (Parameterized truncated hinge loss function $\phi_\gamma$).
For any $\gamma \in (0, 1]$, the parameterized truncated hinge loss function $\phi_\gamma$ is defined as
$$\forall t \in \mathbb{R}, \quad \phi_\gamma(t) = \mathbb{1}_{\{t \leq 0\}} + \left( 1 - \frac{t}{\gamma} \right) \mathbb{1}_{\{t \in (0, \gamma]\}}.$$
It is clear from the definition that $\phi_\gamma$ dominates the standard indicator loss function $\phi$ given in Definition 3 and that it is Lipschitz continuous. Observe that when this loss function is applied to $\rho_g$, the values of the latter strictly above $\gamma$ and below zero become irrelevant to the estimation of the classification accuracy. Taking advantage of this fact, we introduce functions $\rho_{g, \gamma}$ by restricting the codomain of $\rho_g$ to $[0, \gamma]$ for all $\gamma \in (0, 1]$. In [8], a partial restriction is the main source of improvement upon the result of [10] in terms of the dependency on $C$. The use of the set of functions $\rho_{g, \gamma}$ leads to an even finer bound, this time in terms of the diameter of the function class, as we switch from $M_{\mathcal{G}}$ to $\gamma$.
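The three properties used in this paragraph (domination of the indicator loss, Lipschitz continuity, and insensitivity to margin values outside $[0, \gamma]$) can be checked numerically. The sketch assumes the usual form of the truncated hinge loss, equal to $1$ for $t \leq 0$, to $1 - t/\gamma$ on $(0, \gamma]$ and to $0$ above $\gamma$; the function names are hypothetical.

```python
import numpy as np


def indicator_loss(t):
    """Standard indicator loss: 1 where the margin is non-positive, else 0."""
    return (np.asarray(t, dtype=float) <= 0).astype(float)


def truncated_hinge(t, gamma):
    """Truncated hinge loss: 1 for t <= 0, 1 - t/gamma on (0, gamma], 0 above gamma."""
    t = np.asarray(t, dtype=float)
    return np.clip(1.0 - t / gamma, 0.0, 1.0)


gamma = 0.5
t = np.linspace(-2.0, 2.0, 401)
loss = truncated_hinge(t, gamma)

# The hinge loss dominates the indicator loss pointwise.
print(bool(np.all(loss >= indicator_loss(t))))
# Largest slope on the grid, to compare with the Lipschitz constant 1/gamma.
print(float(np.max(np.abs(np.diff(loss) / np.diff(t)))), 1.0 / gamma)
# Truncating the margins to [0, gamma] leaves the loss unchanged.
print(bool(np.allclose(truncated_hinge(np.clip(t, 0.0, gamma), gamma), loss)))
```

The last check is the reason the analysis can work with the truncated margin functions, whose range has width $\gamma$.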
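Definition 5 (Class of truncated margin functions).
Let $\mathcal{G}$ be a class of functions satisfying Definition 2. Fix $\gamma \in (0, 1]$. For any $g \in \mathcal{G}$, we define $\rho_{g, \gamma}$ as
$$\forall (x, k) \in \mathcal{X} \times \{1, \ldots, C\}, \quad \rho_{g, \gamma}(x, k) = \max\left( 0, \min\left( \gamma, \rho_g(x, k) \right) \right)$$
and $\rho_{\mathcal{G}, \gamma} = \{ \rho_{g, \gamma} : g \in \mathcal{G} \}$.
For any $g \in \mathcal{G}$, its risk $L(g)$ can be upper bounded by the margin risk $L_\gamma(g)$ obtained on the basis of the loss function $\phi_\gamma$. It is the $m$-sample based estimate of $L_\gamma(g)$ that appears in our guaranteed risk.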
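Definition 6 (Margin risk $L_\gamma$ and empirical margin risk $L_{\gamma, m}$).
Let $\mathcal{G}$ be a class of functions satisfying Definition 1. Let $\phi_\gamma$ be as in Definition 4. Then, for $\gamma \in (0, 1]$, the margin risk associated with any $g \in \mathcal{G}$ is
$$L_\gamma(g) = \mathbb{E}_{(X, Y)} \left[ \phi_\gamma\left( \rho_g(X, Y) \right) \right].$$
Its $m$-sample based estimate is the empirical margin risk defined as
$$L_{\gamma, m}(g) = \frac{1}{m} \sum_{i=1}^{m} \phi_\gamma\left( \rho_g(X_i, Y_i) \right),$$
where $(X_i, Y_i)$, $1 \leq i \leq m$, are the elements of the $m$-sample.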
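For concreteness, the empirical margin risk can be computed from a matrix of scores as follows. This is a minimal sketch under stated assumptions: the margin is taken as half the gap between the score of the true category and the best other score, the loss is the truncated hinge loss with parameter $\gamma$, labels are 0-based, and the function name is hypothetical.

```python
import numpy as np


def empirical_margin_risk(scores, labels, gamma):
    """Average truncated hinge loss of the margins over a sample.

    scores: array of shape (m, C), one row of category scores per example.
    labels: array of m true categories in {0, ..., C-1}.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(scores.shape[0])
    own = scores[rows, labels]              # score of the true category
    masked = scores.copy()
    masked[rows, labels] = -np.inf          # exclude the true category from the max
    rho = 0.5 * (own - masked.max(axis=1))  # margins (1/2 factor assumed)
    return float(np.mean(np.clip(1.0 - rho / gamma, 0.0, 1.0)))


scores = [[0.9, 0.1, 0.0],   # margin 0.4 >= gamma: loss 0
          [0.2, 0.6, 0.4],   # margin 0.1 in (0, gamma): loss 0.5
          [0.3, 0.1, 0.5]]   # margin -0.1 <= 0: loss 1
labels = [0, 1, 0]
print(empirical_margin_risk(scores, labels, gamma=0.2))
```

In this toy sample one example out of three is misclassified, so the empirical error rate is 1/3, while the second example is correctly classified but with a margin smaller than $\gamma$ and still contributes to the margin risk.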
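In what follows, we give the definitions of the capacity measures we use and outline the transitions between them, which are at the basis of the derivation of our result. We use $\mathcal{F}$ to denote a uniformly bounded class of functions on a generic measurable space $\mathcal{T}$. First, we recall the definition of the Rademacher complexity.
Definition 7 (Rademacher complexity).
Let $P_T$ be a probability measure on $\mathcal{T}$ and $\mathbf{T}_n = (T_i)_{1 \leq i \leq n}$ a sequence of random variables with values in $\mathcal{T}$, independently distributed according to $P_T$. Let $\boldsymbol{\sigma}_n = (\sigma_i)_{1 \leq i \leq n}$ be a Rademacher sequence, i.e., a sequence of independent random variables uniformly distributed in $\{-1, 1\}$. Then, the empirical Rademacher complexity of $\mathcal{F}$ given $\mathbf{T}_n$ is defined as
$$\hat{R}_n(\mathcal{F}) = \mathbb{E}_{\boldsymbol{\sigma}_n} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(T_i) \,\middle|\, \mathbf{T}_n \right]$$
and its Rademacher complexity is
$$R_n(\mathcal{F}) = \mathbb{E}_{\mathbf{T}_n} \left[ \hat{R}_n(\mathcal{F}) \right].$$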
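For a finite class evaluated on a handful of points, the empirical Rademacher complexity can be computed exactly by enumerating all sign vectors. The sketch below assumes the standard definition, namely the expectation over the Rademacher sequence of $\sup_f \frac{1}{n} \sum_i \sigma_i f(t_i)$; the function name and the matrix layout are choices made for the example, and the cost grows as $2^n$.

```python
import itertools

import numpy as np


def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class.

    values[j, i] is the value of the j-th function at the i-th point.
    All 2^n sign vectors are enumerated, so n must stay small.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        # Supremum over the class of the signed empirical average.
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n


print(empirical_rademacher([[1.0, 1.0]]))                # single function
print(empirical_rademacher([[1.0, 1.0], [-1.0, -1.0]]))  # a function and its negation
```

A single function cannot correlate with random signs on average, whereas a class containing a function and its negation already can, which is the sense in which this quantity measures capacity.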
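The capacity measures central in the derivation of our result are covering/packing numbers. Their definitions require the introduction of the following empirical pseudo-metrics: for any functions $f, f'$ in the class $\mathcal{F}$ and any sequence $\mathbf{t}_n = (t_i)_{1 \leq i \leq n}$ of points of its domain $\mathcal{T}$,
$$\forall p \in [1, +\infty), \quad d_{p, \mathbf{t}_n}(f, f') = \left( \frac{1}{n} \sum_{i=1}^{n} \left| f(t_i) - f'(t_i) \right|^p \right)^{\frac{1}{p}}, \qquad d_{\infty, \mathbf{t}_n}(f, f') = \max_{1 \leq i \leq n} \left| f(t_i) - f'(t_i) \right|.$$
Definition 8 (Covering numbers, metric entropy, packing numbers).
The $L_p$-norm $\epsilon$-covering number of $\mathcal{F}$, $\mathcal{N}(\epsilon, \mathcal{F}, d_{p, \mathbf{t}_n})$, is the smallest cardinality of the $\epsilon$-nets of $\mathcal{F}$, i.e., subsets $\bar{\mathcal{F}} \subseteq \mathcal{F}$ such that for every $f \in \mathcal{F}$ there exists $\bar{f} \in \bar{\mathcal{F}}$ such that $d_{p, \mathbf{t}_n}(f, \bar{f}) < \epsilon$. The logarithm of $\mathcal{N}(\epsilon, \mathcal{F}, d_{p, \mathbf{t}_n})$ is the metric entropy of $\mathcal{F}$. A subset of $\mathcal{F}$ is $\epsilon$-separated with respect to $d_{p, \mathbf{t}_n}$ if, for any two distinct elements $f, f'$ of it, $d_{p, \mathbf{t}_n}(f, f') \geq \epsilon$. The $\epsilon$-packing number of $\mathcal{F}$, $\mathcal{M}(\epsilon, \mathcal{F}, d_{p, \mathbf{t}_n})$, is the maximal cardinality of its $\epsilon$-separated subsets. The uniform covering and packing numbers are
$$\mathcal{N}_p(\epsilon, \mathcal{F}, n) = \sup_{\mathbf{t}_n \in \mathcal{T}^n} \mathcal{N}(\epsilon, \mathcal{F}, d_{p, \mathbf{t}_n})$$
and
$$\mathcal{M}_p(\epsilon, \mathcal{F}, n) = \sup_{\mathbf{t}_n \in \mathcal{T}^n} \mathcal{M}(\epsilon, \mathcal{F}, d_{p, \mathbf{t}_n}),$$
respectively.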
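The link between separated subsets and nets can be seen on a toy finite class. The sketch below is an illustration only: it assumes the empirical $L_p$ pseudo-metric defined as the $p$-th root of the average of $|f(t_i) - f'(t_i)|^p$ (maximum for $p = \infty$), a separation condition of the form "distance at least $\epsilon$" and a net condition of the form "distance strictly below $\epsilon$". The function names are hypothetical. A greedy pass builds a separated subset that cannot be extended, so its size lies between the covering number and the packing number at scale $\epsilon$.

```python
import numpy as np


def empirical_distance(f, g, p):
    """Empirical L_p pseudo-metric between two vectors of function values."""
    diff = np.abs(np.asarray(f, dtype=float) - np.asarray(g, dtype=float))
    if np.isinf(p):
        return float(diff.max())
    return float(np.mean(diff ** p) ** (1.0 / p))


def greedy_separated_subset(values, eps, p=2):
    """Indices of an eps-separated subset that cannot be extended.

    Every function left out is at distance < eps from a chosen one, so the
    chosen functions also form an eps-net of the class.
    """
    chosen = []
    for j, f in enumerate(values):
        if all(empirical_distance(f, values[i], p) >= eps for i in chosen):
            chosen.append(j)
    return chosen


values = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 0.9]]
print(greedy_separated_subset(values, eps=0.5, p=2))
# The pseudo-metrics are ordered in p, which is why the choice of norm matters.
print([empirical_distance([0, 0], [1, 0], p) for p in (1, 2, float("inf"))])
```

The greedy subset is not guaranteed to attain the packing number; it only gives a lower bound on it and an upper bound on the covering number.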
The capacity measures appearing last in our bounds are combinatorial dimensions. They provide useful information about whether the class of interest uniformly satisfies the classical limit theorems [16].
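Definition 9 (Fat-shattering dimension [17], strong dimension [13]).
For $\epsilon > 0$, a subset $\{t_1, \ldots, t_n\}$ of the domain $\mathcal{T}$ is said to be $\epsilon$-shattered by the class $\mathcal{F}$ if there is a function $b : \{t_1, \ldots, t_n\} \rightarrow \mathbb{R}$ such that, for every vector $\mathbf{s}_n = (s_i)_{1 \leq i \leq n} \in \{-1, 1\}^n$, there is a function $f_{\mathbf{s}_n} \in \mathcal{F}$ satisfying
$$\forall i \in \{1, \ldots, n\}, \quad s_i \left( f_{\mathbf{s}_n}(t_i) - b(t_i) \right) \geq \epsilon.$$
The fat-shattering dimension of $\mathcal{F}$ at scale $\epsilon$, $\epsilon\text{-dim}(\mathcal{F})$, is the maximal cardinality of a subset of $\mathcal{T}$ $\epsilon$-shattered by $\mathcal{F}$, if such a maximum exists. Otherwise, $\epsilon\text{-dim}(\mathcal{F}) = \infty$. For a class of integer-valued functions, the notion of strong dimension, $\text{S-dim}(\mathcal{F})$, is obtained from the definition of the fat-shattering dimension by setting $\epsilon = 1$ and restricting the co-domain of $b$ to $\mathbb{Z}$.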
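The shattering condition can be tested by brute force for a finite class and a given witness function. The sketch below assumes the standard condition: for every sign vector there must be a function in the class lying at least `scale` above the witness where the sign is $+1$ and at least `scale` below it where the sign is $-1$. The function name and arguments are hypothetical, and the witness is supplied by the caller rather than searched for.

```python
import itertools

import numpy as np


def is_shattered(values, points, witness, scale):
    """Check whether the given points are shattered at the given scale.

    values[j, i]: value of the j-th function at the i-th point.
    points: indices of the candidate subset.
    witness: one threshold b(t_i) per selected point.
    """
    values = np.asarray(values, dtype=float)
    gaps = values[:, points] - np.asarray(witness, dtype=float)  # f(t_i) - b(t_i)
    for signs in itertools.product((-1.0, 1.0), repeat=len(points)):
        # Some function must realise this sign pattern with the required gap.
        if not np.any(np.all(np.array(signs) * gaps >= scale, axis=1)):
            return False
    return True


cube = [[0, 0], [0, 1], [1, 0], [1, 1]]  # all 0/1 patterns on two points
print(is_shattered(cube, [0, 1], witness=[0.5, 0.5], scale=0.5))
print(is_shattered(cube, [0, 1], witness=[0.5, 0.5], scale=0.6))
print(is_shattered(cube[:3], [0, 1], witness=[0.5, 0.5], scale=0.5))
```

Restricting the witness to integers and fixing the scale to $1$ turns the same check into one for strong shattering of an integer-valued class.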
As in [9, 18, 19], we make the hypothesis that the fat-shattering dimensions of the classes , , grow no faster than polynomially with .
Hypothesis 1.
Let be a class of functions satisfying Definition 1. We assume that there exists a pair such that
[TABLE]
Among the well-known examples of classifiers that satisfy such an assumption are support vector machines with (Theorem 4.6 in [20]) and feedforward neural networks with for layers (Corollary 27 in [2]). It should be noted that Lipschitz classifiers, such as nearest neighbours also satisfy this assumption as demonstrated by Corollary 4 in [21]. Depending on the growth rate , our assumptions regarding the data are summarized in Table 1.
Our starting point is the following basic supremum inequality that bounds the risk by the empirical margin risk plus a control term based on a Rademacher complexity.
Theorem 1 (Theorem 5 in [9]).
Let be a class of functions satisfying Definition 1. For , let be the class of functions deduced from according to Definition 2. For fixed and , with probability at least ,
[TABLE]
We perform the following sequence of transitions between the capacity measures to derive our result. First, we relate the empirical Rademacher complexity of to its metric entropy through the chaining method (see [11]). More precisely, we use the following formulation of the chaining bound due to [9]:
[TABLE]
where and is a decreasing function satisfying . Next, using Lemma 1 in [9], we decompose the metric entropy of in terms of the ones of the classes :
[TABLE]
where . Finally, our combinatorial bound derived below gives an estimate on the metric entropies of the classes in terms of their fat-shattering dimensions.
3 $L_p$-norm Combinatorial Metric Entropy Bound
We extend the $L_2$-norm metric entropy bound of [12] to $L_p$-norms with integer $p > 2$. The bound of [12] does not depend on the sample size thanks to the use of the probabilistic extraction principle. In our extension we derive two bounds. In one of them, we keep the dependency on the sample size, and in the other, we remove it using the $L_p$-norm generalization of the aforementioned principle. Under Hypothesis 1, depending on the growth rate of the fat-shattering dimension, the application of one or the other bound in the chaining allows us to optimize the dependency on $C$ while not degrading the ones on $m$ and $\gamma$, as will be seen in Section 4.
Specifically, we have the following $L_p$-norm metric entropy bounds, whose proof is given in Appendix A.
Theorem 2.
Let be a class of functions from into with . For , let . For all values of and ,
(a) if , then
[TABLE]
(b) if , then
[TABLE]
From (2) one can see that, based on , the dependency on in the scale of covering numbers can be eliminated for all . The combination of the decomposition formula (2) with Theorem 2 using for yields the following result.
Corollary 1.
Let be a class of functions as in Definition 1. For , let be the class of functions deduced from according to Definition 5. For , let . Then, for and ,
[TABLE]
and
[TABLE]
Proof.
Inequality (3) follows from the application of (2) and part (a) of Theorem 2 (where we drop from the denominator inside the logarithm as it is greater than one), along with the fact that and . We obtain Inequality (4) in a similar way using part (b) of Theorem 2 instead. ∎
4 Bound on the Rademacher complexity
As it was noted in [18], under Hypothesis 1, the growth rate of the fat-shattering dimension has a dramatic effect on the behavior of the Rademacher complexity of the function class. The availability of two kinds of metric entropy bounds allows us to adapt to this impact in the chaining so as to optimize the dependency on without worsening those on and . Under the aforementioned hypothesis, two cases can be distinguished. For , the formula (1) can be upper bounded by an integral and the use of the dimension-free bound (4) leads to the optimized result. For , such a result is obtained from the application of (3) in (1). The second case can also be characterized by the fact that there is a freedom in the choice of the number of steps to construct the chaining. To optimize this construction when , we make the non-restrictive assumption that is greater than a small power of .
Theorem 3.
Let be a class of functions as in Definition 1. For , let be the class of functions deduced from according to Definition 2. Then, under Hypothesis 1, there is a function such that for all ,
[TABLE]
Compared to Theorem 7 of [9], one can see that in all three cases, the dependency on is improved: the powers of are replaced by powers of without losing in the dependencies on and . It is interesting to note that, in the third case, when , which is true for instance for feedforward neural networks (see Corollary 27 in [2]), the dependency on is slightly better than radical. This is, however, at the cost of the constant factor .
Proof of Theorem 3.
For all , we set with for all in (1). In the following, we use the relation
[TABLE]
which follows directly from the fact that
[TABLE]
First case:
This is the only case where Pollard’s entropy condition [16] is satisfied. For this case we could directly use Dudley’s integral formula (Formula 33 in [9]), however, to optimize with respect to constants, we start from (1) and upper bound it by an integral in the following way.
Apply (5) and (4) in sequence to the right-hand side of (1) and use Hypothesis 1 to get
[TABLE]
Letting , we obtain
[TABLE]
Taking , we can upper bound the last expression as
[TABLE]
Denote and let us now compute the integral
[TABLE]
Set . Then,
[TABLE]
Applying the integration by parts formula, we obtain
[TABLE]
Consequently,
[TABLE]
Second case:
In this case, we apply (5) and (3) to (1) and use Hypothesis 1 to get
[TABLE]
Unlike the first case, we now control the number of steps in (6) through the parameters of interest, and . The aim is to optimize the dependencies with respect to them while making sure that (i) is a strictly positive integer and (ii) as , .
Now, if , set . Thus, from (6), we have
[TABLE]
Setting and bounding the series, we obtain
[TABLE]
For the final case, , we set in (6) and bound the geometric series:
[TABLE]
Now, let . Note that, with the assumption , for all and thus, is a strictly positive integer. Applying it to (7), we get
[TABLE]
∎
5 Conclusions
We derived a sharper risk bound for multi-category margin classifiers following the pathway of [9]. In this pathway, the first capacity measure that appears in the control term of the guaranteed risk is a Rademacher complexity. It is then related to the metric entropy through the chaining method. Using a decomposition for metric entropy, we transition from the multi-class setting to the bi-class one. Finally, a combinatorial bound gives an estimate on the metric entropy in terms of the combinatorial dimension. The metric entropy bound used in [9] is the $L_2$-norm one of [12], which in this paper we generalized to $L_p$-norms with integer $p > 2$. This generalization resulted in an improved dependency on the number $C$ of categories compared to [9], without worsening the dependency on the sample size $m$ or the one on the margin parameter $\gamma$.
So far, to get an explicit dependency on $C$ under minimal learnability assumptions, a transition from the multi-class case to the bi-class one has been performed at the level of one of two capacity measures. By realizing it at the level of a Rademacher complexity, a linear dependency on $C$ was obtained in [8]. In this paper, as in [9], we showed that by postponing it to the level of the metric entropy, this dependency can be improved to a sublinear one. The case that remains to be studied is a decomposition at the level of a combinatorial dimension, more precisely, at that of the fat-shattering dimension. The goal is to complete the picture of the impact that performing a decomposition at the level of one of three different capacity measures has on the dependencies on $C$, $m$ and $\gamma$.
Appendix A Proof of Theorem 2
Let and . Let be an -separated with respect to the pseudo-metric subset of of maximal cardinality. By definition, , where denotes the class whose domain is restricted to . We distinguish three major steps in the proof: i) discretize functions in the set , ii) demonstrate that the set of discretized functions is separated, and iii) upper bound the cardinality of the discretized set. The purpose of discretizing the set of real-valued functions is to reduce the original problem into the one that can be addressed by combinatorial means: we upper bound the packing number of the discretized set which is then related to that of the original set via the step (ii).
(a) Let , and . Define the class of functions from into obtained by the discretization of functions in in the following way:
[TABLE]
We claim that with such a discretization, for any , . Using for all ,
[TABLE]
where denotes the set of indices such that , for all . Next, by the inverse triangle inequality, for all , the right-hand side of the above inequality can be bounded as
[TABLE]
Let denote the complement of . Now, by definition of ,
[TABLE]
It follows that
[TABLE]
Applying the last inequality to (8) and using with and (where we set and ), we get
[TABLE]
This proves our claim. Then, it follows that
[TABLE]
The major step that remains to perform to arrive at the claimed bound is to upper bound the right-hand side of (9). To this end, we appeal to Proposition 3. Let be the strong dimension of . By part (1) of Lemma 3.2 in [13],
[TABLE]
By Lemma 1 and the fact that , on the other hand, we have
[TABLE]
We can plug this result in the upper bound on based on the fact that the fat-shattering dimension decreases with the scale:
[TABLE]
Now, according to Proposition 3,
[TABLE]
Applying Lemma 1 to the right-hand side of (10) and simplifying it we get
[TABLE]
We apply the relation (9) and the following well-known inequality [23]
[TABLE]
in sequence to the left-hand side of (11). Finally, to obtain the claimed result, we take supremum over of both sides of the obtained bound.
(b) To derive a dimension-free combinatorial bound we use the -norm generalization of probabilistic extraction principle: Lemma 8 of [9]. According to this lemma, there exists a subset of of cardinality
[TABLE]
such that is -separated with respect to , with . Let denote the class whose domain is restricted to . We have
[TABLE]
We let and discretize the functions in the set in a similar way as in part (a):
[TABLE]
Applying the same procedure as in the proof of part (a), we obtain that for any , , and hence
[TABLE]
By Proposition 3,
[TABLE]
where is the strong dimension of . Plugging the value of and performing similar computations as in Inequalities (10)-(11) of part (a), we get
[TABLE]
Now, we go back from the discretized set to using the relations (14) and (15) which yield: . Using it and Inequality (13) in (16) give:
[TABLE]
Now, based on and by a straightforward computation,
[TABLE]
Next, we bound using part (1) of Lemma 3.2 in [13] and Lemma 1:
[TABLE]
Plugging this into (17) and applying Lemma 1 to , we obtain
[TABLE]
The claim follows from the application of , Inequality (12) and taking supremum over of both sides of the obtained bound.
Appendix B Technical Results
Lemma 1.
For all ,
[TABLE]
Proof.
By Formula (8.5) in [24, page 119],
[TABLE]
where is an Eulerian polynomial in of degree with (see page 116 in [24] for explicit form of this polynomial for smaller values of ). Thus for ,
[TABLE]
We now show by induction that for all , . By definition,
[TABLE]
For the base case, , it is easily seen that . Now, assume for , . Then,
[TABLE]
We have that
[TABLE]
Applying it in (18), we obtain
[TABLE]
Now, by the binomial theorem, for all ,
[TABLE]
Consequently,
[TABLE]
where we used the convention that . ∎
The results demonstrated hereafter are the generalizations of those in [12]. In the following, we denote with .
Lemma 2 (After Lemma 5 of [12]).
Let be a bounded random variable. Let . Then, there exist numbers and , such that
[TABLE]
or vice versa.
Proof.
The proof closely follows that of Lemma 5 of [12] where the variance of is replaced by its higher moments.
Divide into the intervals of length with
[TABLE]
by setting
[TABLE]
Assume the lemma does not hold and let be a non-increasing sequence of non-negative numbers such that
[TABLE]
and
[TABLE]
For the conclusion of the lemma to fail it should hold that
[TABLE]
Now, assume that for some , and consider intervals
[TABLE]
and . Then,
[TABLE]
and
[TABLE]
By definition of and by our assumption, , which means that . Now, let be the middle point between the intervals and and let . We have that
[TABLE]
and
[TABLE]
Thus, the lemma holds. This proves (19). Now, by induction from (19) we get that
[TABLE]
We use it in the computation of . By definition,
[TABLE]
By construction, whenever , Thus,
[TABLE]
By a similar procedure, it can be proved that
[TABLE]
This produces a contradiction proving the lemma. ∎
In the following, is a finite set and .
Lemma 3 (After Lemma 6 of [12]).
Let be a finite class of functions from into with and . Assume that for some , is -separated in the pseudo-metric . Then there exist , and such that
[TABLE]
with and or vice versa.
Proof.
can be viewed as a finite probability space with a uniform probability measure for any . Then, for any two random elements selected independently according to ,
[TABLE]
By the Minkowski inequality, for any ,
[TABLE]
Taking it into account in the formula above, we obtain,
[TABLE]
Now, the event that the realizations of and are different elements in happens with probability . Then, by the separation assumption on we have
[TABLE]
Thus,
[TABLE]
It means that there exists , such that
[TABLE]
Next, we apply Lemma 2 to the random element and take into account that
[TABLE]
and that
[TABLE]
Then, it follows that
[TABLE]
and, similarly,
[TABLE]
Finally, the claim follows from the definition of . ∎
The results given in the sequel call for the introduction of the definition of the -separating tree.
Definition 10.
Let be a class of functions on . A tree is a finite collection of subsets of , such that any two of its elements are either disjoint or one of them contains the other. A son of is its maximal (with respect to inclusion) proper subset. An element of with no sons is called a leaf. Let . If every which is not a leaf has exactly two sons and
[TABLE]
then is an -separating tree.
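The purely set-theoretic part of Definition 10 (nested-or-disjoint elements, sons, leaves) can be illustrated on a small example. In the sketch below, the elements of the sets stand for indices of functions in a finite class. The function names are hypothetical, and the separation condition on the two sons, which depends on the pseudo-metric, is not checked.

```python
def is_tree(collection):
    """True if any two sets are either disjoint or nested."""
    sets = [frozenset(s) for s in collection]
    return all(a.isdisjoint(b) or a <= b or b <= a for a in sets for b in sets)


def sons(collection, parent):
    """Maximal (for inclusion) proper subsets of `parent` within the collection."""
    parent = frozenset(parent)
    proper = [s for s in {frozenset(x) for x in collection} if s < parent]
    return [s for s in proper if not any(s < t for t in proper)]


def leaves(collection):
    """Elements of the collection that have no sons."""
    return [s for s in {frozenset(x) for x in collection} if not sons(collection, s)]


tree = [{1, 2, 3, 4}, {1, 2}, {3, 4}, {1}, {2}]
print(is_tree(tree))
print(sorted(sorted(s) for s in sons(tree, {1, 2, 3, 4})))
print(sorted(sorted(s) for s in leaves(tree)))
print(is_tree([{1, 2}, {2, 3}]))  # overlapping but not nested
```

In this example every element that is not a leaf has exactly two sons, as the definition requires, and the tree has three leaves.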
Proposition 1 (After Proposition 8 in [12]).
Let be a finite class of functions from into with . Assume that for some , is -separated in the pseudo-metric . Then, there is a -separating tree of with at least leaves.
Proof.
By Lemma 3, has two subsets and such that
[TABLE]
which implies
[TABLE]
The rest of the proof is based on induction on the cardinality of and is exactly as in [12], except that the tree is now -separated. ∎
Proposition 2 (After Proposition 10 in [12]).
Let be a class of functions from into a finite set of integers. Let and let . The number of pairs strongly shattered by is at least the number of leaves in any -separating tree of .
Proof.
The proof follows exactly the one of Proposition 10 in [12], with a few minor technical changes. Let be a node in a -separating tree of . Let denote the number of pairs strongly shattered by a set . For the proof it suffices to show that if and are two sons of , then
[TABLE]
By definition of the -separating tree, there exists such that
[TABLE]
It follows that
[TABLE]
If a pair is strongly shattered either by or , then it is also strongly shattered by . On the other hand, if a pair is strongly shattered both by and , then . Otherwise, there would exist satisfying and . Combining it with (21) yields a contradiction:
[TABLE]
Now, consider a pair , where for all and . This pair is shattered by , but neither by or . As is shattered both by and , then from (21) it follows that,
[TABLE]
similarly,
[TABLE]
It proves the claim that shatters the pair . Therefore, in both cases we get (20). ∎
The next result is obtained by combining Propositions 1 and 2.
Corollary 2 (After Corollary 11 in [12]).
Let be a class of functions from into a finite set of integers. Let and let . If is -separated in the pseudo-metric , then it strongly shatters at least pairs .
Proposition 3 (After Proposition 12 in [12]).
Let be a class of functions from into . Let . Assume is -separated in the pseudo-metric . Then for any ,
[TABLE]
Proof.
By Corollary 2, strongly shatters at least pairs . On the other hand, the total number of such pairs for which the cardinality of is at most is bounded above by
[TABLE]
To see this, note that there are at most number of sets of size and for each such the number of functions is bounded above by . Therefore,
[TABLE]
The proof is completed by bounding the right-hand side of the above inequality in a standard way as follows:
[TABLE]
where we used the convention that for all , . ∎
References
- [1] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, Inc., New York, 1998.
- [2] P. Bartlett, The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network, IEEE Transactions on Information Theory 44 (2) (1998) 525–536.
- [3] A. Daniely, S. Sabato, S. Ben-David, S. Shalev-Shwartz, Multiclass learnability and the ERM principle, in: COLT’11, 2011, pp. 207–232.
- [4] Ü. Doğan, T. Glasmachers, C. Igel, A unified view on multi-class support vector classification, Journal of Machine Learning Research 17 (45) (2016) 1–32.
- [5] A. Kontorovich, R. Weiss, Maximum margin multiclass nearest neighbors, in: ICML’14, 2014.
- [6] T. Zhang, Statistical analysis of some multi-category large margin classification methods, Journal of Machine Learning Research 5 (2004) 1225–1251.
- [7] Y. Lei, Ü. Doğan, A. Binder, M. Kloft, Multi-class SVMs: From tighter data-dependent generalization bounds to novel algorithms, in: NIPS 28, 2015, pp. 2026–2034.
- [8] V. Kuznetsov, M. Mohri, U. Syed, Multi-class deep boosting, in: NIPS 27, 2014, pp. 2501–2509.
