The Geometry of Mixability
Armando J. Cabrera Pacheco, Robert C. Williamson

TL;DR
This paper offers a geometric perspective on mixable loss functions, characterizing their properties through differential geometry and superprediction sets, which unifies binary and multi-class cases.
Contribution
It introduces a geometric characterization of mixability for proper loss functions using superprediction sets, providing a coordinate-free framework that unifies binary and multi-class scenarios.
Findings
A proper loss is mixable exactly when the superprediction set of a suitably scaled version of the loss slides freely inside the superprediction set of the log loss.
The geometric approach applies under fairly general differentiability assumptions on the loss.
The results reconcile previous characterizations for the binary and multi-class cases.
Armando J. Cabrera Pacheco
Universität Tübingen, Tübingen AI Center
and
Robert C. Williamson
Universität Tübingen, Tübingen AI Center
Abstract.
Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set $\operatorname{spr}(\eta\ell)$ of the scaled loss function $\eta\ell$ slides freely inside the superprediction set $\operatorname{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a “coordinate-free” manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.
1. Introduction
In the context of prediction with expert advice as described by Vovk in [Vov98] and [Vov01], an information game is considered between three players: the learner, the experts and nature. At each step $t$,
- each expert makes a prediction, which the learner is allowed to see,
- the learner makes a prediction,
- nature chooses an outcome,
- for a fixed loss function $\ell$, the cumulative loss is calculated for the learner and for each of the experts.
The goal is to minimize the difference between the learner’s loss and the best expert’s loss, which is often called the regret.
1.1. Mixable games and characterizations of mixable and fundamental loss functions
For a wide class of games, called $\eta$-mixable games for $\eta > 0$, the Aggregating Algorithm (see for example [Vov01]) ensures an optimal bound for the regret, independent of the trial $t$. Since the mixability of a game depends on the loss function $\ell$, a loss function is called $\eta$-mixable if the corresponding game is $\eta$-mixable. Since the Aggregating Algorithm is arguably one of the most well-founded and well-studied prediction algorithms, there is a natural interest in understanding properties and characterizations of mixable loss functions.
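To make the protocol concrete, here is a minimal sketch (not taken from the paper) of the game for binary outcomes under the log loss. For this loss, the Aggregating Algorithm with learning rate 1 is known to reduce to predicting with the weighted mixture of the experts' predictions. The function name, the synthetic experts and the random outcomes are illustrative assumptions. With $N$ experts, the regret of this learner stays below $\ln N$ at every trial.

```python
import numpy as np

def aggregating_algorithm_log_loss(expert_probs, outcomes):
    """Mixture forecaster (AA with eta = 1) under the binary log loss.

    expert_probs: array (T, N); expert_probs[t, k] is the probability that
                  expert k assigns to outcome 1 at trial t (strictly in (0, 1)).
    outcomes:     array (T,) with entries in {0, 1}.
    Returns the learner's cumulative loss and each expert's cumulative loss.
    """
    T, N = expert_probs.shape
    log_w = np.full(N, -np.log(N))          # log of the uniform prior weights
    learner_loss = 0.0
    expert_loss = np.zeros(N)
    for t in range(T):
        w = np.exp(log_w - log_w.max())     # normalise in a numerically stable way
        w /= w.sum()
        p = float(w @ expert_probs[t])      # learner's prediction: weighted mixture
        y = outcomes[t]
        # Log loss of the learner and of every expert on the revealed outcome.
        learner_loss += -np.log(p if y == 1 else 1.0 - p)
        losses = -np.log(np.where(y == 1, expert_probs[t], 1.0 - expert_probs[t]))
        expert_loss += losses
        log_w -= losses                     # exponential-weights update with eta = 1
    return learner_loss, expert_loss

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
T, N = 200, 5
expert_probs = rng.uniform(0.05, 0.95, size=(T, N))
outcomes = rng.integers(0, 2, size=T)
learner_loss, expert_loss = aggregating_algorithm_log_loss(expert_probs, outcomes)
regret = learner_loss - expert_loss.min()
```

The bound does not depend on the number of trials `T`, which is the sense in which the regret bound is independent of the trial.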
Examples of mixable loss functions include the log loss, relative entropy for binary outcomes [HKW98] and the Brier score [VZ09, vERW12]. Mixability of a loss function $\ell$ is characterized by a “stronger convexity” of the superprediction set of $\ell$, which can be described as the convexity of the superprediction set of $\ell$ after an “exponential projection” (see Section 1.3 below, [Vov15] and [vERW12]). Unfortunately, this characterization of mixability lacks a transparent geometric interpretation.
The main goal of this work is to provide such a geometric interpretation. The motivation stems from an observation made by Vovk in [Vov15]: for binary outcomes, mixability of a strictly proper loss function can be characterized as the positivity of the infimum of the quotient of the curvatures of the loss and of the log loss. Here, as usual, loss functions are defined on the 2-simplex (see Section 1.3). Moreover, he then proves that fundamentality (see Vovk [Vov15]) of a loss can be characterized as the finiteness of the supremum of the same quotient of curvatures. These two results suggest that these properties are geometric, meaning that they can be studied using the tools of differential geometry, and in this regard, mixability and fundamentality should not depend on the coordinates chosen to express them.
Loosely speaking, in convex geometry a convex set $K$ is said to slide freely inside a convex set $L$ if for any point $x$ in the boundary of $L$ there is a translation vector $t$ such that the translation of $K$ by $t$ (i.e., the Minkowski sum $K + \{t\}$) intersects the boundary of $L$ at $x$, and $K + \{t\} \subset L$. We provide the following geometric characterization of mixability and fundamentality, as a geometric comparison to the log loss (see Figure 1). Let $\operatorname{spr}(\ell)$ denote the superprediction set of a loss function $\ell$ (see Section 1.3).
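As a toy illustration of the sliding condition (not an example from the paper), consider two disks centred at the origin: $K$ of radius `r` and $L$ of radius `R`. The sketch below checks, at sampled boundary points $x$ of $L$, that the candidate translation $t = (1 - r/R)\,x$ places $K$ so that it touches $x$ and stays inside $L$. The function name and the choice of translation are assumptions made for this illustration; only this one candidate translation is tested, which is sufficient for disks.

```python
import numpy as np

def disk_slides_freely(r, R, num_points=360):
    """Sliding condition for a disk K (radius r) inside a disk L (radius R),
    both centred at the origin, checked at sampled boundary points of L."""
    if r > R:
        return False  # no translate of the larger disk fits inside the smaller one
    angles = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    for a in angles:
        x = R * np.array([np.cos(a), np.sin(a)])        # boundary point of L
        t = (1.0 - r / R) * x                           # candidate translation vector
        touches = abs(np.linalg.norm(x - t) - r) <= 1e-9    # x lies on the boundary of K + t
        contained = np.linalg.norm(t) + r <= R + 1e-9        # K + t is contained in L
        if not (touches and contained):
            return False
    return True
```

In this toy case the condition reduces to `r <= R`, i.e. the inner set must be at least as curved as the outer one, which is the intuition behind comparing curvatures of loss curves.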
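Theorem 1.1 (Informal statement).
A twice continuously differentiable proper loss function $\ell$ is $\eta$-mixable for some $\eta > 0$ if and only if $\operatorname{spr}(\eta\ell)$ slides freely inside $\operatorname{spr}(\ell_{\log})$. In addition, the same $\ell$ is fundamental if and only if there exists $\gamma > 0$ such that $\operatorname{spr}(\ell_{\log})$ slides freely inside $\operatorname{spr}(\gamma\ell)$.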
To obtain the previous theorem it is necessary to re-interpret properness from a differential geometry point of view, which constitutes a large part of this work. However, this technical effort pays off. In [vERW12], van Erven, Reid and Williamson characterized $\eta$-mixable (differentiable) multi-class loss functions and, moreover, related the mixability constant to the Hessians of the Bayes risks of $\ell$ and of the log loss (see Definition 1.3), which are interpreted as curvatures. By generalizing the tools developed here for the binary case, we are able to obtain a multi-class analog of Theorem 1.1 and to build a bridge to the results in [vERW12].
1.2. Description of results and structure of the article
Using the same setting as [vERW12], we obtain a geometric characterization of $\eta$-mixable loss functions in the sense of differential geometry. Loss functions are considered to be maps $\ell$ defined on the simplex $\Delta^n$ with values in $\mathbb{R}^n$, which under the conditions assumed in this work give rise to submanifolds of $\mathbb{R}^n$ whose geometric properties are determined by $\ell$ (see the relevant precise definitions below). We first discuss the case $n = 2$ (binary classification loss functions) since it is more instructive, and then the case $n > 2$. We summarize the main results as follows.
- (1) We recast the notion of a (strictly) proper loss as a geometric property of the loss itself rather than of its superprediction set. That is, properness is no longer considered a parametrization-dependent property; it is a statement about the geometric properties of the “loss surface” (the boundary of the superprediction set). See Lemmas 2.7 and 3.2.
- (2) A geometric comparison is performed: for $n = 2$ in terms of the curvature of the “loss curves” (see Section 1.5 below), and for $n > 2$ in terms of the scalar second fundamental form of the “loss surfaces” (see Section 3 and Appendix A), which measure how they curve inside $\mathbb{R}^n$. The precise statements are given in Lemma 2.13 and Lemma 3.6. Intuitively, these results tell us how the superprediction set of $\ell$ sits inside the superprediction set of the log loss.
- (3) Finally, we interpret our result from the point of view of convex analysis to give a new characterization of mixability. More precisely, we show that a (strictly) proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set of $\eta\ell$ slides freely (see Definition 4.11) inside the superprediction set of the log loss.
As byproducts, we obtain a general way to define mixability with respect to a fixed (strictly) proper loss function, further properties and consequences for binary classification loss functions, particularly for composite losses and canonical links, and a bridge to the results obtained in [vERW12].
Since we treat loss functions from the point of view of differential geometry and convex geometry, a considerable background in these topics is needed. We have tried to make this work as self-contained as possible and spend some time providing the intuition and motivation for the results (and sometimes the background), which naturally results in a longer exposition. In Section 2 we treat the binary case and in Section 3 the multi-class case, obtaining in each the geometric interpretation of properness and mixability and performing the geometric comparison (in terms of curvature). In Section 4 we make the connections to convex geometry and obtain the geometric characterization of mixability in terms of the sliding freely condition on superprediction sets.
1.3. Setup
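Here we summarize our setup; for more details see [vERW12]. Denote by $[n]$ the set of natural numbers $\{1, \dots, n\}$. The set of probability distributions on a finite set with $n$ elements is given by
$$\Delta^n = \Bigl\{ p = (p_1, \dots, p_n) \in \mathbb{R}^n : p_i \ge 0 \text{ for all } i \in [n], \ \sum_{i=1}^n p_i = 1 \Bigr\}.$$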
We note that is a manifold with (non-smooth) boundary of dimension . Moreover, is a hypersurface in ; we denote the interior (as a manifold) of as which is the same set as the relative interior of . We define the standard parametrization of as the map given by
[TABLE]
In particular, when the standard parametrization of is the map given by .
Definition 1.2.
A loss function is a map such that for each , the map is continuous.
Given a loss function , and , the value represents the penalty of predicting upon observing . We define the partial losses of a loss function as the maps given by . A loss function can be described in terms of its partial losses as
[TABLE]
Thus, we can identify a loss function with the map determined by its partial losses
[TABLE]
In this work we follow this convention unless stated otherwise. Note that this way we can see a loss function as an embedding of into (assuming enough properties on ). We will see later that properness ensures the image of this embedding to be a nice hypersurface of with appealing geometric properties. Under the assumption that the outcomes are distributed with probability , we make the below definitions following [vERW12, RW10].
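Definition 1.3.
Given a loss function $\ell$ with partial losses $\ell_1, \dots, \ell_n$, we define the conditional risk as the map $L$ given, for $p, q \in \Delta^n$, by
$$L(p, q) = \sum_{i=1}^n p_i \, \ell_i(q),$$
and the associated conditional Bayes risk as the map $\underline{L}$ given, for $p \in \Delta^n$, by
$$\underline{L}(p) = \inf_{q \in \Delta^n} L(p, q).$$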
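Definition 1.4.
A loss function $\ell$ is said to be proper if for any $p \in \Delta^n$
$$L(p, p) \le L(p, q)$$
for all $q \in \Delta^n$. In other words, $L(p, \cdot)$ has a minimum at $q = p$. When $q = p$ is the only minimum of $L(p, \cdot)$ we say that $\ell$ is strictly proper.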
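The definition of properness can be probed numerically. The sketch below assumes the usual conditional risk $L(p,q) = \sum_i p_i \ell_i(q)$ and checks the inequality $L(p,p) \le L(p,q)$ on randomly sampled pairs. The function names, the sampling scheme and the absolute loss used as a counterexample are illustrative assumptions, and a randomized check of this kind can refute properness but not prove it.

```python
import numpy as np

def log_loss(q):
    """Partial losses of the log loss: l_i(q) = -ln(q_i)."""
    return -np.log(q)

def brier_loss(q):
    """Partial losses of the Brier score: l_i(q) = sum_j (1[i == j] - q_j)^2."""
    n = len(q)
    return np.array([np.sum((np.eye(n)[i] - q) ** 2) for i in range(n)])

def abs_loss(q):
    """Absolute loss l_i(q) = 1 - q_i, included as a loss that is not proper."""
    return 1.0 - q

def conditional_risk(loss, p, q):
    """L(p, q) = sum_i p_i * l_i(q)."""
    return float(np.dot(p, loss(q)))

def looks_proper(loss, n=3, trials=200, seed=0):
    """Randomised check that L(p, p) <= L(p, q) on sampled pairs (p, q)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        p = rng.dirichlet(np.ones(n))
        q = rng.dirichlet(np.ones(n))
        if conditional_risk(loss, p, p) > conditional_risk(loss, p, q) + 1e-12:
            return False
    return True
```

For the log loss the gap $L(p,q) - L(p,p)$ is the Kullback–Leibler divergence and for the Brier score it is $\|p - q\|^2$, so both pass; the absolute loss is minimized at a vertex of the simplex rather than at $p$ and fails.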
For our geometric considerations it will be useful to denote the image of under by , and impose enough differentiability conditions on so that is (at least) a -manifold. See Definitions 2.1 and 3.1 below.
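We now recall the definition of mixability (see for example, Vovk [Vov15, vERW12]). For $\eta > 0$, let $E_\eta$ be the $\eta$-exponential projection defined as
$$E_\eta(y) = \bigl(e^{-\eta y_1}, \dots, e^{-\eta y_n}\bigr).$$
A loss function $\ell$ is called $\eta$-mixable if the image of its superprediction set, $\operatorname{spr}(\ell)$, given by
$$\operatorname{spr}(\ell) = \bigl\{ \lambda \in [0, \infty)^n : \text{there is } q \in \Delta^n \text{ with } \ell_i(q) \le \lambda_i \text{ for all } i \in [n] \bigr\},$$
is convex under the $\eta$-exponential projection, that is, $E_\eta(\operatorname{spr}(\ell))$ is convex. We say that $\ell$ is mixable if $\ell$ is $\eta$-mixable for some $\eta > 0$.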
Definition 1.5.
Let $\ell$ be a mixable loss function. The mixability constant of $\ell$, $\eta_\ell$, is defined as
$$\eta_\ell = \sup \{ \eta > 0 : \ell \text{ is } \eta\text{-mixable} \}.$$
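The convexity condition can be visualized for the binary log loss. Assuming the log loss $\ell_{\log}(q) = (-\ln q_1, -\ln q_2)$ and the projection $E_\eta(y) = (e^{-\eta y_1}, e^{-\eta y_2})$, a point $q = (t, 1-t)$ is sent to $(t^\eta, (1-t)^\eta)$, so the projected superprediction set is the region under the graph $y = (1 - x^{1/\eta})^\eta$. That region is convex exactly when this graph is concave. The sketch below (function names are assumptions) tests concavity on a grid for three values of $\eta$.

```python
import numpy as np

def projected_log_loss_curve(eta, x):
    """Upper boundary of E_eta(spr(log loss)) for n = 2, written as a graph y(x).

    A point (t, 1 - t) of the simplex has log loss (-ln t, -ln(1 - t)), which
    E_eta maps to (t**eta, (1 - t)**eta); eliminating t gives the formula below.
    """
    return (1.0 - x ** (1.0 / eta)) ** eta

def is_concave_on_grid(y, tol=1e-9):
    """Discrete concavity test on a uniform grid: all second differences <= tol."""
    return bool(np.all(y[:-2] - 2.0 * y[1:-1] + y[2:] <= tol))

x = np.linspace(0.01, 0.99, 197)
convex_image = {eta: is_concave_on_grid(projected_log_loss_curve(eta, x))
                for eta in (0.5, 1.0, 2.0)}
```

For $\eta = 1$ the boundary is the straight segment $x + y = 1$, the borderline case, which is consistent with the well-known fact that the log loss has mixability constant 1.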
1.4. Motivation
In this part we mainly discuss the case $n = 2$ since it is more illustrative. It has been made evident that there is a strong relation between properness and mixability. Here we make this relation more explicit and transparent from a geometric point of view. The basic motivation is as follows. It is commonly understood that properness is a property that depends on the parametrization of the boundary of the superprediction set of $\ell$ [Vov15]. It has also been shown that it is related to the “curvature” of the Bayes risk (with the standard parametrization of the simplex) [BSS05, RW10, vERW12]. Mixability is considered to be a stronger notion of convexity [Vov15], since it requires that the superprediction set remains convex under the $\eta$-exponential projection of Section 1.3 for some $\eta > 0$. The basic observation in this work is that it is possible to recast properness from a geometric point of view, i.e., independently of the parametrization of the simplex. More precisely, we define properness in terms of the loss function viewed as a map rather than in terms of the superprediction set (as it is usually defined). That is, to determine whether a given $\ell$ is proper or not, it is not enough to look at the image of $\ell$ (as the boundary of its superprediction set); one must look at how the simplex is mapped into $\mathbb{R}^2$ by $\ell$. Since we will be using tools of differential geometry, we will assume differentiability (see Section 2). Restricting first to the binary case (see Lemma 2.7 below), a given loss function $\ell$ will be (strictly) proper if and only if
- (1) the normal vector to the loss curve at $\ell(p)$ is equal to $p$ for all $p$, and
- (2) the curvature (see Section 1.5 below) of the loss curve at any point $\ell(p)$, with respect to the unit normal vector in the direction of $p$, is strictly positive for all $p$.
As observed in Figure 2, , which implies that their boundaries coincide (as a set). In particular, this implies that it is possible to “parametrize” the boundary of , , in the same way as in order to have a proper loss. However, note that this changes the map and hence from the point of view of this work, this is a different loss function. In practice, one is given a loss function rather than a superprediction set , therefore we look at losses as individual maps from to instead of looking at their superpredictions sets and obtaining a proper loss by choosing a convenient parametrization of .
Remark 1.6.
The strength of characterizing proper loss functions in this way is that we are able to apply techniques from differential geometry; however, these considerations only work for loss functions which are sufficiently differentiable. In a general setup, it is possible to characterize properness of a loss function in a fairly simple way via the convexity of its superprediction set. More precisely, the “loss surface” is the subgradient of the support function of the superprediction set. This was thoroughly studied by Williamson and Cranko in [WC22]. We briefly explore some connections to our work in Section 4. Alternative approaches to extending and better understanding mixability include [RFWM15] and [MW18].
1.5. Comments about the curvature of planar curves
The second condition for $\ell$ to be proper mentioned above involves a condition on the curvature of the loss curve. We now make this notion precise. Recall that if $\gamma$ is a curve with $\gamma'(t) \neq 0$ for all $t$ in its domain, then its curvature can be seen as a measurement of the variation of its unit normal vector at each point. We define the canonical normal vector at $\gamma(t)$ as the unit normal vector in the direction obtained by rotating $\gamma'(t)$ counterclockwise by $\pi/2$. Then, the signed curvature of $\gamma$ at $t$ is defined as
[TABLE]
The interpretation of this number is as follows: is positive if “curves” in the direction of . However, note that at each point we have two normal vectors: . Thus, and depend on the direction of (i.e., ), and their values differ by a negative sign. Thus, we can talk about the curvature of with respect to a chosen unit vector (either choosing or for all points, assuming this is possible, which is the case for the curves we will consider here, see Figure 3) and denote it by . In the case when , then , and when , then . Since is invariant under reparametrizations (up to a sign), we can simply talk at the curvature of at a given point in the image of . In Section 2 we make precise our choice in (2) above. For a summary of geometry of curves see Appendix A.
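A small numerical sketch may help with the sign conventions. It uses the standard formula $\kappa = (x'y'' - y'x'') / (x'^2 + y'^2)^{3/2}$ for the signed curvature with respect to the normal obtained by rotating the tangent counterclockwise; this formula and the function name are assumptions of the sketch and need not match the notation of the definition above. Traversing a circle of radius 2 in the two possible directions gives curvatures $+1/2$ and $-1/2$.

```python
import numpy as np

def signed_curvature(x, y, t):
    """Signed curvature of the sampled curve (x(t), y(t)) with respect to the
    normal obtained by rotating the tangent counterclockwise by 90 degrees."""
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    return (dx * ddy - dy * ddx) / (dx ** 2 + dy ** 2) ** 1.5

t = np.linspace(0.0, 2.0 * np.pi, 2001)
# Circle of radius 2 traversed counterclockwise, then the same circle reversed.
kappa_ccw = signed_curvature(2.0 * np.cos(t), 2.0 * np.sin(t), t)
kappa_cw = signed_curvature(2.0 * np.cos(-t), 2.0 * np.sin(-t), t)
```

The image of the curve is the same in both cases; only the direction of travel, and hence the canonical normal, changes. Finite differences are less accurate at the two ends of the grid, so the comparison is best made on interior points.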
Going back to loss functions, suppose is a loss function. Since is a -manifold, any parametrization around a point (of its interior) can be assumed to be of the form for some . Thus, the local expression of under this parametrization is a curve in . By changing around the same point, we are reparametrizing . Since curvature is independent of coordinates (i.e., of the used) up to a sign, we can define the curvature of the loss curve with respect to a chosen unit normal vector (which will depend only on ). To compute it from its definition in \[email protected], we need to choose a parametrization , and as we will see, many times it is convenient to take .
Remark 1.7.
One could avoid part of the technical complications above by choosing beforehand , as it is usually implicitly done, and then requiring and to be monotone (cf. [BSS05, RW10, SAM66, Vov15]) – essentially, this amounts to choosing “direction” for the admissible loss curves. Although this approach is appealing since the curve parameter ( in our case) can be directly interpreted as a probability, and moreover it simplifies calculations since in this case the convention can be chosen so that the signed curvature coincide with (see for example [Vov15]), when considering the multi-class case, the notion of “direction” breaks down and it is not clear which properties of one should consider. The approach we consider here gives a concrete logical path to a generalization to the multi-class case (see Section 3).
1.6. Reconciling this point of view with previous works
In this part we explain how to “translate” the results we obtain here to previous results regarding proper losses and mixability. We do this in particular with [RW10] and [Vov15].
- Reid–Williamson [RW10]. Let . The parameter in [RW10] corresponds to the parameter here, and correspond to and , respectively. Although the regularity assumption in [RW10] is initially only differentiability of the partial losses, when discussing the weight of a loss function they impose regularity. From Theorem 1 in [RW10], we see that a loss is proper if (in particular) and . We can heuristically say that goes from “right” to “left”. This means that in this case, . The log loss in this case is .
- Vovk [Vov15]. In [Vov15] the loss functions are defined as maps , with increasing and decreasing (infinitely differentiable). In this case, heuristically, losses go from “left” to “right” so that . To relate this convention to ours, we set . Then the parameter in [Vov15] corresponds to and and correspond to and . The log loss is then given by .
Therefore, from our point of view, in previous works there is an implicit choice of a parametrization of the simplex, particularly motivated by the wish to interpret the parameter as a probability. However, it is well known that sometimes this might not be the case and a link function is needed [RW10]. This fits well with our approach, since for us a link function is a different choice of parametrization; this will be carefully explained in Section 2.7. In favor of the study of loss functions using tools from differential geometry, we are then motivated to eliminate this choice of parametrization and consider $\ell$ as a map between manifolds (namely, the simplex and its image under $\ell$, as a submanifold of $\mathbb{R}^n$). Although picking a general parametrization of the simplex complicates the interpretation of the parameter, it makes other properties of loss functions transparent. This approach has, to the knowledge of the authors, never been explored. We remark, however, that one can always choose the standard parametrization and reinterpret the results of this work with the parameter being a probability. With this geometric characterization of loss functions and properness at hand we continue to study mixability.
2. Properness and Mixability for Binary Classification
We first restrict our discussion to binary classification, i.e., setting . Thus, we consider maps , where , with partial losses and . In this case the standard parametrization of is given by for . When a parametrization of , say , is chosen, then the local expression of with respect to () is a map from some interval to , that is, a curve in the plane .
Dating back to [HKW95, Vov98] it has been established that properness of a loss function imposes strong conditions on the first and second derivatives of their partial losses. In [Vov15] these relations were expressed by means of the curvature of the loss curve. Moreover, in [BSS05, RW10] properness is related to the second derivative of its Bayes risk, which in a way can be interpreted as its curvature. However, in these works there is always an implicit choice of parametrization of , which in turn imposes certain restrictions on the “admissible” loss functions, particularly making the results parametrization dependent. In this section, we first recast properness as a geometric property which allows us to obtain results in a parametrization (or coordinate) independent way.
Definition 2.1.
An admissible loss function is a map such that
- (i) the image of $\ell$ is a 1-manifold of class $C^2$,
- (ii) there exists a differentiable map , , where is the normal space of , and
- (iii) or belongs to for all .
We denote the set of admissible loss functions as .
Remark 2.2.
We give the following interpretation of the previous definition. (i) simply says that the loss curve (once parametrized) is twice differentiable with continuous second partial derivatives. (ii) prevents some “anomalies” on , for example, can not be constant on a neighborhood of a point. (iii) defines a subfamily of loss curves which are not allowed to vary “too much”. This definition should be compared to the definition of loss functions in Section 2 in [Vov15].
Definition 2.3.
Let . Let be the map that assigns to each the normal vector to at that lies in . We denote by the signed curvature of with respect to the unit normal belonging to . We refer to as the curvature with respect to the unit normal vector pointing towards .
2.1. Proper losses
Lemma 2.4.
Suppose that in is strictly proper, then the signed curvature of the loss curve has a sign. Moreover, its curvature, , is positive with respect to unit normal vector (field) pointing towards .
Proof.
Let and let be a parametrization of around , for some , which we use to obtain a parametrization of around . (Footnote 1: Notice that this particular choice of coordinates around suffices, since we want to conclude something about the curvature of the loss curve .) We consider the local expression of given by
[TABLE]
Using strict properness we know that fixing , the function achieves a minimum at (and it is the only one), that is
[TABLE]
To compute the sign of the signed curvature of it is enough to determine the sign of . Without loss of generality, assuming on this coordinate neighborhood we can write
[TABLE]
where we have used \[email protected] and \[email protected]. Notice that if for some then necessarily by \[email protected], which is impossible in . Therefore has a sign and this sign determines the sign of the signed curvature of .
For the second statement, notice that again using \[email protected] we know that and have different signs (and they do not change). If , then that means that the first coordinate increases and the second decreases, hence points towards and . If , then we are in the opposite case and in this case points to and , thus the signed curvature with respect to (the unit normal pointing towards ) is positive. ∎
From the proof of the previous lemma we obtain the following corollary.
Corollary 2.5.
Let . If is proper, then is normal to the loss curve at .
Proof.
It follows directly from \[email protected], since for fixed , attains a minimum at . ∎
Lemma 2.6.
In , proper implies strictly proper.
Proof.
Let , and suppose that there is in , such that
[TABLE]
Using \[email protected], we see that is normal to at , and hence and are parallel. Since both belong to , it follows that , which is a contradiction. ∎
Therefore, in what follows (as long as we stay within ) we will use proper and strictly proper interchangeably.
Note that the converse of Lemma 2.4 does not hold. That is, there are which have positive signed curvature (with respect to the unit normal pointing towards ), but are not proper. Indeed let be defined as
[TABLE]
Taking the (standard) parametrization we see that the loss curve goes from left to right so points towards . Moreover, we can readily see that the (signed) curvature is positive. However, is not normal to at , thus by Corollary 2.5, can not be proper.
Therefore, we obtain the following characterization of proper losses in .
Lemma 2.7.
Let . is strictly proper if and only if is normal to the loss curve at for all and the signed curvature of with respect to the normal vector pointing towards is positive at all points for .
Proof.
The “only if” part follows from Lemma 2.4 and Corollary 2.5. For the “if” part, let $\ell$ be such that
[TABLE]
where is the signed curvature of with respect to the unit normal pointing towards . Let and let be a parametrization around . We readily see that \[email protected] implies that
[TABLE]
while \[email protected] implies by the proof of Lemma 2.4. This implies that fixing , achieves its minimum at . Then is proper and by Lemma 2.6, we conclude it is strictly proper. ∎
Remark 2.8.
Notice that to check whether a given loss function is proper or not, it suffices to do it in any coordinate system. That is, given , we check conditions \[email protected] and \[email protected] for .
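The normality condition is easy to test in coordinates. The sketch below assumes the parametrization $q(t) = (t, 1-t)$ of the simplex and the binary log loss curve $t \mapsto (-\ln t, -\ln(1-t))$, and evaluates $\langle q(t), \tfrac{d}{dt}\ell(q(t)) \rangle$, which vanishes exactly when $q(t)$ is normal to the loss curve at $\ell(q(t))$. It also evaluates the same quantity for a second map with the same image, obtained by replacing $t$ with $t^2$. All names are assumptions made for this illustration.

```python
import numpy as np

def tangent(curve, t, h=1e-6):
    """Central finite-difference approximation of the derivative of a planar curve."""
    return (curve(t + h) - curve(t - h)) / (2.0 * h)

def normality_defect(curve, t):
    """<q(t), d/dt curve(t)> with q(t) = (t, 1 - t); zero when q(t) is normal to the curve."""
    q = np.array([t, 1.0 - t])
    return float(np.dot(q, tangent(curve, t)))

def log_loss_curve(t):
    return np.array([-np.log(t), -np.log(1.0 - t)])

def reparametrized_curve(t):
    # Same image as the log loss curve, but traversed with t**2 in place of t.
    return log_loss_curve(t ** 2)

ts = np.linspace(0.1, 0.9, 9)
defect_log = max(abs(normality_defect(log_loss_curve, t)) for t in ts)
defect_rep = min(abs(normality_defect(reparametrized_curve, t)) for t in ts)
```

The first defect is zero up to discretization error, while the second equals $2/(1+t)$ in absolute value and never vanishes. The two maps trace the same curve and hence share a superprediction set, yet only the first satisfies the normality condition, which illustrates why properness is treated here as a property of the map rather than of its image.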
2.2. Mixable loss functions
We say that a loss function is fair if as and as (this is motivated by the interpretation when using the standard parametrization, see [RW10]). In addition, recall that a loss function is proper if and only if
- (i)
can be chosen, and 2. (ii)
for all .
Thus, a prototype of a fair proper loss function is shown in Figure 4.
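Recall from Section 1 that mixability is defined in terms of the superprediction set of $\ell$. More precisely, for $\eta > 0$, consider the set
$$E_\eta(\operatorname{spr}(\ell)) = \{ E_\eta(\lambda) : \lambda \in \operatorname{spr}(\ell) \},$$
where $E_\eta$ is the exponential projection defined in Section 1.3. Then, $\ell$ is $\eta$-mixable if and only if $E_\eta(\operatorname{spr}(\ell))$ is convex.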
Remark 2.9.
We stress the fact that this definition depends on the superprediction set of $\ell$ rather than on $\ell$ itself: two different loss functions with the same superprediction set will be equally mixable. From our perspective, when talking about mixability of the map (i.e., without making reference to the superprediction set), we see that we can define it as follows. A loss is mixable if the 1-dimensional manifold has signed curvature . We will adopt the latter version here. Although these definitions are clearly equivalent, it is useful to have this at hand to relate mixability with properness. From now on, when we say that $\ell$ is mixable we mean it in the latter sense. See Figure 5.
We close this part by describing the log loss, which will play an important role. Let , given by
[TABLE]
Let . Then
[TABLE]
Since , its canonical normal vector is
[TABLE]
The curvature with respect to , the normal vector pointing towards , is then given by
[TABLE]
When there is no risk of confusion with denote simply as .
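For reference, the curvature of the log loss curve can be worked out by hand. The following derivation is a sketch that assumes the parametrization $t \mapsto (t, 1-t)$ of the simplex and the standard planar formula $\kappa = (x'y'' - y'x'')/|\gamma'|^3$; the paper's own convention may order the coordinates differently, but the resulting absolute value is symmetric under $t \leftrightarrow 1-t$ and so does not depend on that choice.

```latex
\begin{align*}
\gamma(t) &= \bigl(-\ln t,\; -\ln(1-t)\bigr), \qquad t \in (0,1),\\
\gamma'(t) &= \Bigl(-\tfrac{1}{t},\; \tfrac{1}{1-t}\Bigr), \qquad
\gamma''(t) = \Bigl(\tfrac{1}{t^{2}},\; \tfrac{1}{(1-t)^{2}}\Bigr),\\
x'y'' - y'x'' &= -\frac{1}{t(1-t)^{2}} - \frac{1}{t^{2}(1-t)} = -\frac{1}{t^{2}(1-t)^{2}},\\
|\gamma'(t)|^{3} &= \frac{\bigl(t^{2} + (1-t)^{2}\bigr)^{3/2}}{t^{3}(1-t)^{3}},\\
|\kappa(t)| &= \frac{t(1-t)}{\bigl(t^{2} + (1-t)^{2}\bigr)^{3/2}}.
\end{align*}
```

With this direction of travel the signed curvature with respect to the canonical normal is negative, and it is positive with respect to the opposite unit normal, which points into the region above the curve. At $t = 1/2$ the value is $1/\sqrt{2}$, and the curvature tends to $0$ as $t \to 0$ or $t \to 1$.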
2.3. Mixability and curvature
Haussler, Kivinen and Warmuth in [HKW95] gave a characterization of the mixability constant of a mixable proper binary loss function in terms of the first and second derivatives of its partial losses. We reprove this characterization from a geometric point of view, that is, independent of the parametrization chosen for .
Let be proper and a 1-chart parametrization of . (Footnote 2: This means that the map is such that .) Then will be convex if and only if the curve has negative curvature with respect to the unit normal pointing towards . Since is proper we can assume without loss of generality that . We are then interested in computing the signed curvature of
[TABLE]
and showing that . We have
[TABLE]
and
[TABLE]
and thus we have
[TABLE]
Note that the sign of is the sign of . If is positive, then one can check that and , thus the first term in brackets is necessarily negative. Thus by making large will become negative. Then we want
[TABLE]
that is,
[TABLE]
When considering the case when the signed curvature is negative, we have:
Lemma 2.10.
Suppose that is a proper loss function. Then, if is mixable, for any 1-chart parametrization of , the mixability constant is given by
[TABLE]
Conversely, if \[email protected] holds, then is mixable with mixability constant .
By the local nature of curvature, it would be possible to consider a “local version” of Lemma 2.10, which would characterize a “local” notion of mixability. This alternative will not be pursued here.
In [Vov15], Vovk observes that mixability for proper losses is equivalent to a quotient of curvatures being bounded away from zero. For the reader’s convenience we prove this statement. To recover Vovk’s statement observe that the properties he imposes on the loss functions imply that is the signed curvature (see Section 1.6).
Lemma 2.11.
A proper loss function is mixable if and only if
[TABLE]
where denotes the curvature of . Moreover, when this holds,
[TABLE]
Proof.
By Lemma 2.10, $\ell$ is mixable with mixability constant $\eta_\ell$ if and only if
[TABLE]
for any given 1-chart parametrization . Setting and using \[email protected], we have the following. For any ,
[TABLE]
where we used that by properness (see \[email protected]).
Since is independent of the parametrization, we obtain the result. ∎
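The quotient of curvatures can be evaluated directly in a simple case. The sketch below assumes the parametrization $q(t) = (t, 1-t)$, the log loss curve $(-\ln t, -\ln(1-t))$ and the binary Brier score in the normalization $\ell_i(q) = \sum_j (\mathbf{1}[i=j] - q_j)^2$, which gives the curve $(2(1-t)^2, 2t^2)$. It then takes the minimum over a grid of the quotient of the curvature of the Brier curve by that of the log loss curve. Function names and the choice of normalization are assumptions; a different scaling of the Brier score rescales the result.

```python
import numpy as np

def curvature(d1, d2):
    """Unsigned curvature of a planar curve from its first and second derivatives."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    return np.abs(cross) / (d1[0] ** 2 + d1[1] ** 2) ** 1.5

def log_loss_derivatives(t):
    # First and second derivatives of t -> (-ln t, -ln(1 - t)).
    return (-1.0 / t, 1.0 / (1.0 - t)), (1.0 / t ** 2, 1.0 / (1.0 - t) ** 2)

def brier_derivatives(t):
    # First and second derivatives of t -> (2 (1 - t)^2, 2 t^2).
    return (-4.0 * (1.0 - t), 4.0 * t), (4.0 * np.ones_like(t), 4.0 * np.ones_like(t))

t = np.linspace(0.01, 0.99, 99)
quotient = curvature(*brier_derivatives(t)) / curvature(*log_loss_derivatives(t))
eta_estimate = quotient.min()
```

In closed form the quotient is $1/(4t(1-t))$, whose infimum over $(0,1)$ is $1$, attained at $t = 1/2$. It is bounded away from zero, so this Brier score is mixable, but it is unbounded near the endpoints.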
Remark 2.12.
Lemma 2.11 exemplifies the usefulness of the standard parametrization. The curvature of the log loss is easily computed with respect to the standard parametrization, and by fixing it we can easily recognize when the curvature of the log loss appears in our computation. However, since curvature is a geometric quantity, we know this relation between curvatures will hold for any other parametrization too.
Using this point of view, the following observations shed light on why the weight function in [BSS05] and in [RW10] encodes essentially all the relevant information in the binary case. Recall that given a proper loss function , the weight of (with respect to a local parametrization of ) is defined as
[TABLE]
We stress that the weight depends on the coordinates of that we use, and hence we use the notation . As observed in Remark 2.12, we sometimes set (as it is done in [BSS05, RW10]) to be able to recognize some terms.
Lemma 2.13.
Let be a proper loss and a local parametrization of , denote by its local expression and by be its weight. Then we have for any ,
[TABLE]
and moreover, if is another proper loss,
[TABLE]
In particular, when ,
[TABLE]
and if in addition, (with ),
[TABLE]
Proof.
Let be a proper loss and let be any parametrization of around . Let us compute (assuming w.l.o.g. that , which means and ).
[TABLE]
where we have used that by properness we know that (\[email protected]), which implies by differentiating with respect to from the third to the fourth equality, and that since from the third to last to the second to last equality.
Notice that in the last equation of the previous string of equalities, the only term involving is (or more precisely ) and the remaining terms depend only on the parametrization . Then we obtain
[TABLE]
The remaining statements follow from setting and \[email protected]. ∎
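The relation between weights and curvatures can be checked numerically in the standard parametrization. The sketch below assumes the weight is $w(t) = -\underline{L}''(t)$, the negative second derivative of the conditional Bayes risk as in [RW10], with $q(t) = (t, 1-t)$, and compares the quotient of weights of the log loss and the Brier score with the closed-form quotient of curvatures $1/(4t(1-t))$ of the corresponding loss curves. The function names, the Brier normalization and the finite-difference step are assumptions of this illustration.

```python
import numpy as np

def bayes_risk(loss_curve, t):
    """Conditional Bayes risk of a proper binary loss under q(t) = (t, 1 - t)."""
    l1, l2 = loss_curve(t)
    return t * l1 + (1.0 - t) * l2

def weight(loss_curve, t, h=1e-4):
    """w(t) = -(d^2/dt^2) of the Bayes risk, via a central second difference."""
    return -(bayes_risk(loss_curve, t + h) - 2.0 * bayes_risk(loss_curve, t)
             + bayes_risk(loss_curve, t - h)) / h ** 2

def log_loss_curve(t):
    return -np.log(t), -np.log(1.0 - t)

def brier_curve(t):
    return 2.0 * (1.0 - t) ** 2, 2.0 * t ** 2

t = np.linspace(0.1, 0.9, 9)
weight_quotient = weight(log_loss_curve, t) / weight(brier_curve, t)
curvature_quotient = 1.0 / (4.0 * t * (1.0 - t))   # Brier curvature over log loss curvature
```

For these two losses the quotient of curvatures (Brier over log loss) coincides with the quotient of weights taken in the opposite order (log loss over Brier), even though each weight on its own depends on the chosen parametrization.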
Remark 2.14.
Combining Lemma 2.11 and \[email protected], we recover the characterization of the mixability constant in terms of the quotient of weights obtained by van Erven–Reid–Williamson in [vERW12, Section 4.1]. However for the corresponding statement involving the quotient of second derivatives of the Bayes risks, the fact that has an affine parametrization is important. Indeed, this relies on Corollary 3 in [RW10] that states that . In general, it can be checked that
[TABLE]
which reduces to when . From the point of view of the present work, (or a quotient of them) is not a good quantity to consider since it strongly depends on coordinates. However, notice that if one restricts to affine parametrizations of then depends on and and hence in view of Lemma 2.13 restricting to a fixed affine parametrization of will make quotients of the second derivative of the Bayes risk well behaved.
Let us remark some points about Lemma 2.13.
- •
Let be a given strictly proper, fair, loss function. Given a parametrization, we obtain a weight given by \[email protected], that is, the weight depends on the parametrization.
- •
The curvature of is independent of up to a sign. However, when defining we made the choice of the sign in a uniform way, thus the curvature is independent of the parametrization for the family of losses considered here. Then it follows that the quotient of curvatures is independent of the parametrization and by \[email protected], it also follows that the quotient of weights is also independent of the coordinates (despite the weights being coordinate dependent themselves).
- •
A corresponding notion of weight in higher dimensions (for the multi-class case) is considerably more complicated, and it is unclear whether using such weights would lead to successful results. One higher dimensional analog of curvature is readily seen to be the so-called “principal curvatures” of a hypersurface in Euclidean space (see Appendix A). This will be the main motivation when dealing with the multi-class case (Section 3). Alternative ways to characterize proper higher dimensional loss functions have been studied in [WVR16].
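As a numerical illustration of the quotient of weights, the following sketch estimates its infimum for the square loss against the log loss. It assumes the weight convention of [RW10], namely that in the standard parametrization a proper loss satisfies $\ell_1'(p) = -(1-p)\,w(p)$, so that the log loss has weight $1/(p(1-p))$. The choice of the square loss with partial losses $((1-p)^2, p^2)$ and all function names are illustrative and do not come from the text.

```python
import math

def weight(partial_loss_1, p, h=1e-6):
    """Assumed convention: w(p) = -l_1'(p) / (1 - p), with l_1' from a central difference."""
    derivative = (partial_loss_1(p + h) - partial_loss_1(p - h)) / (2 * h)
    return -derivative / (1 - p)

def log_l1(p):
    # Partial loss of the log loss for the outcome y = 1.
    return -math.log(p)

def square_l1(p):
    # Partial loss of the (illustrative) square loss for the outcome y = 1.
    return (1 - p) ** 2

# Evaluate the quotient w_log / w_square on a grid in the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
quotients = [weight(log_l1, p) / weight(square_l1, p) for p in grid]
eta_estimate = min(quotients)
print(round(eta_estimate, 4))
```

Under these assumptions the quotient equals $1/(2p(1-p))$, whose minimum over $(0,1)$ is attained at $p = 1/2$ and equals $2$, in agreement with the known mixability constant of the binary square loss.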
2.4. Geometric comparison of loss functions
Fix a proper, fair loss function . Given another proper, fair loss function , how might we compare them? From the point of view of differential geometry, since given the normal vectors at and coincide, it is natural to look at their curvatures. Motivated by Lemma 2.11, we impose (for the moment) the condition
[TABLE]
Note that this implies that for all . We divide the comparison in steps for clarity.
- (1)
Expressing as a function. Note that since is proper and fair, the normal vector to a point can only be when (i.e., when evaluating at the boundary of ). Thus, the set can be expressed as a graph over the -axis. To obtain an explicit expression let . We use the fact that (where could be infinity) is invertible. Then, we have that
[TABLE]
where .
- (2)
Translating and parametrizing . Let with , if such does not exist then . We define by , i.e., we translate so that it coincides with at . ( is not fair anymore, however, the curvature is invariant under translations.)
We now parametrize as the graph of a function defined on an interval around (the -coordinate of ), “aligning” it with (we can assume this interval to be maximal). We let . Since , we know that around the graph of is to the northeast of .
- (3)
Comparison. If the graph of is to the northeast of on the whole , then we see that the superprediction set of is contained in that of . If this does not hold, it means that there is such that , and w.l.o.g. we can assume . Thus we know that on and on , i.e., the boundary of . Define the second order operator which computes the curvature of the graph (see \[email protected]):
[TABLE]
Since , we see that on . The maximum principle now implies that the supremum of is attained at the boundary on , and hence we know that on , which is a contradiction. Thus the superprediction set is contained in the superprediction set of (see Section 4).
More generally, if we assume instead that
[TABLE]
for some , we see (see Appendix A) that satisfies
[TABLE]
That is, we can reproduce the previous analysis with instead of .
The previous discussion motivates right away a comparison between proper, fair loss functions.
Definition 2.15**.**
Let be a proper, fair loss in , which we call a base loss. We say that a proper, fair loss is mixable with respect to if
[TABLE]
2.5. Mixability and fundamentality as comparison to the log loss
Now, suppose is proper and fair. Thus, in particular for all . We want to think of mixability as a geometric comparison to the log loss as suggested by Vovk in [Vov15] and give a detailed interpretation of this comparison. We fix the standard parametrization of , , given by
[TABLE]
The log loss in these coordinates is thus given by
[TABLE]
and by \[email protected], its curvature with respect to the unit normal pointing towards is given by
[TABLE]
Notice that for all and as or . Thus, clearly by Lemma 2.7, for any proper subinterval of (cf. [Vov15, Corollary 2]), we have
[TABLE]
Thus, whether a proper, fair loss function is mixable or not will depend on the behavior of the quotient as approaches and . More precisely, we have obtained the following.
Lemma 2.16**.**
Let be a proper loss. Then is mixable if and only if
[TABLE]
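The behavior of the curvature of the log loss curve near the boundary can be checked numerically. The sketch below assumes the standard parametrization $p \mapsto (-\ln(1-p), -\ln p)$ and uses the usual planar-curve formula $|x'y'' - y'x''|/(x'^2 + y'^2)^{3/2}$; the closed form $p(1-p)/(p^2 + (1-p)^2)^{3/2}$ is derived here by hand from that formula and may be presented differently in the text. Function names are hypothetical.

```python
import math

def log_loss_curvature(p):
    """Curvature of p -> (-ln(1-p), -ln p) from the planar-curve formula."""
    x1, x2 = 1 / (1 - p), 1 / (1 - p) ** 2   # first and second derivative of -ln(1-p)
    y1, y2 = -1 / p, 1 / p ** 2              # first and second derivative of -ln p
    return abs(x1 * y2 - y1 * x2) / (x1 ** 2 + y1 ** 2) ** 1.5

def closed_form(p):
    # Hand-derived simplification of the expression above (an assumption of this sketch).
    return p * (1 - p) / (p ** 2 + (1 - p) ** 2) ** 1.5

for p in (1e-6, 1e-3, 0.1, 0.5, 0.9, 1 - 1e-3, 1 - 1e-6):
    print(p, log_loss_curvature(p), closed_form(p))
```

The printed values are strictly positive in the interior and shrink towards zero as $p$ approaches either endpoint, which is why only the boundary behavior of the quotient of curvatures matters for mixability.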
Motivated by this we make the following definition.
Definition 2.17**.**
Let be a proper, fair loss function in , and be the standard parametrization of . We say that is -logarithmic at the boundary if
[TABLE]
Let us analyze what this means. Suppose that is proper and -logarithmic. Then for any , using \[email protected] in Lemma 2.13 and \[email protected], we have
[TABLE]
Notice that as ,
[TABLE]
and similarly,
[TABLE]
that is, we are only comparing the rate at which , , go to 0 (since they do by fairness) with the rate at which the log loss does.
In [Vov15], Vovk defines a loss function to be fundamental if given a (computable, proper, mixable) loss function and a data sequence in that is random under with respect to a prediction algorithm , then it is random under with respect to . He shows that a fair, mixable is fundamental if and only if (using the notation in [Vov15])
[TABLE]
Since we have seen that mixability can be regarded as a comparison of curvatures of the loss curve of and that of and we have reinterpreted fundamentality as a comparison of and near the boundary building on Definition 2.15, we can easily come up with a notion of -fundamentality.
Definition 2.18**.**
Let be a proper, fair loss function in . We say that a proper, fair loss function is -fundamental if
- •
* is mixable with respect to , and*
- •
when , we have
[TABLE]
Suppose now that a mixable loss function is fundamental. Then there exist such that
[TABLE]
for all . This implies that
[TABLE]
for all , which readily implies (Appendix A) that
[TABLE]
for all .
Rephrasing the previous discussion we have obtained the following characterization of fundamentality.
Theorem 2.19**.**
A loss function is fundamental if and only if there exist numbers , such that for any , there are translation vectors and in such that
[TABLE]
2.6. Constructing new mixable losses from previous ones
We now observe how mixability helps us to construct new proper, fair and mixable functions from previous proper, fair and mixable losses. We first define a family of losses that will serve to illustrate the idea. We set and . Let and define the loss function
[TABLE]
It can be readily checked that , thus since
[TABLE]
it follows that is 1-mixable for and it is not if . Note that is still proper and fair. Take then , we can readily see that there exists a proper, fair and mixable loss function such that
[TABLE]
Indeed, , which is fair, proper and 1-mixable.
This process works in a more general setting than scalings of . Consider for example the spherical loss defined in coordinates by
[TABLE]
It can be easily checked that this is bounded, proper and fair and that . Thus
[TABLE]
thus is 1-mixable. Thus, as before, there is a loss function such that . Moreover, the loss function given (in coordinates) by
[TABLE]
which can be seen to be unbounded, proper, fair and mixable.
We close this part with the following observation. Suppose that is a proper, fair, mixable loss function with mixability constant . Then the loss function is 1-mixable. Thus, there exists a proper, fair, mixable loss such that
[TABLE]
As we will see in Section 4, the previous observation can be interpreted from the point of view of the superprediction sets of the involved loss functions and convex geometry: slides freely inside (see Theorem 4.23).
2.7. Composite losses and the canonical link
In this part we discuss composite losses following [RW10]. Let us recall their setting. Let be a set of prediction values. A link function is a continuous map . Given a loss function and assuming , if is invertible, we define the composite loss as
[TABLE]
Definition 2.20**.**
A composite loss is a proper composite loss if is a proper loss in the sense of [RW10].
Recall that in [RW10], is implicitly assumed. Then, given a loss function (in the [RW10] sense), we can construct a loss function , by . Then, the composite loss can be expressed as
[TABLE]
In other words, the composite loss is the local expression of with respect to the parametrization of . We denote the local expression of with respect to by , that is
To show how this reconciliation of terms works, we obtain a result similar to Corollary 12 in [RW10]. Suppose that a composite loss is given and it has differentiable partial losses (i.e., the corresponding loss is in ), furthermore, we assume that is a diffeomorphism which in one dimension means it is strictly monotonic. Then we know that is strictly proper if and only if is strictly proper (by definition). This implies that is normal to at for all and its curvature is positive (with respect to the unit normal pointing towards ). This means for all ,
[TABLE]
where we have used that is a diffeomorphism and that for all parametrizations of . Therefore, we have
[TABLE]
that is
[TABLE]
for all .
Since we are working with valid reparametrizations the choice of will not affect the curvature of . Hence we obtain
Corollary 2.21**.**
A composite loss is strictly proper if and only if is strictly proper and satisfies
[TABLE]
for all .
Remark 2.22**.**
We have seen that whether a loss function is strictly proper or not, depends on whether conditions \[email protected] and \[email protected] hold or not. Notice that under an (admissible) change of coordinates, for example given by a link , \[email protected] will not be modified. However, \[email protected] might change (since in a way, we are changing the “velocity” at which we move on ). Hence, Corollary 2.21 is giving us a way to define the set of admissible links (or reparametrizations of ) given a loss function and the standard parametrization of . In this case, the new parametrization is given by .
For applications, it is desired to be able to work with a given composite loss , and moreover, to have convexity of the partial losses and . From our point of view, we see as the local expression of some , so that .
Let us work with the partial losses separately:
[TABLE]
[TABLE]
Proceeding as in the proof of Lemma 2.4, properness implies
[TABLE]
or, equivalently,
[TABLE]
Therefore, we can define as
[TABLE]
where is the weight of , we can rewrite the derivatives of the partial losses of as
[TABLE]
[TABLE]
Taking second derivatives we have
[TABLE]
[TABLE]
A way to guarantee both expressions are positive is as follows. Assume w.l.o.g. that . Since we are assuming , is increasing and is decreasing (also we have is increasing and is decreasing). We readily see that imposing
[TABLE]
for all , is enough to guarantee both second derivatives to be strictly positive.
Definition 2.23**.**
Given strictly proper, we define the canonical link as the link defined by
[TABLE]
for , where is defined in \[email protected].
The differential equation \[email protected] can be seen as a separable ordinary differential equation, which is solvable for loss functions in .
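To make the separable equation concrete, the sketch below integrates it numerically for the log loss. It assumes, following [RW10], that the canonical link satisfies $\psi'(p) = w(p)$ with $w(p) = 1/(p(1-p))$ for the log loss, and it fixes the integration constant by the normalization $\psi(1/2) = 0$; both the normalization and the function names are choices made for this illustration.

```python
import math

def log_loss_weight(p):
    # Weight of the log loss in the standard parametrization (assumed convention of [RW10]).
    return 1.0 / (p * (1.0 - p))

def canonical_link(p, weight=log_loss_weight, steps=2000):
    """Integrate psi'(t) = weight(t) from t = 1/2 to t = p with Simpson's rule."""
    a, b = 0.5, p
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = weight(a) + weight(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * weight(a + k * h)
    return total * h / 3

for p in (0.1, 0.3, 0.5, 0.8):
    print(p, canonical_link(p), math.log(p / (1 - p)))
```

Under these assumptions the numerical solution agrees with the logit function $\ln(p/(1-p))$, the familiar canonical link of the log loss.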
To give a geometric meaning, we look at the norm of the velocity of the loss curve .
[TABLE]
By assuming and , we have
[TABLE]
Thus the canonical link gives a parametrization of such that is a curve whose velocity vector at has norm equal to the length of the vector . In other words, it is a parametrization of the loss curve such that for , the tangent vector at the point has length . We close this discussion with a characterization of the canonical link.
Theorem 2.24**.**
Let be a strictly proper loss function and its canonical link. The reparametrization of determined by its canonical link is a parametrization of with weight equal to 1.
Proof.
Let be the reparametrization of determined by the canonical link. Since
[TABLE]
for all , and from Definition 2.23
[TABLE]
for all , we have
[TABLE]
Thus . ∎
3. Mixability for Multi-Class Classification
Now we focus our attention on multi-class classification loss functions, that is, maps given by the partial losses
[TABLE]
Our main goal is to interpret mixability as a geometric comparison of a given loss function to the log loss, as we did for the binary case. As suggested by the comments after Remark 2.11, the extra work of characterizing properness and mixability in a geometric way (coordinate independent) will pay off since to carry out the comparison we will look at the scalar second fundamental forms of and . The scalar second fundamental form measures how a Riemannian manifold curves inside an “ambient space”, in this case how curves inside (see Appendix A for details).
The definition of (Definition 2.1) can be extended to higher dimensions.
Definition 3.1**.**
An admissible loss function is a map such that
- (i)
* is a -manifold of class ,*
- (ii)
there exists a differentiable map , , where is the normal space of , and
- (iii)
* or belongs to for all .*
We denote the set of admissible loss functions as , or simply when the dimension is clear from context.
We fix the log loss and denote it for convenience by , as the map
[TABLE]
for .
Let and consider a parametrization of around . The local expression of the conditional risk (using the parametrization of around ) is given by
[TABLE]
where and .
Imposing to be proper implies that when fixing , is a critical point of , that is,
[TABLE]
for all . Note that since the tangent space of at , , is generated by , we conclude that is a normal vector. In other words, as before, we have
[TABLE]
for all .
The fact that achieves a minimum at (at interior points) is equivalent to requiring that the Hessian, , is positive definite at . The Hessian of at is given by
[TABLE]
The next step is to relate to the scalar second fundamental form of (see Appendix A for its definition). More precisely, we compute the with respect to a local parametrization of , i.e., we obtain the matrix representing . To do this we need to compute the second derivatives of its parametrization (Appendix A). Since,
[TABLE]
we have
[TABLE]
The scalar second fundamental form (with respect to the normal vector pointing towards ) is then given by
[TABLE]
for , thus if is positive definite, then the matrix is positive definite. In this case its eigenvalues are strictly positive and hence, the principal curvatures of at (see Appendix A), (with respect to the unit normal pointing towards ) are all positive. Therefore, using a similar reasoning as we did in the case , we have obtained the following geometric characterization of properness (by following the same arguments as in Section 2).
Lemma 3.2**.**
Let . is strictly proper if and only if and the principal curvatures of at , (), are strictly positive for all .
We briefly explain how the comparison of scalar second fundamental forms will be performed. We follow a similar procedure as the one described in Section 2.4 for the case .
- (1)
We establish that given a proper loss function , around every , can be parametrized as a graph of a function defined on a neighborhood around some such that . We do this explicitly for the log loss .
- (2)
Since and are proper, the normal vector to and at and , respectively, is . Hence we can identify their tangent spaces at these points. We do so and fix the parametrizations given in step (1).
- (3)
By assuming -mixability of , we look at the principal curvatures of and prove an equivalent condition for them to be non-negative with respect to the normal vector field pointing towards (i.e., convexity). The condition to be satisfied is seen to be a comparison of the scalar second fundamental forms of and that we can recognize by step (1).
- (4)
We interpret this comparison as follows. Since the tangent spaces to (and ) and coincide for the chosen point , we translate so that it coincides with at and call this common tangent space (note that it can be identified with the supporting plane of the loss functions at the given point). Then, if we express (locally) and over , the graph of lies above the graph of . See Figure 6.
3.1. Representing proper loss functions as graphs over Euclidean spaces
When restricting to the set of admissible loss functions (), we can represent losses as functions over (a similar approach was taken in [vERW12]; the difference lies in the fact that here we are after the comparison of second fundamental forms), which allows us to represent geometric quantities in a simple way. This will be useful to recognize these quantities when comparing a proper loss function to the log loss , as we did for the binary case in Section 2. Let be a proper loss in given by
[TABLE]
Let be the standard parametrization of given by
[TABLE]
where . The local expression of in these coordinates is then given by , so that . Also, we define the projection as .
Recall that properness implies that the normal vector of at can be chosen to be , for . As a consequence, the normal vector is never parallel to the hyperplane , so that around any point with , can be written as a graph over (as regular as is). In general, the existence of this function is guaranteed by the implicit function theorem, however, in our case we can give an explicit description of it as follows. The function is a map with injective derivative, say around for a fixed , therefore, the inverse function theorem ensures the existence (and differentiability) of a local inverse, which we can denote by . This inverse map can be seen as a local parametrization of . Thus, the local expression of (viewed as a map from to ), (where the latter are small neighborhoods around and respectively) is given by
[TABLE]
This map is a diffeomorphism and its inverse , will be denoted by
[TABLE]
We warn the reader about this abuse of notation, is not the inverse of , it is a map satisfying
[TABLE]
We want to define such that . We see that setting , so that it contains , we arrive at
[TABLE]
We have obtained the following result.
Lemma 3.3**.**
Let be a strictly proper loss. Let . Then there exists an open set and a function such that admits the parametrization
[TABLE]
around .
Let and be as in Lemma 3.3. The unit normal vector field (pointing towards ) is then given by
[TABLE]
We proceed to calculate the scalar second fundamental form. The first and second derivatives of are given by
[TABLE]
for , where denotes the canonical basis of and is the 0 vector of . Denote by the scalar second fundamental form of . Thus with respect to these coordinates we have
[TABLE]
for .
3.1.1. as a graph
Fix an arbitrary point . The local expression of (with respect to the standard parametrization around and around ) is given by
[TABLE]
thus, we have
[TABLE]
Fix . Thus, around , using Lemma 3.3, around can be described as
[TABLE]
Moreover, in this case we have the explicit expression . Notice that . We now compute the scalar second fundamental form of at .
[TABLE]
for (here denotes the Kronecker delta). In particular,
[TABLE]
for , and since we have
[TABLE]
for
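The graph description of the log loss can be sanity-checked numerically. The sketch below assumes that the log loss is $p \mapsto (-\ln p_1, \dots, -\ln p_n)$, so that its image satisfies $\sum_i e^{-x_i} = 1$ and the last coordinate is the function $f(x_1, \dots, x_{n-1}) = -\ln\bigl(1 - \sum_{i<n} e^{-x_i}\bigr)$ of the remaining ones. The ordering of coordinates and the function names are assumptions of this illustration and may differ from the conventions of Lemma 3.3.

```python
import math

def log_loss(p):
    """Multi-class log loss, assumed to be p -> (-ln p_1, ..., -ln p_n)."""
    return [-math.log(pi) for pi in p]

def graph_function(x):
    """Last coordinate of the log loss surface as a function of the first n-1 coordinates."""
    return -math.log(1.0 - sum(math.exp(-xi) for xi in x))

p = [0.2, 0.5, 0.3]            # an interior point of the simplex
point = log_loss(p)
print(point[-1], graph_function(point[:-1]))
```

Both printed numbers agree up to rounding, confirming that the point lies on the graph of the assumed function.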
Remark 3.4**.**
Note that if instead of we would have used a translation of it, that is, for , define a loss function by
[TABLE]
we can repeat the previous computation. The only difference is that we would have a different point instead of .
3.2. Geometric interpretation of mixability
Mixability is defined as a property of the superprediction set of a proper loss . More precisely, is mixable if and only if is convex for some . As before, we can determine whether is convex by looking at its boundary . is convex if the principal curvatures of are non-negative (when defined with respect to the inner pointing normal vector) at all points. Since convexity is a global property that can be tested “locally everywhere”, it makes sense to make the following definition.
Definition 3.5** (-Mixability at ).**
We say that is -mixable at if has non-negative principal curvatures with respect to the unit normal vector pointing towards at .
Clearly, is -mixable at all if and only if it is -mixable.
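In the binary case this pointwise notion can be explored numerically. The sketch below assumes the usual $\eta$-exponential projection $x \mapsto (e^{-\eta x_1}, e^{-\eta x_2})$ and the illustrative square loss $p \mapsto (p^2, (1-p)^2)$, neither of which is fixed by the text at this point. It tests the sign of $x'y'' - y'x''$ along the projected curve, which for this counterclockwise traversal is non-negative exactly where the projected superprediction set is locally convex.

```python
import math

def exp_projected_curve(eta, p):
    """Image of the square-loss point (p^2, (1-p)^2) under the eta-exponential projection."""
    return math.exp(-eta * p ** 2), math.exp(-eta * (1 - p) ** 2)

def turning(eta, p, h=1e-4):
    """Finite-difference value of x'y'' - y'x'', which has the sign of the signed curvature."""
    (xm, ym), (x0, y0), (xp, yp) = (exp_projected_curve(eta, t) for t in (p - h, p, p + h))
    x1, y1 = (xp - xm) / (2 * h), (yp - ym) / (2 * h)
    x2, y2 = (xp - 2 * x0 + xm) / h ** 2, (yp - 2 * y0 + ym) / h ** 2
    return x1 * y2 - y1 * x2

def looks_mixable(eta, tol=1e-6):
    # Pointwise test on a grid; a numerical indication only, not a proof.
    grid = [i / 200 for i in range(1, 200)]
    return all(turning(eta, p) >= -tol for p in grid)

print(looks_mixable(1.5), looks_mixable(3.0))
```

A direct computation for this loss gives $x'y'' - y'x'' = 4\eta^2\, x y\,(1 - 2\eta p(1-p))$, which is non-negative on all of $(0,1)$ precisely when $\eta \le 2$; the two sample values of $\eta$ lie on either side of that threshold.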
Let be strictly proper. First, we note that properness implies that the second fundamental forms of and can be compared in the following sense. Given , note that the normal vector to and can be chosen to be . A translation does not affect the geometric properties of (since it is an isometry of ), thus we consider the translated loss , given by
[TABLE]
i.e., we translate by the vector so that both and coincide when evaluated at . Doing so allows us to identify the tangent spaces to and at . We will call the translation of to .
Lemma 3.6**.**
Let be strictly proper. Let and denote the scalar second fundamental form of and (the log loss), respectively. Then, is -mixable at if and only if
[TABLE]
is positive semi-definite, where and denote the second fundamental forms of and in the graphical coordinates described in Lemma 3.3. And therefore, is -mixable if and only if \[email protected] holds for all .
Proof.
Let be an admissible proper loss
[TABLE]
The -exponential projection map is given by
[TABLE]
Let and write around as the graph of a function over , defined on an open set containing , such that . We can directly give a parametrization of around by
[TABLE]
We proceed to compute the second fundamental form of around (with respect to the inward pointing unit normal vector). The first and second derivatives of are given by
[TABLE]
and noting that the (inward pointing) unit vector field is given by
[TABLE]
Therefore, letting , the second fundamental form of at is given by
[TABLE]
Thus, since the convexity of is equivalent to the principal curvatures of being non-negative at for all (with respect to the inner pointing normal vector), we see this will be the case if and only if the matrix
[TABLE]
is positive semi-definite for all corresponding to .
Note that since we have a graphical parametrization of around , we have
[TABLE]
and by \[email protected],
[TABLE]
On the other hand, since the normal vector to at is , we have
[TABLE]
for such that .
By properness we know that
[TABLE]
thus
[TABLE]
and also
[TABLE]
Using \[email protected] and the previous observations, we can rewrite the terms of as
[TABLE]
and
[TABLE]
Now, consider the log loss and its translation to which we denote by to simplify the notation. That is, we have
[TABLE]
As discussed in Remark 3.4, we can write as a graph around (since ). The scalar second fundamental form of at is then given by
[TABLE]
This readily implies that
[TABLE]
Therefore, is -mixable at if and only if we have that
[TABLE]
is positive semi-definite. Since was arbitrary the result follows. ∎
Remark 3.7**.**
The previous comparison of second fundamental forms is possible because properness forces the induced metrics by and to coincide at , that is, (see Appendix A and Remark A.3). The conclusion of Lemma 3.6 does not necessarily hold if one takes a different coordinate system.
In order to get a geometric interpretation (i.e., independent of coordinates) we note the following:
[TABLE]
The matrices and are the local expression of the Weingarten map (see [Lee18] for its definition and properties) of and respectively. The eigenvalues of these matrices are the principal curvatures of and (and they are independent of coordinates), and the determinants are their Gaussian curvatures. From here it also follows that
[TABLE]
that is,
[TABLE]
where denotes the Weingarten map of the loss function . Then once a system of coordinates around is chosen the relation \[email protected] holds. A priori, the relation obtained between the Weingarten maps of and does not provide much information, but it does point us to look at the loss function . With this in mind Lemma 3.6 does give a direct geometric interpretation as follows. Let in be a proper loss. Given a point we know that around , can be parametrized with for some function around the point . Let . Consider now the proper loss , for some . We readily see that can be parametrized as with
[TABLE]
with defined around . Now we compute the second fundamental form of at . Notice that
[TABLE]
and hence,
[TABLE]
Then assuming the hypothesis of Lemma 3.6, we obtain
[TABLE]
The supporting planes at and of (or more precisely, of its translation to ) and , respectively, coincide (since the normal vectors are the same), we denote it by . By looking at and locally as graphs over , Lemma 3.6 gives the following comparison of graphs, which in turn can be regarded as local embeddability in the sense of convex geometry (see Definition 4.16 below).
Theorem 3.8**.**
* proper is -mixable if and only if for all the local graph of the translation of to over the supporting plane to both and at , , lies above the graph of over .*
Remark 3.9**.**
We would like to point out the resemblance of Lemma 3.6 to Theorem 10 in [vERW12]. To recover the latter from our point of view we will first reinterpret Lemma 3.6 and Theorem 3.8 from a convex geometry point of view which will lead to a transparent bridge between Lemma 3.6 and [vERW12, Theorem 10].
4. Connections to convex geometry
In this part we reinterpret our results from the point of view of convex geometry. With this interpretation we can relate Theorem 3.8 to results in [vERW12] and [WC22]. We first provide some background and state relevant results from convex geometry which are well-known and can be found in [Sch14] and can be adapted to our setting.
Let be a convex set, that is
[TABLE]
for all and .
We define the recession cone of as the set
[TABLE]
The boundary of is denoted by , and since we will assume that is a differentiable manifold, we denote the interior (as a manifold) of by . As usual the scaling of by and the Minkowski sum of and are defined as
[TABLE]
Definition 4.1**.**
Let be a closed convex set in . The support function of , , is defined as
[TABLE]
We sometimes denote it as .
From the definition we know that
[TABLE]
From [Sch14, Section 1.7] we have the following.
Lemma 4.2** (Properties of ).**
Let be closed convex sets.
- (1)
* if and only if .*
- (2)
* for all .*
- (3)
.
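These properties are easy to test on polygons. The sketch below is a compact stand-in for the unbounded sets considered in this section: it assumes the standard definition $h_K(u) = \sup_{x \in K} \langle x, u \rangle$ and the three standard facts from [Sch14, Section 1.7] (monotonicity under inclusion, positive homogeneity under scaling, additivity under Minkowski sums). The sets and helper names are hypothetical.

```python
import math

def support(points, u):
    """Support function of the convex hull of `points`, evaluated at the direction u."""
    return max(x * u[0] + y * u[1] for x, y in points)

def minkowski_sum(A, B):
    # The pairwise sums of vertices generate the Minkowski sum of the two convex hulls.
    return [(a[0] + b[0], a[1] + b[1]) for a in A for b in B]

def scale(alpha, A):
    return [(alpha * x, alpha * y) for x, y in A]

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (2, 0), (0, 2)]      # contains the unit square
directions = [(math.cos(k * math.pi / 8), math.sin(k * math.pi / 8)) for k in range(16)]

for u in directions:
    # Inclusion of sets corresponds to an inequality of support functions.
    assert support(square, u) <= support(triangle, u) + 1e-12
    # Scaling a set by alpha > 0 scales its support function by alpha.
    assert math.isclose(support(scale(3.0, square), u), 3.0 * support(square, u), abs_tol=1e-12)
    # The support function of a Minkowski sum is the sum of the support functions.
    assert math.isclose(support(minkowski_sum(square, triangle), u),
                        support(square, u) + support(triangle, u), abs_tol=1e-12)
```

Working with finitely many vertices suffices here because a linear functional attains its maximum over a polygon at a vertex.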
Definition 4.3**.**
A function is convex if its extension to given by
[TABLE]
is convex.
The following lemma is a well-known result (see [Sch14, Theorem 1.7.1] for example).
Lemma 4.4**.**
Let convex, closed and positively homogeneous, then is the support function of the convex, closed set
[TABLE]
Definition 4.5**.**
Let and closed and convex. We say that is a summand of if there exists a convex, closed set such that .
We will be mainly interested in sets whose recession cone is , hence we denote by the set of closed, convex sets whose recession cone is . In the following we extend some common results in convex geometry which are usually stated for closed, compact convex sets in (see [Sch14]), however, some of them are easily extended to [Shv01].
Lemma 4.6** (Basic properties of sets in ).**
Let and . Then, the following holds:
- (1)
,
- (2)
,
- (3)
* is closed, and*
- (4)
.
Proof.
In order to show (1), we need to show that is closed, convex and . Let and , then we have
[TABLE]
where and for some . Since is convex, then and hence is convex. Let be a convergent sequence that converges to . Then, there exists such that . Since is a constant, converges to (since is closed). By the uniqueness of the limit, . Now, let , we want to show that . Take any ,
[TABLE]
since . Conversely, if , then for any , we have
[TABLE]
then there exists , such that . Hence
[TABLE]
thus , thus .
To show (2), let . We want to show that . Let and , then
[TABLE]
since . Thus . Now, suppose that there is such that . Since is a cone, for all , we have . Let and . Then
[TABLE]
Thus for all , but notice that this is a contradiction since by picking sufficiently large, . Thus .
For (3), see Rockafellar [Roc70] Thm. 8.2 and [Shv01] Thm. 3.1. (4) is simply the combination of (2) and (3) (and the fact that is convex). ∎
We now specialize the discussion to a particular type of sets . First, suppose that the boundary is of class , then at each point there is an outward pointing normal vector . Thus, clearly, we can define a map assigning to . We define
[TABLE]
so that
[TABLE]
Definition 4.7**.**
Define as the collection of sets with boundary of class , and such that the map is a -diffeomorphism from to .
We now specialize some properties of the support function to .
Lemma 4.8**.**
If , then .
Proof.
Take in , then it must be an outward normal vector to , hence it is in . Then . Now, for , normalize it to make it unitary by letting , then and thus it must be a normal vector for some , hence the support function evaluated at is finite, and in consequence is finite too. ∎
Remark 4.9**.**
Following Schneider [Sch14, Section 2.5] the condition is equivalent to assuming the principal curvatures of to be non-zero. It also follows that
[TABLE]
and moreover, is of class .
Remark 4.10**.**
Let be a proper loss function. By definition we see that Remark 4.9 implies (since ).
Definition 4.11**.**
Let . We say that slides freely inside if to each boundary point of , there exists a translation vector , such that .
Theorem 4.12**.**
Let . is a summand of , then slides freely inside .
Proof.
Suppose that there exists such that . Let . Then there are and such that
[TABLE]
Thus, . ∎
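A compact toy example may help fix the idea of sliding freely. The sketch below takes two concentric discs of radii $r \le R$ in the plane, which are not sets of the class considered here but satisfy the analogous summand relation (the larger disc is the Minkowski sum of the smaller one and a disc of radius $R - r$). For each sampled boundary point $x$ of the larger disc it tests the candidate translation $t = (1 - r/R)\,x$; the function name and the choice of candidate are assumptions of this illustration.

```python
import math

def slides_freely_discs(r, R, samples=360, tol=1e-12):
    """Check the sliding condition for discs of radius r and R centered at the origin.

    Only the natural candidate translation is tested, so a False result means
    that this candidate fails, not that no translation exists.
    """
    for k in range(samples):
        theta = 2 * math.pi * k / samples
        x = (R * math.cos(theta), R * math.sin(theta))        # boundary point of the large disc
        t = ((1 - r / R) * x[0], (1 - r / R) * x[1])           # candidate translation vector
        touches = abs(math.hypot(x[0] - t[0], x[1] - t[1]) - r) <= tol   # x lies on the translated small disc
        contained = math.hypot(t[0], t[1]) + r <= R + tol      # translated small disc stays inside
        if not (touches and contained):
            return False
    return True

print(slides_freely_discs(1.0, 3.0))
```

For $r \le R$ the translated disc touches the boundary at $x$ and has its farthest point at distance exactly $R$ from the origin, so the check succeeds at every sampled point.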
Remark 4.13**.**
For a general convex set , if is a summand of we see that the previous proof holds and we conclude that slides freely inside ; note however that this imposes restrictions on possible sets . One of these consequences is that the principal curvatures of must be positive as can be seen from a second fundamental form comparison and Theorem 3.8.
Lemma 4.14**.**
Let and suppose that is convex. Then the set
[TABLE]
is in , and it is such that , that is, and are summands of .
Proof.
From Lemma 4.8, the domain of is , i.e., is convex. Thus it is the support function of (by Lemma 4.4). That is, .
Therefore we have , and hence . Note that is a summand of , then using Theorem 4.12 we know that slides freely inside , and since has positive principal curvatures then does too (Remark 4.13). Since is of class , then has to be in . ∎
Theorem 4.15**.**
([Sch14, Theorem 1.5.2]) Let convex and let be a continuous function. Suppose that for each point there are an affine function on and a neighborhood of such that and in . Then is convex.
Definition 4.16**.**
We say that is locally embeddable in if for all , there is a and a neighborhood of , such that .
Theorem 4.17**.**
Let and strictly convex. If is locally embeddable in , then is a summand of .
Proof.
Let and be a point such that . Since is locally embeddable in there are and a neighborhood of such that . Since is continuous, there exists a neighborhood of such that . Then it follows that and for all by Lemma 4.2.
Let (this is defined on and is positively homogeneous), and . Then, clearly, we have
- (i)
, since
[TABLE]
- (ii)
on ,
[TABLE]
It follows by Theorem 4.15 that is convex, and by Lemma 4.14 we conclude that is a summand of . ∎
The following lemma is a direct consequence of the characterization of mixability in Theorem 3.8 and Definition 4.16.
Lemma 4.18**.**
Let be a proper loss. For , if is -mixable then is locally embeddable in .
Lemma 4.19**.**
If is -mixable then slides freely inside .
Proof.
Let be -mixable, then Lemma 4.18 implies it is locally embeddable in . Then Theorem 4.17 implies it is a summand and Theorem 4.12 implies it slides freely inside . ∎
Corollary 4.20**.**
Let be a -mixable proper loss. Then and it slides freely inside ( is the log loss). Additionally, there exists such that
[TABLE]
Moreover, can be regarded as for a 1-mixable proper loss .
Proof.
Since is an -mixable proper loss function, is also a proper loss function and hence (Remark 4.10). Theorem 3.8 implies that is locally embeddable in . From Theorem 4.17 we know that is a summand of , which proves the existence of . As a consequence, is a convex set with recession cone (Lemma 4.14). By applying [WC22, Proposition 21] we can regard as the image of a proper loss function , which since is a summand of it is 1-mixable (Lemma 4.14). ∎
We now state [Sch14, Theorem 2.5.4] adapted to our setting which will be helpful to relate our work to [vERW12].
Theorem 4.21**.**
Let . Let denote the second fundamental form of at with respect to (see \[email protected]). The following are equivalent:
- (i) for all pairs of points and at which .
- (ii) is a support function.
Since is an affine manifold, the geodesics in are simply straight lines. This allows us to define convexity of functions defined on in the usual way, as for functions on . The following theorem connects and reconciles our results with those in [vERW12]. More precisely, we create a bridge between our results and [vERW12, Theorem 10].
Theorem 4.22**.**
Let be a proper loss. Let , then is -mixable if and only if is convex on , where denotes the Bayes risk of the loss function (Definition 1.3) and denotes the log loss.
Proof.
Suppose that is a proper loss in which is -mixable. By Lemma 4.19 slides freely inside and in particular . By Theorem 4.21 it follows that is a support function with domain , in particular it is convex on its interior. Let , such that the outward normal vector of and at and , respectively, is . Then we have for ,
[TABLE]
which proves the claim. ∎
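In the binary case the convexity criterion of Theorem 4.22 can be probed numerically. The following is a minimal sketch and not part of the original argument. It assumes that the condition reads "η·L_ℓ − L_log is convex", as in [vERW12, Theorem 10], takes the squared loss ℓ(y, q) = (y − q)² as a test case, and approximates convexity by nonnegative second differences on a uniform grid. All function names are hypothetical.

```python
import math


def bayes_risk_log(p):
    """Bayes risk of the binary log loss (Shannon entropy in nats)."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bayes_risk_squared(p):
    """Bayes risk of the binary squared loss l(y, q) = (y - q)^2."""
    return p * (1.0 - p)


def mixability_gap(eta):
    """Return the function p -> eta * L_squared(p) - L_log(p)."""
    return lambda p: eta * bayes_risk_squared(p) - bayes_risk_log(p)


def is_convex_on_grid(f, lo=0.01, hi=0.99, n=199, tol=1e-9):
    """Grid test for convexity: all second differences must be >= -tol."""
    h = (hi - lo) / (n - 1)
    xs = [lo + i * h for i in range(n)]
    return all(
        f(xs[i - 1]) - 2.0 * f(xs[i]) + f(xs[i + 1]) >= -tol
        for i in range(1, n - 1)
    )


if __name__ == "__main__":
    for eta in (1.0, 2.0, 2.5):
        print(eta, is_convex_on_grid(mixability_gap(eta)))
```

For the squared loss the second derivative of the gap is −2η + 1/(p(1−p)), which is nonnegative on (0, 1) exactly when η ≤ 2, so the grid test is expected to pass for η = 2 and to fail for η = 2.5.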
Suppose now that for given proper, there exists a such that slides freely inside . Note that in particular this implies that is locally embeddable in , and hence for each we have
[TABLE]
which by \[email protected] and Lemma 3.6 implies that is -mixable. Thus combining this with Lemma 4.19 we obtain the following characterization of mixability of proper (sufficiently differentiable) loss functions.
Theorem 4.23**.**
Let be proper. is -mixable if and only if slides freely inside , where denotes the log loss.
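For binary losses, the sliding condition in Theorem 4.23 reduces to the curvature comparison of Section 2: at points with the same normal, the scaled loss curve should be at least as curved as the log loss curve. The sketch below illustrates this for the squared loss under assumptions that are ours and not the paper's: the loss curves are parametrized by p as (loss if y = 1, loss if y = 0), both curves have normal proportional to (p, 1 − p) at parameter p because both losses are proper, and absolute values of curvature are compared to sidestep sign conventions. All names are hypothetical.

```python
def curvature(dx, dy, ddx, ddy):
    """Unsigned curvature of a regular planar curve from its derivatives."""
    return abs(dx * ddy - dy * ddx) / (dx * dx + dy * dy) ** 1.5


def log_loss_curvature(p):
    """Curvature of p -> (-log p, -log(1 - p)) at parameter p."""
    return curvature(-1.0 / p, 1.0 / (1.0 - p), 1.0 / p**2, 1.0 / (1.0 - p) ** 2)


def scaled_squared_loss_curvature(p, eta):
    """Curvature of p -> eta * ((1 - p)^2, p^2) at parameter p."""
    return curvature(-2.0 * eta * (1.0 - p), 2.0 * eta * p, 2.0 * eta, 2.0 * eta)


def passes_curvature_comparison(eta, n=99, tol=1e-12):
    """Check the comparison on the grid p = 1/(n+1), ..., n/(n+1)."""
    grid = [i / (n + 1) for i in range(1, n + 1)]
    return all(
        scaled_squared_loss_curvature(p, eta) >= log_loss_curvature(p) - tol
        for p in grid
    )


if __name__ == "__main__":
    for eta in (1.0, 2.0, 2.5):
        print(eta, passes_curvature_comparison(eta))
```

In closed form the comparison reads 1/(2η) ≥ p(1 − p), so it holds for all p exactly when η ≤ 2, with equality at p = 1/2 when η = 2.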
In general, the set provides a family of loss functions with appealing properties. Arguably, one of the most important properties is that given , if we assume that is proper then we know its principal curvatures are strictly positive. This is a strong and useful geometric feature. For example, in [WC22] the notion of an “inverse loss” called the anti-polar loss was investigated. Given a proper loss (in the sense of [WC22], which are not necessarily smooth), they consider the 0-homogeneous extension of (see Remark 26 in [WC22]), defined on and given by
[TABLE]
where . For the following we simply denote by . In [WC22, Proposition 29] it is shown that there exists a map such that
[TABLE]
for all . The map is called the anti-polar loss of . For the family of admissible loss functions considered in this work, we exploit the differentiability conditions to obtain in a straightforward way an inverse loss defined on . To see this, suppose that is proper. Since this is equivalent to saying that is in , meaning that the map is a diffeomorphism. Then we can define the map by
[TABLE]
which is the inverse of the map . Recall that is nothing else than the unit normal vector (pointing towards ) at .
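The idea that the inverse loss is given by the normal vector can be illustrated concretely. The following sketch is an illustration under our own assumptions, not the construction of the paper: it uses the binary squared loss, whose loss curve {((1 − p)², p²)} is the level set √x₁ + √x₂ = 1, computes the normal direction as the gradient of that level-set function, and normalizes it to sum to one (an assumed normalization) to recover p. Function names are hypothetical.

```python
import math


def squared_loss_point(p):
    """Point of the binary squared-loss curve: (loss if y = 1, loss if y = 0)."""
    return ((1.0 - p) ** 2, p**2)


def inverse_loss_from_normal(x):
    """Recover p from a point x on the squared-loss curve via its normal.

    The curve is the level set sqrt(x1) + sqrt(x2) = 1, so a normal vector
    at x is the gradient (1 / (2 sqrt(x1)), 1 / (2 sqrt(x2))). Rescaling the
    normal so that its entries sum to one gives (p, 1 - p).
    """
    n1 = 0.5 / math.sqrt(x[0])
    n2 = 0.5 / math.sqrt(x[1])
    return n1 / (n1 + n2)


if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.9):
        print(p, inverse_loss_from_normal(squared_loss_point(p)))
```

The round trip p → ℓ(p) → p works here because the expected loss p·x₁ + (1 − p)·x₂ is minimized over the curve at ℓ(p), which forces the normal there to be proportional to (p, 1 − p).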
It is of interest to find parametrizations (or links) that simplify the expression of a given proper loss . At a theoretical level there are potentially many ways to do this. Notably we have at hand the notion of canonical link in [WVR16] (or see Section 2.7 above for ). As an example of other ways to obtain nice links we have Lemma 3.3 above, which gives a nice expression in coordinates (in the form of a graph) of . Unfortunately, to obtain that result one makes use of the inverse function theorem, which guarantees only the existence of an inverse and does not provide it explicitly.
5. Conclusions
We summarize the main messages of this work.
- Since mixable loss functions are of great importance in prediction games, it is desirable to understand them from different perspectives. Inspired by the work of Vovk [Vov15], in Section 2 we studied binary loss functions from the point of view of differential geometry, hence restricting to loss functions in (Definition 2.1). To do this, we re-interpret properness as a geometric property, namely, a loss function is proper if and only if
  - the normal vector (belonging to ) to at is , for any , and
  - the loss curve has positive curvature (with respect to ).

  Having this framework at hand, we characterized mixability and fundamentality of a proper loss , as a curvature comparison to the log loss (cf. [Vov15]).
- In Section 3, we extended the geometric characterization of proper loss functions to higher dimensions, and obtained the corresponding interpretation of mixability as a geometric comparison (now in terms of the principal curvatures of the “loss surface”). This comparison is done by using the second fundamental forms of the “loss surfaces”.
- The main goal of Section 4 is to re-interpret the geometric results in Section 3 from the point of view of convex geometry. The main result in this part is a new characterization of -mixability of a proper loss function , as sliding freely inside (in general dimension). This provides an intuitive and geometric way to interpret mixability.
- Since the results obtained in this work are in terms of curvature, it was necessary to re-interpret well-known properties of loss functions in the language of differential geometry. Although this task might seem tedious at first, it is well worth it since it reconciles the results obtained by Vovk [Vov15] for and by van Erven, Reid and Williamson [vERW12] for .
- It is worth pointing out the relation of this work to [vERW12]. Specifically, the bridge between these two works established by Theorem 4.22 connects our results to Theorem 10 in [vERW12] in the following way. In [vERW12, Theorem 10] the following statements are proven to be equivalent:
  - (i) a proper loss is -mixable,
  - (ii) is positive semi-definite for all , where denotes the Hessian of at ,
  - (iii) is convex on , and
  - (iv) is convex on .

  There, they first proved the equivalence of (i) and (ii), which is the result of a long direct computation done very carefully. The equivalence between (iii) and (iv) is straightforward. To connect these two sets of equivalences, standard convex geometry is used to prove the equivalence of (ii) and (iii). Note that the statements (ii) and (iv) make reference to a precise choice of parametrization of (i.e., the standard parametrization ). Therefore, the work presented here is naturally not related to these statements but rather to (i) and (iii), whose equivalence can be considered to be the content of Sections 3 and 4. Whether this new approach simplifies the computations in [vERW12] depends strongly on the reader's background in differential geometry and convex geometry. This work should be considered as complementing the understanding of mixable loss functions and providing a new geometric insight into them.
Appendix A Differential Geometry
In this part we provide a brief summary of the concepts of differential geometry that are used in this work (we assume the reader has some familiarity with the topic although we try to put emphasis on the intuition). We do not intend to give a comprehensive introduction to the topic. Most of the material can be found in almost any differential geometry book; we recommend (and when possible use the notation of) [dC16] and [Lee18].
A.1. Curvature of Curves
A parametrized curve is a differentiable map , (). We are interested in studying the geometry of parametrized curves. For this it is useful to restrict our discussion to curves with a well-defined tangent line at every point for (i.e., with non-vanishing ). These curves are called regular. Let be a diffeomorphism; the curve is then a reparametrization of . Note that in this case . The image is a 1-dimensional differentiable manifold in (for this it is essential to restrict to regular curves). The study of curves is of particular importance since some aspects carry over to the study of the geometry of general hypersurfaces in .
Typically, curvature is defined for curves parametrized by arc-length, meaning that for all (and a regular curve can always be parametrized this way). For these types of curves, the curvature of at is defined as the length of , which measures “how much” a curve “curves”. However, this notion does not give information about the direction in which a curve is “curving”. We start by looking at the case . We define the signed curvature of a general curve by (cf. \[email protected])
[TABLE]
It can be checked that coincides with the curvature of when parametrized by arc-length (at the corresponding point). The signed curvature is well defined up to a sign: the sign changes if we consider a reparametrization that reverses the order of (for example, a curve defined on given by ), which motivates the discussion in Section 1.5.
For example, suppose that a planar curve is defined by a function in the following way:
[TABLE]
for . A quick computation gives
[TABLE]
Given a regular curve as above and a real number , it is straightforward to see that the curve is also a regular curve and its signed curvature is given by
[TABLE]
The notion of signed curvature can be extended to curves in manifolds sitting inside (see for example [Lee18, Chapter 8]). For parametrized by arc-length, the signed curvature (with respect to ) of at is given by . It can be shown that this definition agrees with the one we gave for .
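The properties of the signed curvature discussed above are easy to check numerically. The sketch below is an illustration only and assumes the standard planar formula κ = (x′y″ − y′x″) / (x′² + y′²)^{3/2}, which may differ from the convention of the paper by a sign. It checks the value 1/r on a circle of radius r, the sign flip under an orientation-reversing reparametrization, the graph formula f″ / (1 + f′²)^{3/2}, and the scaling κ/c for a positive factor c. All function names are hypothetical.

```python
import math


def signed_curvature(d1, d2):
    """Signed curvature from first derivative d1 and second derivative d2."""
    (dx, dy), (ddx, ddy) = d1, d2
    speed_sq = dx * dx + dy * dy
    if speed_sq == 0.0:
        raise ValueError("curve is not regular at this point")
    return (dx * ddy - dy * ddx) / speed_sq**1.5


def circle_derivatives(r, t, orientation=1):
    """Derivatives of t -> (r cos(s t), r sin(s t)) with s = orientation = +-1."""
    s = orientation
    d1 = (-s * r * math.sin(s * t), s * r * math.cos(s * t))
    d2 = (-r * math.cos(s * t), -r * math.sin(s * t))  # uses s * s == 1
    return d1, d2


def graph_derivatives(df, ddf, t):
    """Derivatives of the graph curve t -> (t, f(t)), given f' and f''."""
    return (1.0, df(t)), (0.0, ddf(t))


def scale(derivs, c):
    """Derivatives of the scaled curve c * gamma."""
    d1, d2 = derivs
    return (c * d1[0], c * d1[1]), (c * d2[0], c * d2[1])


if __name__ == "__main__":
    print(signed_curvature(*circle_derivatives(2.0, 0.7)))      # about 1/2
    print(signed_curvature(*circle_derivatives(2.0, 0.7, -1)))  # about -1/2
```

Under this convention a counterclockwise circle has positive curvature and its reversal has negative curvature, which is the sign ambiguity mentioned above.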
A.2. Geometry of hypersurfaces in
Let be a differentiable hypersurface inside of class (i.e., a -dimensional manifold). By this we mean that for each there is an open set and an injective map (called a parametrization of around ). For each , forms a basis for the tangent space () to at . Since we can consider the induced metric on by the Euclidean metric in (denoted by ). This is a Riemannian metric on given on the coordinates given by by the matrix
[TABLE]
for . The metric allows us to define the length of curves in .
In general, if a manifold of dimension is sitting inside an -dimensional Riemannian manifold (and is endowed with the induced metric from ) the second fundamental form carries the information on how is “curved” inside . Let be the metric on and the induced metric on by . Let denote the Levi–Civita connection of . Let be a smooth unit normal vector field to (that is is perpendicular to for each ). The scalar second fundamental form of with respect to is the covariant 2-tensor on defined as
[TABLE]
for tangent vectors to . Note that for a hypersurface, at each point we have exactly two unit normal vectors to at , thus the scalar second fundamental form is well-defined up to a sign. Fixing a point and an orthonormal basis for the tangent space at , the eigenvalues of the matrix given by for are called the principal curvatures of at and the corresponding eigenspaces are called the principal directions. For details of the above see Chapter 8 in [Lee18].
When and is parametrized by , with respect to the local frame of , the scalar second fundamental form with respect to a normal unit vector field is given by ([Lee18, Proposition 8.23])
[TABLE]
for .
Given any and , there is a geodesic of passing through with velocity at . Let and be two hypersurfaces in tangent at a point . Choose a normal vector and suppose that lies above (with respect to ). We have the following lemma from [Lee18].
With the previous lemma we can obtain a comparison result for manifolds with positive principal curvatures.
Lemma A.1**.**
Suppose that and are tangent at and fix a normal vector at . Suppose that and have positive principal curvatures at . Then for all if and only if lies above (with respect to ) locally around .
Proof.
First we make the following observation. Suppose that is a smooth hypersurface in and we have a regular curve such that and for some and . Then, letting denote the second fundamental form of from \[email protected] we have
[TABLE]
Thus, if is parametrized by arc-length, .
Suppose lies above are tangent at and let with . Then we can intersect and with the plane generated by and . Then we obtain two curves and on and , respectively, such that and for . Moreover, we can assume that these curves are parametrized by arc-length so their Euclidean curvature is given by . Since we can regard these curves as planar curves, there are functions and such that the curves and are represented in the plane by the curves
[TABLE]
with , , (since and have positive principal curvatures at ) for . By construction and by definition , for .
If lies above at , then and hence , which is equivalent to for any with . Let be arbitrary, then
[TABLE]
as claimed.
Conversely if \[email protected] holds, then we see that in particular holds for unitary , which ultimately means that for all unitary . This implies that lies above . ∎
We present the following instructive example.
Example A.2**.**
Consider the differentiable function with , and let . We choose the parametrization of and compute the scalar second fundamental form of at in these coordinates. We have
[TABLE]
thus from \[email protected] at the point , the scalar second fundamental form of with respect to is given by
[TABLE]
and in particular for we have
[TABLE]
Thus, clearly we have
[TABLE]
which is positive definite if and only if (when lies inside and are tangent at ).
Remark A.3**.**
We stress a technical observation. The comparison \[email protected] in Example A.2 is valid since regardless of the value of , and are the same, meaning that we can identify the tangent spaces to and at for all , and the basis for them is given by . In general this is not necessarily the case so one should perform a change of basis before comparing the second fundamental forms.
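A comparison of second fundamental forms in the spirit of Example A.2 can be reproduced numerically. The sketch below is only an illustration under our own assumptions: it uses two hypothetical paraboloids z = (c/2)(x² + y²), which are not necessarily the functions of Example A.2, the graph parametrization (u, v) ↦ (u, v, f(u, v)), the upward unit normal, and the standard formula h_ij = f_ij / sqrt(1 + |∇f|²) evaluated by central differences. Both graphs are tangent at the origin and share the same coordinate basis there, so the matrices can be compared directly, as noted in Remark A.3.

```python
import math


def second_fundamental_form_of_graph(f, x, y, h=1e-4):
    """Matrix of the scalar second fundamental form of z = f(x, y).

    Computed with respect to the upward unit normal in the frame of the
    graph parametrization, using central finite differences.
    """
    fx = (f(x + h, y) - f(x - h, y)) / (2.0 * h)
    fy = (f(x, y + h) - f(x, y - h)) / (2.0 * h)
    fxx = (f(x + h, y) - 2.0 * f(x, y) + f(x - h, y)) / h**2
    fyy = (f(x, y + h) - 2.0 * f(x, y) + f(x, y - h)) / h**2
    fxy = (
        f(x + h, y + h) - f(x + h, y - h) - f(x - h, y + h) + f(x - h, y - h)
    ) / (4.0 * h**2)
    w = math.sqrt(1.0 + fx * fx + fy * fy)
    return [[fxx / w, fxy / w], [fxy / w, fyy / w]]


def difference(a, b):
    """Entrywise difference a - b of two 2x2 matrices."""
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


def is_positive_definite_2x2(m, tol=1e-6):
    """Sylvester criterion for a symmetric 2x2 matrix."""
    return m[0][0] > tol and m[0][0] * m[1][1] - m[0][1] * m[1][0] > tol


def paraboloid(c):
    """Hypothetical test surface z = (c / 2) * (x^2 + y^2)."""
    return lambda x, y: 0.5 * c * (x * x + y * y)


if __name__ == "__main__":
    h_inner = second_fundamental_form_of_graph(paraboloid(3.0), 0.0, 0.0)
    h_outer = second_fundamental_form_of_graph(paraboloid(1.0), 0.0, 0.0)
    print(is_positive_definite_2x2(difference(h_inner, h_outer)))
```

At the origin each paraboloid has second fundamental form c·I, so the difference is positive definite exactly when the first paraboloid is the more curved one, that is, when its graph lies above the other near the point of tangency.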
- [BSS05] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, 2005.
- [dC16] Manfredo P. do Carmo. Differential geometry of curves & surfaces. Dover Publications, Inc., Mineola, NY, 2016. Revised & updated second edition of [MR0394451].
- [HKW95] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Tight worst-case loss bounds for predicting with expert advice. In Computational learning theory (Barcelona, 1995), volume 904 of Lecture Notes in Comput. Sci., pages 69–83. Springer, Berlin, 1995.
- [HKW98] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
- [Lee18] John M. Lee. Introduction to Riemannian manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2018. Second edition of [MR1468735].
- [MW18] Zakaria Mhammedi and Robert C. Williamson. Constant regret, generalized mixability, and mirror descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- [RFWM15] Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson, and Nishant Mehta. Generalized mixability via entropic duality. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1501–1522, Paris, France, 03–06 Jul 2015. PMLR.
- [Roc70] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
