The Geometry of Mixability

Armando J. Cabrera Pacheco; Robert C. Williamson

arXiv:2302.11905·cs.LG·February 24, 2023

TL;DR

This paper offers a geometric perspective on mixable loss functions, characterizing their properties through differential geometry and superprediction sets, which unifies binary and multi-class cases.

Contribution

It introduces a geometric characterization of mixability for proper loss functions using superprediction sets, providing a coordinate-free framework that unifies binary and multi-class scenarios.

Findings

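01 A proper loss $\ell$ is $\eta$-mixable if and only if the superprediction set of the scaled loss $\eta\ell$ slides freely inside the superprediction set of the log loss.

02 The geometric approach applies under fairly general differentiability assumptions on the loss.

03 The approach reconciles previous results on mixable losses for the binary and multi-class cases.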

Equations

\Delta^{n}=\left\{(p_{1},\dots,p_{n})\in\mathbb{R}^{n}\,\bigg|\,\sum_{i=1}^{n}p_{i}=1\right\}.

\Phi_{\textnormal{std}}(t_{1},\dots,t_{n-1})=\Bigl(t_{1},\dots,t_{n-1},\,1-\sum_{i=1}^{n-1}t_{i}\Bigr).

\ell(p,k)=\sum_{i=1}^{n}[\![k=i]\!]\,\ell_{i}(p).

\ell(p)=(\ell_{1}(p),\dots,\ell_{n}(p)).

L(p,q):=\langle\ell(q),p\rangle,

\underline{L}(p):=\inf_{q\in\Delta^{n}}L(p,q)=\inf_{q\in\Delta^{n}}\langle\ell(q),p\rangle.

\langle\ell(p),p\rangle\leq\langle\ell(q),p\rangle

E_{\eta}(y):=(e^{-\eta y_{1}},\dots,e^{-\eta y_{n}}).

\textnormal{spr}(\ell):=\{\lambda\in[0,\infty)^{n}\mid\text{there is }q\in\Delta^{n}\text{ such that }\ell_{i}(q)\leq\lambda_{i}\text{ for }i\in[n]\},

\eta^{*}_{\ell}:=\sup_{\eta>0}\{\eta>0\mid\ell\text{ is }\eta\text{-mixable}\}.

\kappa_{\alpha}(t):=\frac{x_{1}^{\prime}(t)\,x_{2}^{\prime\prime}(t)-x_{1}^{\prime\prime}(t)\,x_{2}^{\prime}(t)}{\left(x_{1}^{\prime}(t)^{2}+x_{2}^{\prime}(t)^{2}\right)^{3/2}}.

\widetilde{L}(t,s)=\langle\ell(\Phi(s)),\Phi(t)\rangle.

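0=\partial_{s}\widetilde{L}(t,s)\big|_{s=t}=\langle\widetilde{\ell}^{\prime}(t),\Phi(t)\rangle,\qquad 0<\partial_{s}^{2}\widetilde{L}(t,s)\big|_{s=t}=\langle\widetilde{\ell}^{\prime\prime}(t),\Phi(t)\rangle.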

\widetilde{\ell}_{1}^{\prime}(t)\widetilde{\ell}_{2}^{\prime\prime}(t)-\widetilde{\ell}_{1}^{\prime\prime}(t)\widetilde{\ell}_{2}^{\prime}(t)=\frac{\widetilde{\ell}_{1}^{\prime}(t)}{\Phi_{2}(t)}\left[\widetilde{\ell}_{2}^{\prime\prime}(t)\Phi_{2}(t)+\widetilde{\ell}_{1}^{\prime\prime}(t)\Phi_{1}(t)\right]=\frac{\widetilde{\ell}_{1}^{\prime}(t)}{\Phi_{2}(t)}\underbrace{\langle\widetilde{\ell}^{\prime\prime}(t),\Phi(t)\rangle}_{>0},

\langle\ell(p^{*}),p\rangle=\inf_{q\in\Delta^{2}}\langle\ell(q),p\rangle.

\ell(p)=(-\ln(p_{2}),-\ln(p_{1})).

\mathbf{n}_{p}

\kappa_{\ell}^{+}

\partial_{s}\widetilde{L}(t,s)\big|_{s=t}=0,

E_{\eta}(y_{1},y_{2})=(e^{-\eta y_{1}},e^{-\eta y_{2}}),

\ell_{\log}(p)=(-\ln(p_{1}),-\ln(p_{2})).

\ell_{\log}(t)=(-\ln(t),-\ln(1-t)).

\mathbf{n}_{\ell_{\log}(t)}=-\frac{1}{t^{2}+(1-t)^{2}}\left((1-t)^{-1},t^{-1}\right).

\kappa_{\ell_{\log}}^{+}=-\kappa_{\ell_{\log}}=\frac{t(1-t)}{\left(t^{2}+(1-t)^{2}\right)^{3/2}}>0.

g(t)=(g_{1}(t),g_{2}(t))=(E(\ell_{1}(t)),E(\ell_{2}(t)))=(e^{-\eta\ell_{1}(t)},e^{-\eta\ell_{2}(t)}),

g_{1}^{\prime}(t)=-\eta\,\ell_{1}^{\prime}(t)\,e^{-\eta\ell_{1}(t)},

g_{1}^{\prime\prime}(t)=\eta e^{-\eta\ell_{1}(t)}\left[\eta\,\ell_{1}^{\prime}(t)^{2}-\ell_{1}^{\prime\prime}(t)\right],

g_{2}^{\prime}(t)=-\eta\,\ell_{2}^{\prime}(t)\,e^{-\eta\ell_{2}(t)},

g_{2}^{\prime\prime}(t)=\eta e^{-\eta\ell_{2}(t)}\left[\eta\,\ell_{2}^{\prime}(t)^{2}-\ell_{2}^{\prime\prime}(t)\right],

Full text

Armando J. Cabrera Pacheco and Robert C. Williamson

Universität Tübingen, Tübingen AI Center

Abstract.

Mixable loss functions are of fundamental importance in the context of prediction with expert advice in the online setting since they characterize fast learning rates. By re-interpreting properness from the point of view of differential geometry, we provide a simple geometric characterization of mixability for the binary and multi-class cases: a proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set $\textnormal{spr}(\eta\ell)$ of the scaled loss function $\eta\ell$ slides freely inside the superprediction set $\textnormal{spr}(\ell_{\log})$ of the log loss $\ell_{\log}$, under fairly general assumptions on the differentiability of $\ell$. Our approach provides a way to treat some concepts concerning loss functions (like properness) in a “coordinate-free” manner and reconciles previous results obtained for mixable loss functions for the binary and the multi-class cases.

1. Introduction

In the context of prediction with expert advice as described by Vovk in [Vov98] and [Vov01], an information game is considered between three players: the learner, $n\in\mathbb{N}$ experts and nature. At each step $t\in\mathbb{N}$ ,

• each expert makes a prediction, which the learner is allowed to see,

• the learner makes a prediction,

• nature chooses an outcome,

• for a fixed loss function $\ell$, the cumulative loss is calculated for the learner and each of the experts.

The goal is to minimize the difference between the learner’s loss and the best expert’s loss, which is often called the regret.

1.1. Mixable games and characterizations of mixable and fundamental loss functions

For a wide class of games, called $\eta$-mixable games for $\eta>0$, the Aggregating Algorithm (see for example [Vov01]) ensures an optimal bound for the regret ($\eta^{-1}\ln n$) that is independent of the trial $t$. Since the mixability of a game depends on the loss function $\ell$, we say that a loss function $\ell$ is $\eta$-mixable if the corresponding game is $\eta$-mixable. Since the Aggregating Algorithm is arguably one of the most well-founded and well-studied prediction algorithms, there is a natural interest in understanding properties and characterizations of mixable loss functions.
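The regret bound above can be checked numerically in the simplest case. The following is a minimal sketch, not the construction of [Vov01]: it assumes binary outcomes, the log loss, and $\eta=1$, in which case the Aggregating Algorithm's prediction reduces to the weighted mean of the experts' predictions with weights proportional to $e^{-\text{cumulative loss}}$. The function names and the random experts are illustrative assumptions.

```python
import math
import random


def log_loss(pred, outcome):
    """Log loss of predicting P(outcome = 1) = pred when `outcome` (0 or 1) occurs."""
    return -math.log(pred if outcome == 1 else 1.0 - pred)


def aggregating_algorithm_log_loss(expert_preds, outcomes):
    """Run the mixture forecaster (eta = 1); return (learner_loss, expert_losses).

    expert_preds[t][i] is expert i's predicted probability of outcome 1 at step t.
    """
    n = len(expert_preds[0])
    cum = [0.0] * n  # cumulative loss of each expert
    learner = 0.0
    for preds, y in zip(expert_preds, outcomes):
        # Weights proportional to exp(-cumulative loss); shift by the minimum
        # for numerical stability (the shift cancels after normalization).
        best = min(cum)
        weights = [math.exp(-(c - best)) for c in cum]
        total = sum(weights)
        # For log loss with eta = 1 the prediction is the weighted mean.
        p = sum(w * q for w, q in zip(weights, preds)) / total
        learner += log_loss(p, y)
        for i, q in enumerate(preds):
            cum[i] += log_loss(q, y)
    return learner, cum


random.seed(0)
T, n = 200, 5
expert_preds = [[random.uniform(0.05, 0.95) for _ in range(n)] for _ in range(T)]
outcomes = [random.randint(0, 1) for _ in range(T)]
learner_loss, expert_losses = aggregating_algorithm_log_loss(expert_preds, outcomes)
regret = learner_loss - min(expert_losses)
print(f"regret = {regret:.4f}, bound ln(n) = {math.log(n):.4f}")
```

In this special case the learner's cumulative loss equals $-\ln\bigl(\tfrac{1}{n}\sum_{i}e^{-L_{i}}\bigr)$, where $L_{i}$ is the cumulative loss of expert $i$, so the regret never exceeds $\ln n$ regardless of the number of trials.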

Examples of mixable loss functions include the log loss, relative entropy for binary outcomes [HKW98] and the Brier score [VZ09, vERW12]. Mixability of a loss function $\ell$ is characterized by a “stronger convexity” of the superprediction set of $\ell$, which can be described as the convexity of the superprediction set of $\ell$ after an “exponential projection” (see the definition of $E_{\eta}$ in Section 1.3 below and [Vov15] and [vERW12]). Unfortunately, this characterization of mixability lacks a transparent geometric interpretation.

The main goal of this work is to provide such a geometric interpretation. The motivation stems from an observation made by Vovk in [Vov15]: for binary outcomes, mixability of a strictly proper loss function $\ell$ can be characterized by the positivity of the infimum of the quotient of the curvatures of $\ell$ and the log loss $\ell_{\log}$. Here, as usual, loss functions are defined on the 2-simplex $\Delta^{2}$ (see Section 1.3). Moreover, he then proves that fundamentality (see Vovk [Vov15]) of a loss can be characterized by the finiteness of the supremum of the same quotient of curvatures. These two results suggest that these properties are geometric, meaning that they can be studied using differential geometry tools, and in this regard, mixability and fundamentality should not depend on the coordinates chosen to express them.

Loosely speaking, in convex geometry a convex set $L$ is said to slide freely inside a convex set $K$ if for any point $x$ in the boundary of $K$ there is a translation vector $y$ such that the translate of $L$ by $y$ (i.e., the Minkowski sum $L+y$) contains $x$ and satisfies $L+y\subset K$. We provide the following geometric characterization of mixability and fundamentality, as a geometric comparison to the log loss (see Figure 1). Let $\textnormal{spr}(\ell)$ denote the superprediction set of a loss function $\ell$ (defined in Section 1.3).

Theorem 1.1 (Informal statement).

A twice continuously differentiable proper loss function $\ell$ is mixable if and only if there is $\eta>0$ such that $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\ell_{\log})$; in that case $\ell$ is $\eta$-mixable. In addition, the same $\ell$ is fundamental if and only if there exists $\gamma>0$ such that $\textnormal{spr}(\ell_{\log})$ slides freely inside $\textnormal{spr}(\gamma\ell)$.
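The sliding-freely notion used in the statement can be illustrated on a toy case that is not taken from the paper: two Euclidean discs $L=B(0,r)$ and $K=B(0,R)$. The sketch below assumes the explicit translation $y=(1-r/R)\,x$ for a boundary point $x$ of $K$ and checks that $x\in L+y$ and $L+y\subset K$; the function name is hypothetical.

```python
import math


def translated_disc_touches_inside(r, R, theta):
    """Check x in L + y and L + y inside K for discs L = B(0, r), K = B(0, R).

    x is the boundary point of K at angle theta and y = (1 - r/R) x is the
    candidate translation vector.
    """
    x = (R * math.cos(theta), R * math.sin(theta))
    y = ((1 - r / R) * x[0], (1 - r / R) * x[1])
    touches = abs(math.dist(x, y) - r) < 1e-9   # x lies on the boundary of L + y
    inside = math.hypot(*y) + r <= R + 1e-9     # farthest point of L + y stays in K
    return touches and inside


angles = [2 * math.pi * k / 12 for k in range(12)]
small_slides = all(translated_disc_touches_inside(0.5, 1.0, a) for a in angles)
large_fails = not translated_disc_touches_inside(1.5, 1.0, 0.3)
print(small_slides, large_fails)
```

For $r\leq R$ the check succeeds at every boundary point, so the smaller disc (which has the larger curvature $1/r\geq 1/R$) slides freely inside the larger one. For $r>R$ the code only shows that this particular translation fails; that no translation works follows from the fact that a disc of radius $r$ cannot fit inside one of radius $R<r$.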

To obtain the previous theorem it is necessary to re-interpret properness from a differential geometry point of view, which constitutes a large part of this work. However, this technical effort pays off. In [vERW12], van Erven, Reid and Williamson characterized $\eta$-mixable (differentiable) loss functions in the multi-class setting and, moreover, related $\eta$ to the Hessians of the Bayes risks of $\ell$ and of the log loss (see Definition 1.3), which are interpreted as curvatures. By generalizing the tools developed here for the binary case, we obtain a multi-class analogue of Theorem 1.1 and build a bridge to the results in [vERW12].

1.2. Description of results and structure of the article

Using the same setting as [vERW12], we obtain a geometric characterization of $\eta$ -mixable loss functions in the sense of differential geometry. Loss functions are considered to be maps $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ , which under the conditions assumed in this work, give rise to submanifolds $\ell(\textnormal{relint}(\Delta^{n}))$ of $\mathbb{R}^{n}$ whose geometric properties are determined by $\ell$ (see the relevant precise definitions below). We first discuss the case $n=2$ (binary classification loss functions) since it is more instructive, and then the case $n\geq 2$ . We summarize the main results as follows.

(1) We recast the notion of a (strictly) proper loss as a geometric property of the loss itself rather than of its superprediction set. That is, properness is no longer considered a parametrization-dependent property; it is a statement about the geometric properties of the “loss surface” $\ell(\textnormal{relint}(\Delta^{n}))$ (the boundary of the superprediction set). See Lemmas 2.7 and 3.2.

(2) A geometric comparison is performed: for $n=2$ in terms of the curvature of the “loss curves” (see Section 1.5 below), and for $n\geq 2$ in terms of the scalar second fundamental form of the “loss surfaces” (see Section 3 and Appendix A), which measure how they curve inside $\mathbb{R}^{n}$. The precise statements are given in Lemma 2.13 and Lemma 3.6. Intuitively, these results tell us how the superprediction set of $\ell$ sits inside the superprediction set of the log loss.

(3) Finally, we interpret our result from the point of view of convex analysis to give a new characterization of mixability. More precisely, we show that a (strictly) proper loss function $\ell$ is $\eta$-mixable if and only if the superprediction set of the scaled loss $\eta\ell$ slides freely (see Definition 4.11) inside the superprediction set of the log loss.

As byproducts, we obtain a general way to define mixability with respect to a fixed (strictly) proper loss function, further properties and consequences for binary classification loss functions, particularly for composite losses and canonical links, and a bridge to the results obtained in [vERW12].

Since we treat loss functions from the point of view of differential geometry and convex geometry, a considerable background in these topics is needed. We have tried to make this work as self-contained as possible and spend some time providing the intuition and motivation for the results (and sometimes the background), which naturally results in a longer exposition. In Section 2 we treat the binary case and in Section 3 the multi-class case; in both we obtain the geometric interpretation of properness and mixability and perform the geometric comparison (in terms of curvature). In Section 4 we make the connections to convex geometry and obtain the geometric characterization of mixability in terms of the sliding-freely condition on superprediction sets.

1.3. Setup

Here we summarize our setup, for more details see [vERW12]. Denote by $[n]$ the set of natural numbers $\{1,...,n\}$ . The set of probability distributions on a finite set $\mathcal{Y}$ with $|\mathcal{Y}|=n\in\mathbb{N}$ is given by

$$\Delta^{n}=\left\{(p_{1},\dots,p_{n})\in\mathbb{R}^{n}\,\bigg|\,\sum_{i=1}^{n}p_{i}=1\right\}.$$

We note that $\Delta^{n}$ is a manifold with (non-smooth) boundary of dimension $n-1$ . Moreover, $\Delta^{n}$ is a hypersurface in $\mathbb{R}^{n}$ ; we denote the interior (as a manifold) of $\Delta^{n}$ as $\textnormal{int}(\Delta^{n})$ which is the same set as the relative interior $\textnormal{relint}(\Delta^{n})$ of $\Delta^{n}$ . We define the standard parametrization of $\Delta^{n}$ as the map $\Phi_{\textnormal{std}}\colon\Delta^{n-1}\subset\mathbb{R}^{n-1}\longrightarrow\Delta^{n}$ given by

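$$\Phi_{\textnormal{std}}(t_{1},\dots,t_{n-1})=\Bigl(t_{1},\dots,t_{n-1},\,1-\sum_{i=1}^{n-1}t_{i}\Bigr).$$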

In particular, when $n=2$ the standard parametrization of $\Delta^{2}$ is the map $\Phi_{\textnormal{std}}\colon[0,1]\longrightarrow\Delta^{2}$ given by $\Phi_{\textnormal{std}}(t)=(t,1-t)$ .

Definition 1.2.

A loss function is a map $\ell\colon\Delta^{n}\times\mathcal{Y}\longrightarrow\mathbb{R}_{\geq 0}$ such that for each $k\in\mathcal{Y}$ , the map $\ell(\cdot,k)\colon\Delta^{n}\longrightarrow\mathbb{R}$ is continuous.

Given a loss function $\ell$ , $p\in\Delta^{n}$ and $k\in\mathcal{Y}$ , the value $\ell(p,k)$ represents the penalty of predicting $p$ upon observing $k$ . We define the partial losses of a loss function $\ell$ as the maps $\ell_{i}\colon\Delta^{n}\longrightarrow\mathbb{R}_{\geq 0}$ given by $\ell_{i}(p)=\ell(p,i)$ . A loss function can be described in terms of its partial losses as

$$\ell(p,k)=\sum_{i=1}^{n}[\![k=i]\!]\,\ell_{i}(p).$$

Thus, we can identify a loss function $\ell$ with the map $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ determined by its partial losses

$$\ell(p)=(\ell_{1}(p),\dots,\ell_{n}(p)).$$

In this work we follow this convention unless stated otherwise. Note that in this way we can see a loss function $\ell$ as an embedding of $\textnormal{int}(\Delta^{n})$ into $\mathbb{R}^{n}_{\geq 0}$ (assuming enough properties on $\ell$). We will see later that properness ensures that the image of this embedding is a nice hypersurface of $\mathbb{R}^{n}$ with appealing geometric properties. Under the assumption that the outcomes are distributed with probability $p\in\Delta^{n}$, we make the following definitions, following [vERW12, RW10].

Definition 1.3.

Given a loss function $\ell$, we define the conditional risk as the map $L:\Delta^{n}\times\Delta^{n}\longrightarrow\mathbb{R}$ given by

$$L(p,q):=\langle\ell(q),p\rangle,$$

and the associated conditional Bayes risk as the map $\underline{L}:\Delta^{n}\longrightarrow\mathbb{R}$ given by

$$\underline{L}(p):=\inf_{q\in\Delta^{n}}L(p,q)=\inf_{q\in\Delta^{n}}\langle\ell(q),p\rangle.$$

Definition 1.4.

A loss function $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ is said to be proper if for any $p\in\Delta^{n}$

$$\langle\ell(p),p\rangle\leq\langle\ell(q),p\rangle$$

for all $q\in\Delta^{n}$ . In other words, $L(p,\cdot)$ has a minimum at $p$ . When $p$ is the only minimum of $L(p,\cdot)$ we say that $\ell$ is strictly proper.
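Properness can be probed numerically for binary losses by minimizing $L(p,\cdot)$ over a grid. The sketch below is illustrative only: the grid search, the function names, and the "linear" loss (a hypothetical improper example with $\ell_{i}(q)=1-q_{i}$) are assumptions, and the Brier score is written in one common form, $\ell_{i}(q)=\sum_{j}(\delta_{ij}-q_{j})^{2}$.

```python
import math


def log_loss(q):
    return (-math.log(q[0]), -math.log(q[1]))


def brier_loss(q):
    # Partial losses sum_j (delta_ij - q_j)^2 for i = 1, 2.
    return ((1 - q[0]) ** 2 + q[1] ** 2, q[0] ** 2 + (1 - q[1]) ** 2)


def linear_loss(q):
    # Hypothetical improper example: penalty 1 - q_i when outcome i occurs.
    return (1 - q[0], 1 - q[1])


def conditional_risk(loss, p, q):
    """L(p, q) = <loss(q), p>."""
    lq = loss(q)
    return p[0] * lq[0] + p[1] * lq[1]


def grid_minimizer(loss, p, steps=999):
    """Return the grid value t minimizing L(p, (t, 1 - t)) over t in (0, 1)."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda t: conditional_risk(loss, p, (t, 1 - t)))


p = (0.7, 0.3)
t_log = grid_minimizer(log_loss, p)
t_brier = grid_minimizer(brier_loss, p)
t_linear = grid_minimizer(linear_loss, p)
print(t_log, t_brier, t_linear)
```

For the log loss and the Brier score the minimizer coincides with $p_{1}=0.7$, as properness requires. For the linear loss the conditional risk is $0.7-0.4\,t$, so the minimizer is pushed to the end of the grid and the loss is not proper.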

For our geometric considerations it will be useful to denote the image of $\Delta^{n}$ under $\ell$ by $M_{\ell}$ , and impose enough differentiability conditions on $\ell$ so that $M_{\ell}$ is (at least) a $C^{2}$ -manifold. See Definitions 2.1 and 3.1 below.

We now recall the definition of mixability (see for example, Vovk [Vov15, vERW12]). For $\eta>0$ , let $E_{\eta}\colon\mathbb{R}^{n}\longrightarrow\mathbb{R}^{n}$ be the $\eta$ -exponential projection defined as

$$E_{\eta}(y):=(e^{-\eta y_{1}},\dots,e^{-\eta y_{n}}).$$

A loss function $\ell$ is called $\eta$ -mixable if the image of its superprediction set, $\textnormal{spr}(\ell)$ , given by

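$$\textnormal{spr}(\ell):=\{\lambda\in[0,\infty)^{n}\mid\text{there is }q\in\Delta^{n}\text{ such that }\ell_{i}(q)\leq\lambda_{i}\text{ for }i\in[n]\},$$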

is convex under the $\eta$ -exponential projection, that is $E_{\eta}(\textnormal{spr}(\ell))\subset[0,1]^{n}$ is convex. We say that $\ell$ is mixable if $\ell$ is $\eta$ -mixable for some $\eta>0$ .
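For the binary log loss the projected superprediction set can be written down explicitly, which gives a quick sanity check of the definition. The sketch below rests on the following observation, derived here rather than quoted from the paper: a point $(x,y)\in(0,1]^{2}$ lies in $E_{\eta}(\textnormal{spr}(\ell_{\log}))$ exactly when $x^{1/\eta}+y^{1/\eta}\leq 1$, since one needs $q\in\Delta^{2}$ with $q_{i}\geq x_{i}^{1/\eta}$. The function names are hypothetical.

```python
def in_projected_spr_log(x, y, eta):
    """Membership of (x, y) in E_eta(spr(log loss)) for x, y in (0, 1]."""
    return x ** (1.0 / eta) + y ** (1.0 / eta) <= 1.0 + 1e-12


def boundary_point(t, eta):
    """E_eta applied to the log loss at (t, 1 - t), i.e. (t**eta, (1 - t)**eta)."""
    return (t ** eta, (1 - t) ** eta)


def midpoint_stays_inside(eta, t0=0.1, t1=0.9):
    """Test whether the midpoint of two projected boundary points stays in the set."""
    a, b = boundary_point(t0, eta), boundary_point(t1, eta)
    mid = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
    return in_projected_spr_log(mid[0], mid[1], eta)


for eta in (0.5, 1.0, 2.0):
    print(eta, midpoint_stays_inside(eta))
```

A failing midpoint is a genuine certificate of non-convexity, so the check shows that the log loss is not $2$-mixable. A passing midpoint is only consistent with convexity, not a proof of it. For $\eta=1$ the projected boundary is the line segment $x+y=1$, so the set is indeed convex.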

Definition 1.5.

Let $\ell$ be a mixable loss function. The mixability constant of $\ell$ , $\eta^{*}_{\ell}$ , is defined as

$$\eta^{*}_{\ell}:=\sup_{\eta>0}\{\eta>0\mid\ell\text{ is }\eta\text{-mixable}\}.$$

1.4. Motivation

In this part we mainly discuss the case $n=2$ since it is more illustrative. It has been made evident that there is a strong relation between properness and mixability. Here we make this relation more explicit and transparent from a geometric point of view. The basic motivation is as follows. It is commonly understood that properness is a property that depends on the parametrization of the boundary of the superprediction set of $\ell$ [Vov15]. It has also been shown that it is related to the “curvature” of the Bayes risk, since it requires that the superprediction set remains convex under the $\eta$-exponential projection $E_{\eta}$ defined in Section 1.3 (with the standard parametrization of the simplex $\Delta^{2}$) [BSS05, RW10, vERW12]. Mixability is considered to be a stronger notion of convexity [Vov15], for some $\eta>0$. The basic observation in this work is that it is possible to recast properness from a geometric point of view, i.e., independently of the parametrization of $\Delta^{n}$. More precisely, we define properness in terms of the loss function viewed as a map $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ rather than in terms of the superprediction set $\textnormal{spr}(\ell)$ (as it is usually defined). In other words, to determine whether a given $\ell$ is proper or not, it is not enough to look at the image $\ell(\Delta^{n})$ (as the boundary of $\textnormal{spr}(\ell)$); one must look at how $\Delta^{n}$ is mapped into $\mathbb{R}^{n}_{\geq 0}$ by $\ell$. Since we will be using tools of differential geometry, we will assume $C^{2}$ differentiability (see Section 2). Restricting first to $n=2$ (see Lemma 2.7 below), a given loss function $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ will be (strictly) proper if and only if

(1) the normal vector to $\ell(\Delta^{2})$ at $\ell(p)$ is equal to $\pm p/|p|$ for all $p\in\textnormal{int}(\Delta^{2})$, and

(2) the curvature (see Section 1.5 below) at any point $\ell(p)$ with respect to the unit normal vector $\mathbf{n}=p/|p|$ is strictly positive for all $p\in\textnormal{int}(\Delta^{2})$.

As observed in Figure 2, $\textnormal{spr}(\ell_{1})=\textnormal{spr}(\ell_{2})$, which implies that their boundaries coincide (as sets). In particular, this implies that it is possible to “parametrize” the boundary of $\ell_{2}(\Delta^{2})$, $\partial(\ell_{2}(\Delta^{2}))$, in the same way as $\partial(\ell_{1}(\Delta^{2}))$ in order to have a proper loss. However, note that this changes the map $\ell_{2}$ and hence, from the point of view of this work, this is a different loss function. In practice, one is given a loss function $\ell$ rather than a superprediction set $\textnormal{spr}(\ell)$; therefore we look at losses as individual maps from $\Delta^{2}$ to $\mathbb{R}^{2}_{\geq 0}$ instead of looking at their superprediction sets and obtaining a proper loss by choosing a convenient parametrization of $\partial(\textnormal{spr}(\ell))$.

Remark 1.6.

The advantage of characterizing proper loss functions in this way is that we can apply techniques from differential geometry; however, these considerations only work for loss functions which are sufficiently differentiable. In a general setup, it is possible to characterize properness of a loss function in a fairly simple way via the convexity of its superprediction set. More precisely, the “loss surface” is the subgradient of the support function of the superprediction set. This was thoroughly studied by Williamson and Cranko in [WC22]. We briefly explore some connections to our work in Section 4. Alternative approaches to extending and better understanding mixability include [RFWM15] and [MW18].

1.5. Comments about the curvature of planar curves

The second condition for $\ell$ to be proper mentioned above involves a condition on the curvature of $\ell(\textnormal{int}(\Delta^{2}))$. We now make this notion precise. Recall that if $\alpha(t)=(x_{1}(t),x_{2}(t))$ is a $C^{2}$ curve with $\alpha^{\prime}(t)=(x_{1}^{\prime}(t),x_{2}^{\prime}(t))\neq(0,0)$ for all $t$ in its domain, then its curvature can be seen as a measurement of the variation of its unit normal vector at each point. We define the canonical normal vector at $\alpha(t)$, $\mathbf{n}^{c}(t)$, as the unit normal vector in the direction obtained by rotating $\alpha^{\prime}(t)$ by $90^{\circ}$ counterclockwise. Then, the signed curvature of $\alpha$ at $t$ is defined as

$$\kappa_{\alpha}(t):=\frac{x_{1}^{\prime}(t)\,x_{2}^{\prime\prime}(t)-x_{1}^{\prime\prime}(t)\,x_{2}^{\prime}(t)}{\left(x_{1}^{\prime}(t)^{2}+x_{2}^{\prime}(t)^{2}\right)^{3/2}}.$$

The interpretation of this number is as follows: $\kappa_{\alpha}(t)$ is positive if $\alpha$ “curves” in the direction of $\mathbf{n}^{c}(t)$. However, note that at each point we have two unit normal vectors: $\pm\mathbf{n}^{c}(t)$. Both $\mathbf{n}^{c}(t)$ and $\kappa_{\alpha}$ depend on the direction of $\alpha$ (i.e., on $\alpha^{\prime}$): reversing the direction flips the sign of each. Thus, we can talk about the curvature of $\alpha$ with respect to a chosen unit normal vector $\mathbf{n}$ (either choosing $\mathbf{n}^{c}$ or $-\mathbf{n}^{c}$ for all points, assuming this is possible, which is the case for the curves we will consider here, see Figure 3) and denote it by $\kappa_{\alpha}^{+}$. When $\mathbf{n}=\mathbf{n}^{c}$ we have $\kappa_{\alpha}^{+}=\kappa_{\alpha}$, and when $\mathbf{n}=-\mathbf{n}^{c}$ we have $\kappa_{\alpha}^{+}=-\kappa_{\alpha}$. Since $\kappa_{\alpha}$ is invariant under reparametrizations (up to a sign), we can simply talk about the curvature of $\alpha$ at a given point $p$ in the image of $\alpha$. In Section 2 we make precise our choice in (2) above. For a summary of the geometry of curves see Appendix A.
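The sign conventions above can be made concrete on the log loss curve. The sketch below assumes the convention in which the signed curvature is positive when the curve bends towards the counterclockwise normal, i.e. the numerator $x_{1}^{\prime}x_{2}^{\prime\prime}-x_{1}^{\prime\prime}x_{2}^{\prime}$; the function names are hypothetical, and the closed form is the expression for $\kappa^{+}_{\ell_{\log}}$ listed among the paper's equations.

```python
import math


def signed_curvature(d1, d2):
    """Signed curvature from d1 = (x1', x2') and d2 = (x1'', x2'')."""
    num = d1[0] * d2[1] - d2[0] * d1[1]
    return num / (d1[0] ** 2 + d1[1] ** 2) ** 1.5


def log_loss_curvature(t):
    """Curve t -> (-ln t, -ln(1 - t)), i.e. the log loss under Phi_std(t) = (t, 1 - t)."""
    d1 = (-1.0 / t, 1.0 / (1.0 - t))
    d2 = (1.0 / t ** 2, 1.0 / (1.0 - t) ** 2)
    return signed_curvature(d1, d2)


def reversed_log_loss_curvature(t):
    """Same curve traversed the other way, t -> (-ln(1 - t), -ln t)."""
    d1 = (1.0 / (1.0 - t), -1.0 / t)
    d2 = (1.0 / (1.0 - t) ** 2, 1.0 / t ** 2)
    return signed_curvature(d1, d2)


def closed_form(t):
    """t (1 - t) / (t^2 + (1 - t)^2)^(3/2)."""
    return t * (1.0 - t) / (t ** 2 + (1.0 - t) ** 2) ** 1.5


# Unit circle traversed counterclockwise at angle 0.4: curvature +1.
circle = signed_curvature((-math.sin(0.4), math.cos(0.4)),
                          (-math.cos(0.4), -math.sin(0.4)))
k_std = log_loss_curvature(0.3)
k_rev = reversed_log_loss_curvature(0.3)
print(circle, k_std, k_rev, closed_form(0.3))
```

Under the standard parametrization the signed curvature of the log loss curve is negative, and reversing the direction of travel flips its sign, while the curvature with respect to the normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ is the same positive number in both cases.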

Going back to loss functions, suppose $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ is a loss function. Since $\Delta^{2}$ is a $1$-manifold, any parametrization around a point (of its interior) can be assumed to be of the form $\Phi\colon(a,b)\subset\mathbb{R}\longrightarrow\Delta^{2}$ for some $a<b$. Thus, the local expression of $\ell$ under this parametrization, $\widetilde{\ell}=\ell\circ\Phi$, is a curve in $\mathbb{R}^{2}$. By changing $\Phi$ around the same point, we are reparametrizing $\widetilde{\ell}$. Since curvature is independent of coordinates (i.e., of the $\Phi$ used) up to a sign, we can define the curvature of the loss curve $\ell(\textnormal{int}(\Delta^{2}))$ with respect to a chosen unit normal vector (which will depend only on $\ell$). To compute it from the definition of the signed curvature above, we need to choose a parametrization $\Phi$, and as we will see, it is often convenient to take $\Phi=\Phi_{\textnormal{std}}$.

Remark 1.7.

One could avoid part of the technical complications above by choosing $\Phi=\Phi_{\textnormal{std}}$ beforehand, as is usually done implicitly, and then requiring $\ell_{1}$ and $\ell_{2}$ to be monotone (cf. [BSS05, RW10, SAM66, Vov15]); essentially, this amounts to choosing a “direction” for the admissible loss curves. This approach is appealing since the curve parameter ($t$ in our case) can be directly interpreted as a probability, and it simplifies calculations since the convention can be chosen so that the signed curvature coincides with $\kappa^{+}$ (see for example [Vov15]). However, in the multi-class case the notion of “direction” breaks down and it is not clear which properties of $M_{\ell}=\ell(\Delta^{n})$ one should consider. The approach we take here gives a concrete logical path to a generalization to the multi-class case (see Section 3).

1.6. Reconciling this point of view with previous works

In this part we explain how to “translate” the results we obtain here to previous results regarding proper losses and mixability. We do this in particular with [RW10] and [Vov15].

• Reid–Williamson [RW10]. Let $\Phi=\Phi_{\textnormal{std}}$. The parameter $\widehat{\eta}$ in [RW10] corresponds to the parameter $t$ here, and $\ell_{1}(\widehat{\eta})$ and $\ell_{-1}(\widehat{\eta})$ correspond to $\widetilde{\ell}_{1}(t)$ and $\widetilde{\ell}_{2}(t)$, respectively. Although the regularity assumption in [RW10] is initially only differentiability of the partial losses, when discussing the weight of a loss function they impose $C^{2}$ regularity. From Theorem 1 in [RW10], we see that a loss $\ell$ is proper if (in particular) $\ell_{-1}^{\prime}>0$ and $\ell_{1}^{\prime}<0$. We can heuristically say that $\ell$ goes from “right” to “left”. This means that in this case, $\kappa^{+}_{\ell}(\widehat{\eta})=-\kappa_{\ell}(\widehat{\eta})$. The log loss in this case is $\ell_{\log}(\widehat{\eta})=\left(-\ln(\widehat{\eta}),-\ln(1-\widehat{\eta})\right)$.

• Vovk [Vov15]. In [Vov15] the loss functions are defined as maps $(\lambda_{0}(p),\lambda_{1}(p))$, with $\lambda_{0}$ increasing and $\lambda_{1}$ decreasing (and infinitely differentiable). In this case, heuristically, losses go from “left” to “right” so that $\kappa^{+}_{\lambda}(p)=\kappa_{\lambda}(p)$. To relate this convention to ours, we set $\Phi(t)=(1-t,t)$. Then the parameter $p$ in [Vov15] corresponds to $t$, and $\lambda_{0}$ and $\lambda_{1}$ correspond to $\widetilde{\ell}_{1}$ and $\widetilde{\ell}_{2}$. The log loss is then given by $\lambda(p)=\left(-\ln(1-p),-\ln(p)\right)$.

Therefore, from our point of view, in previous works there is an implicit choice of a parametrization of $\Delta^{2}$, motivated in particular by the wish to interpret the parameter as a probability. However, it is well known that sometimes this might not be the case and a link function is needed [RW10]. This fits well with our approach, since for us a link function is a different choice of parametrization; this will be carefully explained in Section 2.7. In favor of the study of loss functions using tools from differential geometry, we are then motivated to eliminate this choice of parametrization and consider $\ell$ as a map between manifolds (namely, $\textnormal{int}(\Delta^{2})$ and $\ell(\textnormal{int}(\Delta^{2}))$ as a submanifold of $\mathbb{R}^{2}$). Although picking a general parametrization of $\Delta^{2}$ complicates the interpretation of the parameter, it makes other properties of loss functions transparent. To the knowledge of the authors, this approach has never been explored. We remark, however, that one can always set $\Phi=\Phi_{\textnormal{std}}$ and reinterpret the results of this work with the parameter being a probability. With this geometric characterization of loss functions and properness at hand, we continue to study mixability.

2. Properness and Mixability for Binary Classification

We first restrict our discussion to binary classification, i.e., setting $n=2$ . Thus, we consider maps $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ , where $\Delta^{2}=\{(p_{1},p_{2})\in\mathbb{R}^{2}\,|\,p_{1}+p_{2}=1\}$ , with partial losses $\ell_{1}(p)$ and $\ell_{2}(p)$ . In this case the standard parametrization of $\Delta^{2}$ is given by $\Phi_{\textnormal{std}}(t)=(t,1-t)$ for $t\in[0,1]$ . When a parametrization of $\Delta^{2}$ , say $\Phi$ , is chosen, then the local expression of $\ell$ with respect to $\Phi$ ( $\widetilde{\ell}=\ell\circ\Phi$ ) is a map from some interval $I\subset\mathbb{R}$ to $\mathbb{R}^{2}$ , that is, a curve in the plane $\mathbb{R}^{2}$ .

Dating back to [HKW95, Vov98] it has been established that properness of a loss function imposes strong conditions on the first and second derivatives of its partial losses. In [Vov15] these relations were expressed by means of the curvature of the loss curve. Moreover, in [BSS05, RW10] properness is related to the second derivative of its Bayes risk, which in a way can be interpreted as its curvature. However, in these works there is always an implicit choice of parametrization of $\Delta^{2}$, which in turn imposes certain restrictions on the “admissible” loss functions, in particular making the results parametrization-dependent. In this section, we first recast properness as a geometric property, which allows us to obtain results in a parametrization-independent (or coordinate-independent) way.

Definition 2.1.

An admissible loss function is a map $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ such that

(i) $\ell(\textnormal{int}(\Delta^{2}))\subset\mathbb{R}^{2}_{\geq 0}$ is a $1$-manifold of class $C^{2}$,

(ii) there exists a differentiable map $\mathbf{n}\colon\ell(\textnormal{int}(\Delta^{2}))\to N\ell(\textnormal{int}(\Delta^{2}))$, $\mathbf{n}(\ell(p))=\mathbf{n}_{\ell(p)}$, where $N\ell(\textnormal{int}(\Delta^{2}))$ is the normal space of $\ell(\textnormal{int}(\Delta^{2}))$, and

(iii) $\mathbf{n}_{\ell(p)}$ or $-\mathbf{n}_{\ell(p)}$ belongs to $\mathbb{R}^{2}_{>0}$ for all $p\in\textnormal{int}(\Delta^{2})$.

We denote the set of admissible loss functions as $\mathcal{L}$ .

Remark 2.2.

We give the following interpretation of the previous definition. (i) simply says that the loss curve (once parametrized) is twice differentiable with continuous second derivatives. (ii) prevents some “anomalies” of $\ell$; for example, $\ell$ cannot be constant on a neighborhood of a point. (iii) defines a subfamily of loss curves which are not allowed to vary “too much”. This definition should be compared to the definition of loss functions in Section 2 in [Vov15].

Definition 2.3.

Let $\ell\in\mathcal{L}$. Let $\mathbf{n}\colon\ell(\textnormal{int}(\Delta^{2}))\to N\ell(\textnormal{int}(\Delta^{2}))$ be the map that assigns to each $\ell(p)$ the unit normal vector to $M_{\ell}$ at $\ell(p)$ that lies in $\mathbb{R}^{2}_{\geq 0}$. We denote by $\kappa_{\ell}^{+}(\cdot)$ the signed curvature of the loss curve with respect to the unit normal belonging to $\mathbb{R}^{2}_{\geq 0}$. We refer to $\kappa_{\ell}^{+}(\cdot)$ as the curvature with respect to the unit normal vector pointing towards $\mathbb{R}^{2}_{\geq 0}$.

2.1. Proper losses

Lemma 2.4.

Suppose that $\ell\in\mathcal{L}$ is strictly proper. Then the signed curvature of the loss curve $\ell(\Delta^{2})$ has a constant sign. Moreover, its curvature $\kappa_{\ell}$ is positive with respect to the unit normal vector (field) pointing towards $\mathbb{R}^{2}_{\geq 0}$.

Proof.

Let $p_{0}\in\textnormal{int}(\Delta^{2})$ and let $\Phi\colon I\subset\mathbb{R}\longrightarrow\Delta^{2}$ be a parametrization of $\Delta^{2}$ around $p_{0}=\Phi(t_{0})$, for some $t_{0}\in I$, which we use to obtain a parametrization of $\Delta^{2}\times\Delta^{2}$ around $(p_{0},p_{0})$. (This particular choice of coordinates around $(p_{0},p_{0})$ suffices since we want to conclude something about the curvature of the loss curve $\ell$.) We consider the local expression of $L$ given by

[TABLE]

Using strict properness we know that fixing $t$ , the function $\widetilde{L}(t,\cdot)$ achieves a minimum at $s=t$ (and it is the only one), that is

[TABLE]

To compute the sign of the signed curvature of $\ell(\Delta^{2})$ it is enough to determine the sign of $\widetilde{\ell}_{1}^{\prime}(t)\widetilde{\ell}_{2}^{\prime\prime}(t)-\widetilde{\ell}_{1}^{\prime\prime}(t)\widetilde{\ell}_{2}^{\prime}(t)$ . Without loss of generality, assuming $\Phi_{2}\neq 0$ on this coordinate neighborhood we can write

[TABLE]

where we have used \[email protected] and \[email protected]. Notice that if $\widetilde{\ell}_{1}^{\prime}(t)=0$ for some $t$ then necessarily $\widetilde{\ell}_{2}^{\prime}(t)=0$ by \[email protected], which is impossible in $\mathcal{L}$ . Therefore $\widetilde{\ell}_{1}^{\prime}$ has a sign and this sign determines the sign of the signed curvature of $\ell(\Delta^{2})$ .

For the second statement, notice that again using \[email protected] we know that $\widetilde{\ell}_{1}^{\prime}$ and $\widetilde{\ell}_{2}^{\prime}$ have different signs (and they do not change). If $\widetilde{\ell}_{1}^{\prime}>0$ , then the first coordinate increases and the second decreases, hence $\mathbf{n}(t)$ points towards $\mathbb{R}^{2}_{\geq 0}$ and $\kappa_{\widetilde{\ell}}>0$ . If $\widetilde{\ell}_{1}^{\prime}<0$ , then we are in the opposite case: $\mathbf{n}(t)$ points towards $\mathbb{R}^{2}_{\leq 0}$ and $\kappa_{\widetilde{\ell}}<0$ , thus the signed curvature with respect to $-\mathbf{n}(t)$ (the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ ) is positive. ∎

From the proof of the previous lemma we obtain the following corollary.

Corollary 2.5.

Let $\ell\in\mathcal{L}$ . If $\ell$ is proper, then $p\in\textnormal{int}(\Delta^{2})$ is normal to the loss curve $\ell(\Delta^{2})$ at $\ell(p)$ .

Proof.

It follows directly from \[email protected], since for fixed $p\in\textnormal{int}(\Delta^{2})$ , $\langle\ell(q),p\rangle$ attains a minimum at $p$ . ∎

Lemma 2.6.

In $\mathcal{L}$ , proper implies strictly proper.

Proof.

Let $p\in\textnormal{int}(\Delta^{2})$ , and suppose that there is $p^{*}\neq p$ in $\textnormal{int}(\Delta^{2})$ , such that

[TABLE]

Using \[email protected], we see that $p^{*}$ is normal to $\ell$ at $\ell(p)$ , and hence $p$ and $p^{*}$ are parallel. Since both belong to $\Delta^{2}$ , it follows that $p^{*}=p$ , which is a contradiction. ∎

Therefore, in what follows (as long as we stay within $\mathcal{L}$ ) we will use proper and strictly proper interchangeably.

Note that the converse of Lemma 2.4 does not hold. That is, there are $\ell\in\mathcal{L}$ which have positive signed curvature (with respect to the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ ), but are not proper. Indeed, let $\ell$ be defined as

[TABLE]

Taking the (standard) parametrization $\Phi_{\textnormal{std}}(t)=(t,1-t)$ we see that the loss curve $\widetilde{\ell}$ goes from left to right, so $\mathbf{n}_{\widetilde{\ell}(t)}$ points towards $\mathbb{R}^{2}_{\geq 0}$ . Moreover, we can readily see that the (signed) curvature $\kappa_{\widetilde{\ell}}$ is positive. However, $\Phi_{\textnormal{std}}(t)$ is not normal to $\widetilde{\ell}$ at $\widetilde{\ell}(t)$ , thus by Corollary 2.5, $\ell$ cannot be proper.

Therefore, we obtain the following characterization of proper losses in $\mathcal{L}$ .

Lemma 2.7.

Let $\ell\in\mathcal{L}$ . $\ell$ is strictly proper if and only if $p$ is normal to the loss curve $\ell(\Delta^{2})$ at $\ell(p)$ for all $p\in\textnormal{int}(\Delta^{2})$ and the signed curvature of $\ell(\Delta^{2})$ with respect to the normal vector pointing towards $\mathbb{R}^{2}_{\geq 0}$ is positive at all points $\ell(p)$ for $p\in\textnormal{int}(\Delta^{2})$ .

Proof.

The “only if” part follows from Lemma 2.4 together with Corollary 2.5. For the “if” part, let $\ell\in\mathcal{L}$ be such that

[TABLE]

where $\kappa_{\ell}^{+}$ is the signed curvature of $\ell$ with respect to the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ . Let $p\in\textnormal{int}(\Delta^{2})$ and let $\Phi$ be a parametrization around $p$ . We readily see that \[email protected] implies that

[TABLE]

while \[email protected] implies $\partial_{ss}\widetilde{L}(t,s)|_{s=t}>0$ by the proof of Lemma 2.4. This implies that, for fixed $t$ , the function $\widetilde{L}(t,\cdot)$ achieves its minimum at $s=t$ . Then $\ell$ is proper and, by Lemma 2.6, we conclude it is strictly proper. ∎

Remark 2.8.

Notice that to check whether a given loss function $\ell\in\mathcal{L}$ is proper or not, it suffices to do it in any coordinate system. That is, given $\Phi$ , we check conditions \[email protected] and \[email protected] for $\widetilde{\ell}=\ell\circ\Phi$ .
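The first of the two conditions (normality of $p$ to the loss curve, as in Corollary 2.5) can be tested numerically in any chart. The following is a minimal sketch, not the authors' procedure: the chart `phi_sigmoid`, the three example losses, and the finite-difference step are assumptions made for illustration, and the curvature condition is not checked here.

```python
import numpy as np

def phi_std(t):
    """Standard parametrization t -> (t, 1 - t) of the simplex."""
    t = np.asarray(t, dtype=float)
    return np.stack([t, 1.0 - t], axis=-1)

def phi_sigmoid(s):
    """Hypothetical alternative chart s -> (sigmoid(s), 1 - sigmoid(s))."""
    q = 1.0 / (1.0 + np.exp(-np.asarray(s, dtype=float)))
    return np.stack([q, 1.0 - q], axis=-1)

def log_loss(p):
    """Partial losses -log(p_i), evaluated componentwise."""
    return -np.log(p)

def square_loss(p):
    """Partial losses (1 - p_i)^2, evaluated componentwise."""
    return (1.0 - p) ** 2

def absolute_loss(p):
    """Partial losses 1 - p_i; used here as an example that fails the test."""
    return 1.0 - p

def normality_defect(loss, phi, ts, h=1e-5):
    """Largest |<d/dt loss(phi(t)), phi(t)>| over ts, via central differences."""
    ts = np.asarray(ts, dtype=float)
    velocity = (loss(phi(ts + h)) - loss(phi(ts - h))) / (2.0 * h)
    return float(np.max(np.abs(np.sum(velocity * phi(ts), axis=-1))))
```

For the log loss and the square loss the defect is zero up to discretization error in both charts, while for the absolute loss in the standard chart it equals $\max_t |1-2t|$ on the sampled points, so the normality condition fails there.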

2.2. Mixable loss functions

We say that a loss function $\ell$ is fair if $\ell_{1}(p)\to 0$ as $p\to(0,1)$ and $\ell_{2}(p)\to 0$ as $p\to(1,0)$ (this is motivated by the interpretation when using the standard parametrization, see [RW10]). In addition, recall that a loss function $\ell\in\mathcal{L}$ is proper if and only if

(i)

$\mathbf{n}_{\ell(p)}=\frac{p}{|p|}$ can be chosen, and

(ii)

$\kappa_{\ell}^{+}(p)>0$

for all $p\in\textnormal{int}(\Delta^{2})$ .

Thus, a prototype of a fair proper loss function is shown in Figure 4.

Recall from Section 1 that mixability is defined in terms of the superprediction set $\textnormal{spr}(\ell)$ of $\ell$ . More precisely, for $\eta>0$ , consider the set

[TABLE]

where $E_{\eta}\colon\mathbb{R}^{2}_{\geq 0}\longrightarrow[0,1]^{2}$ is the exponential projection \[email protected]. Then, $\ell$ is $\eta$ -mixable if and only if $E_{\eta}(\textnormal{spr}(\ell))$ is convex.
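The convexity condition can be probed directly with a chord test. The sketch below assumes that $E_{\eta}$ acts componentwise as $x\mapsto e^{-\eta x}$ (the usual exponential projection), uses the square loss $(1-t)^{2}, t^{2}$ in the standard parametrization as an example, and replaces exact membership by a grid search with a tolerance; all function names are hypothetical.

```python
import numpy as np

def square_partials(t):
    """Partial losses of the square loss in the standard parametrization."""
    return (1.0 - t) ** 2, t ** 2

def exp_image(partials, eta, ts):
    """Boundary curve t -> (exp(-eta*l1(t)), exp(-eta*l2(t))) of E_eta(spr)."""
    l1, l2 = partials(ts)
    return np.stack([np.exp(-eta * l1), np.exp(-eta * l2)], axis=1)

def looks_convex(partials, eta, n=101, tol=1e-2):
    """Chord test: the midpoint of any two boundary points must be dominated
    componentwise (up to tol) by some boundary point of the sampled curve."""
    ts = np.linspace(0.0, 1.0, n)
    g = exp_image(partials, eta, ts)                      # shape (n, 2)
    mid = 0.5 * (g[:, None, :] + g[None, :, :])           # shape (n, n, 2)
    # slack[i, j, k] = min over coordinates of g[k] - mid[i, j]
    slack = (g[None, None, :, :] - mid[:, :, None, :]).min(axis=3)
    return bool((slack.max(axis=2) >= -tol).all())
```

With these choices, `looks_convex(square_partials, 1.0)` passes the test and `looks_convex(square_partials, 4.0)` fails it, which is consistent with the square loss being mixable only for small enough $\eta$ . The tolerance only absorbs grid error, so the test is a heuristic and not a proof of convexity.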

Remark 2.9.

We stress the fact that this definition depends on the superprediction set of $\ell$ rather than on $\ell$ itself – two different loss functions with the same superprediction set will be equally mixable. From our perspective, when talking about mixability of the map $\ell$ (i.e., without making reference to the superprediction set), we can define it as follows. A loss $\ell$ is mixable if the 1-dimensional manifold $E_{\eta}\circ\ell(\textnormal{int}(\Delta^{2}))$ has signed curvature $\kappa^{+}_{E_{\eta}\circ\ell}\leq 0$ . We will adopt the latter version here. Although these definitions are clearly equivalent, it is useful to have this one at hand to relate mixability with properness. From now on, when we say $\ell$ is mixable we mean it in the latter sense. See Figure 5.

We close this part by describing the log loss, which will play an important role. Let $\ell_{\log}\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ , given by

[TABLE]

Let $\Phi=\Phi_{\textnormal{std}}$ . Then

[TABLE]

Since $\widetilde{\ell}_{\log}^{\prime}(t)=\left(-t^{-1},(1-t)^{-1}\right)$ , its canonical normal vector is

[TABLE]

The curvature with respect to $-\mathbf{n}_{\widetilde{\ell}_{\log}(t)}$ , the normal vector pointing towards $\mathbb{R}^{2}_{\geq 0}$ , is then given by

[TABLE]

When there is no risk of confusion we denote $\kappa^{+}_{\widetilde{\ell}_{\log}}$ simply as $\kappa^{+}_{\log}$ .

2.3. Mixability and curvature

Haussler, Kivinen and Warmuth in [HKW95] gave a characterization of the mixability constant of a mixable proper binary loss function $\ell$ in terms of the first and second derivatives of its partial losses. We reprove this characterization from a geometric point of view, that is, independent of the parametrization chosen for $\Delta^{n}$ .

Let $\ell\in\mathcal{L}$ be proper and $\Phi$ a 1-chart parametrization of $\Delta^{2}$ (that is, a map $\Phi\colon D\longrightarrow\Delta^{2}$ such that $\Phi(D)=\Delta^{2}$ ). Then $E_{\eta}(\textnormal{spr}(\ell))$ is convex if and only if the curve $\gamma(t)=E_{\eta}(\ell(\Phi(t)))$ has negative curvature with respect to the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ . Since $\ell$ is proper we can assume without loss of generality that $\kappa_{\ell}(p)=\kappa_{\ell}^{+}(p)>0$ . We are then interested in computing the signed curvature of

[TABLE]

and showing that $\kappa_{g}\geq 0$ . We have

[TABLE]

and

[TABLE]

and thus we have

[TABLE]

Note that the sign of $\widetilde{\ell}_{1}^{\prime}(t)\widetilde{\ell}_{2}^{\prime\prime}(t)-\widetilde{\ell}_{2}^{\prime}(t)\widetilde{\ell}_{1}^{\prime\prime}(t)$ is the sign of $\kappa_{\widetilde{\ell}}$ . If $\kappa_{\widetilde{\ell}}$ is positive, then one can check that $\widetilde{\ell}_{1}^{\prime}(t)>0$ and $\widetilde{\ell}_{2}^{\prime}(t)<0$ , thus the first term in brackets is necessarily negative. Thus by making $\eta$ large $\kappa_{g}(t)$ will become negative. Then we want

[TABLE]

that is,

[TABLE]

Treating the case of negative signed curvature in the same way, we obtain the following.

Lemma 2.10.

Suppose that $\ell\in\mathcal{L}$ is a proper loss function. Then, if $\ell$ is mixable, for any 1-chart parametrization $\Phi$ of $\Delta^{2}$ , the mixability constant is given by

[TABLE]

Conversely, if \[email protected] holds, then $\ell$ is mixable with mixability constant $\eta_{\ell}^{*}$ .
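As a numerical illustration, the sketch below evaluates the classical quotient of Haussler, Kivinen and Warmuth, $\inf_{t}\frac{\ell_{1}'\ell_{2}''-\ell_{2}'\ell_{1}''}{\ell_{1}'\ell_{2}'(\ell_{2}'-\ell_{1}')}$ , in the standard parametrization. That the displayed formula of Lemma 2.10 takes exactly this form is an assumption here (it is what one obtains by expanding the curvature of $E_{\eta}\circ\widetilde{\ell}$ and solving for $\eta$ ); the grid and the two example losses are also assumptions.

```python
import numpy as np

def mixability_constant(d1, d2, dd1, dd2, ts):
    """Infimum over the sample points ts of the quotient
    (l1' l2'' - l2' l1'') / (l1' l2' (l2' - l1'))."""
    a, b = d1(ts), d2(ts)
    num = a * dd2(ts) - b * dd1(ts)
    den = a * b * (b - a)
    return float(np.min(num / den))

# Log loss in the standard parametrization: (-log t, -log(1 - t)).
log_derivs = (
    lambda t: -1.0 / t,               # first derivative of l1
    lambda t: 1.0 / (1.0 - t),        # first derivative of l2
    lambda t: 1.0 / t ** 2,           # second derivative of l1
    lambda t: 1.0 / (1.0 - t) ** 2,   # second derivative of l2
)

# Square loss in the standard parametrization: ((1 - t)^2, t^2).
square_derivs = (
    lambda t: -2.0 * (1.0 - t),
    lambda t: 2.0 * t,
    lambda t: np.full_like(t, 2.0),
    lambda t: np.full_like(t, 2.0),
)

GRID = np.linspace(0.01, 0.99, 99)  # interior sample points, includes t = 1/2
```

On this grid, `mixability_constant(*log_derivs, GRID)` is 1 up to rounding, and `mixability_constant(*square_derivs, GRID)` is 2 up to rounding, since the quotient for the square loss simplifies to $1/(2t(1-t))$ and is minimized at $t=1/2$ .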

By the local nature of curvature, it would be possible to consider a “local version” of Lemma 2.10, which would characterize a “local” notion of mixability. This alternative will not be pursued here.

In [Vov15], Vovk observes that mixability for proper losses is equivalent to a quotient of curvatures being bounded away from zero. For the reader’s convenience we prove this statement. To recover Vovk’s statement observe that the properties he imposes on the loss functions imply that $\kappa^{+}$ is the signed curvature (see Section 1.6).

Lemma 2.11.

A proper loss function $\ell\in\mathcal{L}$ is mixable if and only if

[TABLE]

where $\kappa_{\log}^{+}$ denotes the curvature of $\ell_{\log}$ . Moreover, when this holds,

[TABLE]

Proof.

By Lemma 2.10, $\ell$ is mixable with mixability constant $\eta^{*}_{\ell}>0$ if and only if

[TABLE]

for any given 1-chart parametrization $\Phi$ . Setting $\Phi=\Phi_{\textnormal{std}}$ and using \[email protected], we have the following. For any $t\in\Phi^{-1}(\textnormal{int}(\Delta^{2}))$ ,

[TABLE]

where we used that by properness $\widetilde{\ell}_{2}^{\prime}(t)=-\frac{t}{1-t}\widetilde{\ell}_{1}^{\prime}(t)$ (see \[email protected]).

Since $\kappa^{+}$ is independent of the parametrization, we obtain the result. ∎

Remark 2.12.

Lemma 2.11 exemplifies the usefulness of $\Phi_{\textnormal{std}}$ . The curvature of $\ell_{\log}$ is easily computed with respect to the standard parametrization; by fixing $\Phi=\Phi_{\textnormal{std}}$ we can easily recognize when the curvature of $\ell_{\log}$ appears in our computation. However, since curvature is a geometric quantity, we know this relation between curvatures holds for any other parametrization as well.
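The parametrization independence of the curvature can be checked numerically. The sketch below computes the unsigned curvature of the log loss curve at the same point in two charts, the standard one and a hypothetical sigmoid chart $s\mapsto(\sigma(s),1-\sigma(s))$ , and compares both with the closed form $t(1-t)/(t^{2}+(1-t)^{2})^{3/2}$ . That closed form was derived by hand from $\widetilde{\ell}_{\log}^{\prime}(t)=(-t^{-1},(1-t)^{-1})$ for this example and should be treated as an assumption, as should the finite-difference step.

```python
import numpy as np

def unsigned_curvature(curve, u, h=1e-4):
    """|x'y'' - y'x''| / |(x', y')|^3 at parameter u, via central differences."""
    c_minus, c_zero, c_plus = curve(u - h), curve(u), curve(u + h)
    d1 = (c_plus - c_minus) / (2.0 * h)
    d2 = (c_plus - 2.0 * c_zero + c_minus) / h ** 2
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    return float(abs(cross) / np.hypot(d1[0], d1[1]) ** 3)

def log_curve_std(t):
    """Log loss curve in the standard chart t -> (t, 1 - t)."""
    return np.array([-np.log(t), -np.log(1.0 - t)])

def log_curve_sigmoid(s):
    """Same curve in the sigmoid chart: -log(sigmoid(s)) = log(1 + e^{-s})."""
    return np.array([np.log1p(np.exp(-s)), np.log1p(np.exp(s))])

def log_curvature_closed_form(t):
    """Hand-derived curvature of the log loss curve at Phi_std(t)."""
    return t * (1.0 - t) / (t ** 2 + (1.0 - t) ** 2) ** 1.5
```

Evaluating `unsigned_curvature(log_curve_std, 0.3)` and `unsigned_curvature(log_curve_sigmoid, np.log(0.3 / 0.7))` gives the same value up to discretization error, because both parameters correspond to the point $p=(0.3,0.7)$ .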

Using this point of view, the following observations clarify why the weight function in [BSS05] and in [RW10] essentially encodes all the relevant information in the binary case. Recall that given a proper loss function $\ell$ , the weight of $\ell$ (with respect to a local parametrization $\Phi$ of $\Delta^{2}$ ) is defined as

[TABLE]

We stress that the weight depends on the coordinates $\Phi$ of $\Delta^{2}$ that we use, and hence we use the notation $w_{\ell_{\Phi}}$ . As observed in Remark 2.12, we sometimes set $\Phi=\Phi_{\textnormal{std}}$ (as is done in [BSS05, RW10]) to be able to recognize some terms.

Lemma 2.13.

Let $\ell\in\mathcal{L}$ be a proper loss and $\Phi$ a local parametrization of $\Delta^{2}$ . Denote by $\widetilde{\ell}_{\Phi}$ its local expression and by $w_{\ell_{\Phi}}$ its weight. Then we have for any $t\in\Phi^{-1}(\textnormal{int}(\Delta^{2}))$ ,

[TABLE]

and moreover, if $\lambda$ is another proper loss,

[TABLE]

In particular, when $\Phi=\Phi_{\textnormal{std}}$ ,

[TABLE]

and if in addition, $\lambda=\ell_{\log}$ (with $\Phi=\Phi_{\textnormal{std}}$ ),

[TABLE]

Proof.

Let $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ be a proper loss and let $\Phi$ be any parametrization of $\Delta^{2}$ around $p$ . Let us compute $\kappa_{\widetilde{\ell}_{\Phi}}^{+}$ (assuming w.l.o.g. that $\kappa_{\widetilde{\ell}_{\Phi}}^{+}=\kappa_{\widetilde{\ell}_{\Phi}}$ , which means $\widetilde{\ell}_{1}^{\prime}>0$ and $\Phi_{1}^{\prime}<0$ ).

[TABLE]

where we have used that by properness we know that $\langle\widetilde{\ell}^{\prime}(t),\Phi(t)\rangle=0$ (\[email protected]), which implies $\langle\widetilde{\ell}^{\prime\prime}(t),\Phi(t)\rangle=-\langle\widetilde{\ell}^{\prime}(t),\Phi^{\prime}(t)\rangle$ by differentiating with respect to $t$ from the third to the fourth equality, and that $\Phi_{1}^{\prime}(t)+\Phi_{2}^{\prime}(t)=0$ since $\Phi(t)\in\Delta^{2}$ from the third to last to the second to last equality.

Notice that in the last equation of the previous string of equalities, the only term involving $\ell$ is $\widetilde{\ell}_{1}^{\prime}$ (or more precisely $w_{\ell_{\Phi}}(t)$ ) and the remaining terms depend only on the parametrization $\Phi$ . Then we obtain

[TABLE]

The remaining statements follow from setting $\Phi=\Phi_{\textnormal{std}}$ and \[email protected]. ∎

Remark 2.14.

Combining Lemma 2.11 and \[email protected], we recover the characterization of the mixability constant in terms of the quotient of weights obtained by van Erven–Reid–Williamson in [vERW12, Section 4.1]. However, for the corresponding statement involving the quotient of second derivatives of the Bayes risks, the fact that $\Delta^{2}$ has an affine parametrization is important. Indeed, this relies on Corollary 3 in [RW10], which states that $w(t)=-\widetilde{\underline{L}}^{\prime\prime}(t)$ . In general, it can be checked that

[TABLE]

which reduces to $w(t)=-\widetilde{\underline{L}}^{\prime\prime}(t)$ when $\Phi=\Phi_{\textnormal{std}}$ . From the point of view of the present work, $\underline{L}$ (or a quotient of them) is not a good quantity to consider since it strongly depends on coordinates. However, notice that if one restricts to affine parametrizations of $\Delta^{2}$ then $\widetilde{\underline{L}}^{\prime\prime}(t)$ depends on $\widetilde{\ell}_{1}^{\prime}(t)^{2}$ and $\kappa_{\widetilde{\ell}}(t)$ and hence in view of Lemma 2.13 restricting to a fixed affine parametrization of $\Delta^{2}$ will make quotients of the second derivative of the Bayes risk well behaved.
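The identity $w(t)=-\widetilde{\underline{L}}^{\prime\prime}(t)$ in the standard parametrization can be checked numerically. The sketch below assumes that the weight is $w(t)=|\widetilde{\ell}_{2}^{\prime}(t)/\Phi_{1}(t)|$ with $\Phi_{1}(t)=t$ (the form used at the end of the proof of Theorem 2.24) and that the Bayes risk is $\widetilde{\underline{L}}(t)=\langle\Phi_{\textnormal{std}}(t),\widetilde{\ell}(t)\rangle$ ; the example losses and the finite-difference step are also assumptions.

```python
import numpy as np

def bayes_risk(partials, t):
    """<Phi_std(t), loss(t)> = t * l1(t) + (1 - t) * l2(t)."""
    l1, l2 = partials(t)
    return t * l1 + (1.0 - t) * l2

def second_difference(f, t, h=1e-4):
    """Central finite-difference approximation of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h ** 2

def weight_std(d2, t):
    """Assumed weight |l2'(t) / t| in the standard parametrization."""
    return np.abs(d2(t) / t)

def log_partials(t):
    return -np.log(t), -np.log(1.0 - t)

def square_partials(t):
    return (1.0 - t) ** 2, t ** 2

def log_d2(t):
    return 1.0 / (1.0 - t)   # derivative of -log(1 - t)

def square_d2(t):
    return 2.0 * t           # derivative of t^2
```

For the log loss the weight is $1/(t(1-t))$ and the Bayes risk is the binary entropy; for the square loss the weight is constant equal to 2 and the Bayes risk is $t(1-t)$ . In both cases the finite-difference second derivative of the Bayes risk matches minus the weight.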

Let us make some remarks about Lemma 2.13.

•

Let $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ be a given strictly proper, fair loss function. Given a parametrization, we obtain a weight $w_{\ell_{\Phi}}$ given by \[email protected], that is, the weight depends on the parametrization.

•

The curvature of $\ell$ is independent of $\Phi$ up to a sign. However, when defining $\kappa_{\ell}^{+}$ we made the choice of the sign in a uniform way, thus the curvature is independent of the parametrization for the family of losses considered here. Then it follows that the quotient of curvatures is independent of the parametrization and by \[email protected], it also follows that the quotient of weights is also independent of the coordinates (despite the weights being coordinate dependent themselves).

•

A corresponding notion of weight in higher dimensions (for the multi-class case) is considerably more complicated, and it is unclear whether using it would lead to successful results. One higher dimensional analog of curvature is readily seen to be the so-called “principal curvatures” of a hypersurface in Euclidean space (see Appendix A). This will be the main motivation when dealing with the multi-class case (Section 3). Alternative ways to characterize proper higher dimensional loss functions have been studied in [WVR16].

2.4. Geometric comparison of loss functions

Fix a proper, fair loss function $\lambda\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ . Given another proper, fair loss function $\ell\neq\lambda$ , how might we compare them? From the point of view of differential geometry, since given $p$ the normal vectors at $\lambda(p)$ and $\ell(p)$ coincide, it is natural to look at their curvatures. Motivated by Lemma 2.11, we impose (for the moment) the condition

[TABLE]

Note that this implies that $\kappa_{\ell}^{+}(p)\geq\kappa_{\lambda}^{+}(p)$ for all $p\in\Delta^{2}$ . We divide the comparison in steps for clarity.

(1)

Expressing $\lambda(\Delta^{2})$ as a function. Note that since $\lambda$ is proper and fair, the normal vector to a point $\lambda(p)$ can only be $(1,0)$ when $p=(1,0)$ (i.e., when evaluating $\lambda$ at the boundary of $\Delta^{2}$ ). Thus, the set $\lambda(\textnormal{int}(\Delta^{2}))$ can be expressed as a graph over the $x$ -axis. To obtain an explicit expression let $\Phi=\Phi_{\textnormal{std}}$ . We use the fact that $\widetilde{\lambda}_{1}\colon(0,1)\longrightarrow(0,l_{1})$ (where $l_{1}$ could be infinity) is invertible. Then, we have that

[TABLE]

where $f(x)=\lambda_{2}(\widetilde{\lambda}_{1}^{-1}(x),1-\widetilde{\lambda}_{1}^{-1}(x))$ .

(2)

Translating and parametrizing $\ell(\Delta^{2})$ . Let $p_{0}\in\textnormal{int}(\Delta^{2})$ with $\kappa_{\ell}^{+}(p_{0})>\kappa_{\lambda}^{+}(p_{0})$ ; if no such $p_{0}$ exists, then $\ell=\lambda$ . We define $\ell^{0}\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ by $\ell^{0}(p)=\ell(p)+[\lambda(p_{0})-\ell(p_{0})]$ , i.e., we translate $\ell$ so that it coincides with $\lambda$ at $\lambda(p_{0})$ . ( $\ell^{0}$ is not fair anymore; however, the curvature is invariant under translations.)

We now parametrize $\ell^{0}(\Delta^{2})$ as the graph of a function $g$ defined on an interval $I_{0}$ around $x_{0}$ (the $x$ -coordinate of $\lambda(p_{0})$ ), “aligning” it with $\lambda$ (we can assume this interval to be maximal). We let $g(x)=\ell_{2}^{0}((\widetilde{\ell}_{1}^{0})^{-1}(x),1-(\widetilde{\ell}_{1}^{0})^{-1}(x))$ . Since $\kappa_{\ell}^{+}(p_{0})>\kappa_{\lambda}^{+}(p_{0})$ , we know that around $x_{0}$ the graph of $g$ is to the northeast of $f$ .

(3)

Comparison. If the graph of $g$ is to the northeast of $f$ on the whole $I_{0}$ , then we see that the superprediction set of $\ell^{0}$ is contained in that of $\lambda$ . If this does not hold, it means that there is $x_{1}\in I_{0}$ such that $f(x_{1})=g(x_{1})$ , and w.l.o.g. we can assume $x_{1}>x_{0}$ . Thus we know that $g(x)-f(x)\geq 0$ on $[x_{0},x_{1}]$ and $g(x)-f(x)=0$ on $\{x_{0},x_{1}\}$ , i.e., the boundary of $[x_{0},x_{1}]$ . Define the second order operator which computes the curvature of the graph $(x,h(x))$ (see \[email protected]):

[TABLE]

Since $\kappa_{\ell}^{+}(p_{0})>\kappa_{\lambda}^{+}(p_{0})$ , we see that $L(g-f)\geq 0$ on $[x_{0},x_{1}]$ . The maximum principle now implies that the supremum of $g-f$ is attained on the boundary of $[x_{0},x_{1}]$ , and hence $f(x)=g(x)$ on $[x_{0},x_{1}]$ , which is a contradiction. Thus the superprediction set of $\ell^{0}$ is contained in the superprediction set of $\lambda$ (see Section 4).

More generally, if we assume instead that

[TABLE]

for some $\eta>0$ , we see (see Appendix A) that $\ell_{\eta}(p)=\eta\ell(p)$ satisfies

[TABLE]

That is, we can reproduce the previous analysis with $\ell_{\eta}$ instead of $\ell$ .

The previous discussion motivates right away a comparison between proper, fair loss functions.

Definition 2.15.

Let $\lambda\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ be a proper, fair loss in $\mathcal{L}$ , which we call a base loss. We say that a proper, fair loss $\ell\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ is mixable with respect to $\lambda$ if

[TABLE]

2.5. Mixability and fundamentality as comparison to the log loss

Now, suppose $\ell\in\mathcal{L}$ is proper and fair. Thus, in particular $\kappa_{\ell}^{+}(p)>0$ for all $p\in\textnormal{int}(\Delta^{2})$ . We want to think of mixability as a geometric comparison to the log loss as suggested by Vovk in [Vov15] and give a detailed interpretation of this comparison. We fix the standard parametrization of $\Delta^{2}$ , $\Phi=\Phi_{\textnormal{std}}\colon[0,1]\longrightarrow\Delta^{2}$ , given by

[TABLE]

The log loss in these coordinates is thus given by

[TABLE]

and by \[email protected], its curvature with respect to the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ is given by

[TABLE]

Notice that $\kappa_{\widetilde{\ell}_{\log}}^{+}(t)>0$ for all $t\in(0,1)$ and $\kappa_{\widetilde{\ell}_{\log}}^{+}(t)\to 0$ as $t\to 0$ or $t\to 1$ . Thus, clearly by Lemma 2.7, for any proper subinterval $C$ of $[0,1]$ (cf. [Vov15, Corollary 2]), we have

[TABLE]

Thus, whether a proper, fair loss function $\ell$ is mixable or not depends on the behavior of the quotient $\kappa_{\ell}^{+}(p)/\kappa_{\log}^{+}(p)$ as $p$ approaches $(0,1)$ and $(1,0)$ . More precisely, we have obtained the following.

Lemma 2.16.

Let $\ell\in\mathcal{L}$ be a proper loss. Then $\ell$ is mixable if and only if

[TABLE]

Motivated by this we make the following definition.

Definition 2.17.

Let $\ell$ be a proper, fair loss function in $\mathcal{L}$ , and $\Phi=\Phi_{\textnormal{std}}$ be the standard parametrization of $\Delta^{2}$ . We say that $\ell$ is $(B_{1},B_{2})$ -logarithmic at the boundary if

[TABLE]

Let us analyze what this means. Suppose that $\ell$ is proper and $(B_{1},B_{2})$ -logarithmic. Then for any $t\in(0,1)$ , using \[email protected] in Lemma 2.13 and \[email protected], we have

[TABLE]

Notice that as $t\to 0^{+}$ ,

[TABLE]

and similarly,

[TABLE]

that is, we are only comparing the rate at which $\ell_{i}$ , $i=1,2$ , go to 0 (since they do by fairness) with the rate at which the log loss does.

In [Vov15], Vovk defines a loss function $\lambda^{*}$ to be fundamental if, given a (computable, proper, mixable) loss function $\lambda$ and a data sequence $\zeta\in\mathbb{Z}^{\infty}$ that is random under $\lambda^{*}$ with respect to a prediction algorithm $F$ , the sequence is also random under $\lambda$ with respect to $F$ . He shows that a fair, mixable $\ell\in\mathcal{L}$ is fundamental if and only if (using the notation in [Vov15])

[TABLE]

Since we have seen that mixability can be regarded as a comparison of the curvatures of the loss curve of $\ell$ and that of $\ell_{\log}$ , and we have reinterpreted fundamentality as a comparison of $\ell$ and $\ell_{\log}$ near the boundary, building on Definition 2.15 we can easily come up with a notion of $\lambda$ -fundamentality.

Definition 2.18.

Let $\lambda$ be a proper, fair loss function in $\mathcal{L}$ . We say that a proper, fair loss function $\ell\in\mathcal{L}$ is $\lambda$ -fundamental if

•

$\ell$ is mixable with respect to $\lambda$ , and

•

when $\Phi=\Phi_{\textnormal{std}}$ , we have

[TABLE]

Suppose now that a mixable loss function $\ell\in\mathcal{L}$ is fundamental. Then there exist $\eta,\gamma>0$ such that

[TABLE]

for all $p\in\textnormal{int}(\Delta^{2})$ . This implies that

[TABLE]

for all $p\in\textnormal{int}(\Delta^{2})$ , which readily implies (Appendix A) that

[TABLE]

for all $p\in\textnormal{int}(\Delta^{2})$ .

Rephrasing the previous discussion we have obtained the following characterization of fundamentality.

Theorem 2.19.

A loss function $\ell\in\mathcal{L}$ is fundamental if and only if there exist numbers $\eta,\gamma>0$ , such that for any $p\in\textnormal{int}(\Delta^{2})$ , there are translation vectors $x_{p}$ and $y_{p}$ in $\mathbb{R}^{2}_{\geq 0}$ such that

[TABLE]

2.6. Constructing new mixable losses from previous ones

We now observe how mixability helps us to construct new proper, fair and mixable loss functions from previous proper, fair and mixable losses. We first define a family of losses that will serve to illustrate the idea. We set $\Phi=\Phi_{\textnormal{std}}$ and $\lambda=\ell_{\log}$ . Let $a>0$ and define the loss function $\lambda^{a}\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$

[TABLE]

It can be readily checked that $\kappa_{\widetilde{\lambda^{a}}}(t)=a^{-1}\kappa_{\lambda}^{+}(t)$ , thus since

[TABLE]

it follows that $\lambda^{a}$ is 1-mixable for $a\leq 1$ and it is not if $a>1$ . Note that $\lambda^{a}$ is still proper and fair. Take then $a<1$ ; we can readily see that there exists a proper, fair and mixable loss function $\lambda^{*}$ such that

[TABLE]

Indeed, $\lambda^{*}=\lambda-\lambda^{a}=\lambda^{1-a}$ , which is fair, proper and 1-mixable.

This process works in a more general setting than scalings of $\lambda$ . Consider for example the spherical loss $\sigma$ defined in coordinates by

[TABLE]

It can be easily checked that this is bounded, proper and fair and that $\kappa_{\widetilde{\sigma}}(t)=1$ . Thus

[TABLE]

thus $\sigma$ is 1-mixable. Thus, as before, there is a loss function $\ell^{*}$ such that $\lambda=\sigma+\ell^{*}$ . Moreover, the loss function given (in coordinates) by

[TABLE]

can be seen to be unbounded, proper, fair and mixable.

We close this part with the following observation. Suppose that $\ell$ is a proper, fair, mixable loss function with mixability constant $\eta>0$ . Then the loss function $\ell^{\eta}=\eta\ell$ is 1-mixable. Thus, there exists a proper, fair, mixable loss $\ell^{*}$ such that

[TABLE]

As we will see in Section 4, the previous observation can be interpreted from the point of view of the superprediction sets of the involved loss functions and convex geometry: $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\lambda)$ (see Theorem 4.23).
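The sliding picture can be probed numerically for a concrete pair of losses. The sketch below uses the elementary description $\textnormal{spr}(\ell_{\log})=\{(x,y): e^{-x}+e^{-y}\leq 1\}$ , the square loss $((1-t)^{2},t^{2})$ as an example, and translates $\eta\ell$ so that it touches the log loss curve at the point with parameter $t_{0}$ . The function names, the grid and the tolerance are assumptions made for illustration.

```python
import numpy as np

def log_loss_std(t):
    """Log loss curve in the standard parametrization."""
    return np.stack([-np.log(t), -np.log1p(-t)], axis=-1)

def square_loss_std(t):
    """Square loss curve in the standard parametrization."""
    return np.stack([(1.0 - t) ** 2, t ** 2], axis=-1)

def in_log_superprediction_set(points, tol=1e-9):
    """Membership test e^{-x} + e^{-y} <= 1 (up to tol) for rows of points."""
    return np.exp(-points).sum(axis=-1) <= 1.0 + tol

def translated_curve(loss, eta, t0, ts):
    """Curve of eta*loss, translated to meet the log loss curve at parameter t0."""
    shift = log_loss_std(t0) - eta * loss(t0)
    return eta * loss(ts) + shift

def slides_inside(loss, eta, t0, n=201):
    """True if every sampled point of the translated curve lies in spr(log loss)."""
    ts = np.linspace(0.0, 1.0, n)
    return bool(in_log_superprediction_set(translated_curve(loss, eta, t0, ts)).all())
```

With $\eta=2$ the translated square loss curve stays inside $\textnormal{spr}(\ell_{\log})$ for the contact points $t_{0}=0.2, 0.5, 0.8$ , while with $\eta=4$ and $t_{0}=0.5$ some sampled points leave the set. Since the superprediction set of the translated loss is the upward closure of its curve, checking the curve is enough for this illustration.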

2.7. Composite losses and the canonical link

In this part we discuss composite losses following [RW10]. Let us recall their setting. Let $\mathcal{V}\subset\mathbb{R}$ be a set of prediction values. A link function is a continuous map $\psi\colon[0,1]\longrightarrow\mathcal{V}$ . Given a loss function $\widetilde{\varrho}\colon\{0,1\}\times[0,1]\longrightarrow\mathbb{R}$ and assuming $\mathcal{V}=\mathbb{R}$ , if $\psi$ is invertible, we define the composite loss $\widetilde{\varrho}^{\psi}$ as

[TABLE]

Definition 2.20.

A composite loss $\widetilde{\varrho}^{\psi}$ is a proper composite loss if $\widetilde{\varrho}$ is a proper loss in the sense of [RW10].

Recall that in [RW10], $\Phi=\Phi_{\textnormal{std}}$ is implicitly assumed. Then, given a loss function $\widetilde{\varrho}$ (in the [RW10] sense), we can construct a loss function $\varrho\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}_{\geq 0}$ , by $\varrho=\widetilde{\varrho}\circ\Phi_{\textnormal{std}}^{-1}$ . Then, the composite loss $\widetilde{\varrho}^{\psi}$ can be expressed as

[TABLE]

In other words, the composite loss $\widetilde{\varrho}^{\psi}$ is the local expression of $\varrho$ with respect to the parametrization $\Phi=\Phi_{\textnormal{std}}\circ\psi^{-1}$ of $\Delta^{2}$ . We denote the local expression of $\varrho$ with respect to $\Phi$ by $\widehat{\varrho}$ , that is, $\widehat{\varrho}\coloneqq\varrho\circ\Phi_{\textnormal{std}}\circ\psi^{-1}=\widetilde{\varrho}\circ\psi^{-1}$ .

To show how this reconciliation of terms works, we obtain a result similar to Corollary 12 in [RW10]. Suppose that a composite loss $\widetilde{\varrho}^{\psi}$ is given and that it has differentiable partial losses (i.e., the corresponding loss $\varrho$ is in $\mathcal{L}$ ). Furthermore, we assume that $\psi$ is a diffeomorphism, which in one dimension means it is strictly monotonic. Then we know that $\widetilde{\varrho}^{\psi}$ is strictly proper if and only if $\varrho$ is strictly proper (by definition). This implies that $p$ is normal to $\varrho(\Delta^{2})$ at $\varrho(p)$ for all $p\in\textnormal{int}(\Delta^{2})$ and its curvature is positive (with respect to the unit normal pointing towards $\mathbb{R}^{2}_{\geq 0}$ ). This means that for all $v\in\mathcal{V}$ ,

[TABLE]

where we have used that $\psi$ is a diffeomorphism and that $\Phi_{1}+\Phi_{2}=1$ for all parametrizations $\Phi$ of $\Delta^{2}$ . Therefore, we have

[TABLE]

that is

[TABLE]

for all $v\in\mathcal{V}$ .

Since we are working with valid reparametrizations, the choice of $\psi$ will not affect the curvature of $\varrho$ . Hence we obtain

Corollary 2.21.

A composite loss $\widetilde{\varrho}^{\psi}$ is strictly proper if and only if $\varrho\in\mathcal{L}$ is strictly proper and $\psi$ satisfies

[TABLE]

for all $v\in\mathbb{R}$ .

Remark 2.22.

We have seen that whether a loss function $\ell\in\mathcal{L}$ is strictly proper or not depends on whether conditions \[email protected] and \[email protected] hold or not. Notice that under an (admissible) change of coordinates, for example given by a link $\psi$ , \[email protected] will not be modified. However, \[email protected] might change (since, in a way, we are changing the “velocity” at which we move on $\ell(\Delta^{2})$ ). Hence, Corollary 2.21 gives us a way to define the set of admissible links (or reparametrizations of $\Delta^{2}$ ) given a loss function $\ell$ and the standard parametrization of $\Delta^{2}$ . In this case, the new parametrization is given by $\Phi=\Phi_{\textnormal{std}}\circ\psi^{-1}$ .

For applications, it is desirable to be able to work with a given composite loss $\widetilde{\varrho}^{\psi}$ and, moreover, to have convexity of the partial losses $\widetilde{\varrho}^{\psi}_{1}$ and $\widetilde{\varrho}^{\psi}_{2}$ . From our point of view, we see $\widetilde{\varrho}^{\psi}$ as the local expression of some $\varrho\colon\Delta^{2}\longrightarrow\mathbb{R}^{2}$ , so that $\widehat{\varrho}\coloneqq\varrho\circ\Phi=\varrho\circ\left(\Phi_{\textnormal{std}}\circ\psi^{-1}\right)=\left(\varrho\circ\Phi_{\textnormal{std}}\right)\circ\psi^{-1}=\widetilde{\varrho}\circ\psi^{-1}$ .

Let us work with the partial losses separately:

[TABLE]

Proceeding as in the proof of Lemma 2.4, properness implies

[TABLE]

or, equivalently,

[TABLE]

Therefore, we can define $w$ as

[TABLE]

where $w_{\widetilde{\varrho}}$ is the weight of $\widetilde{\varrho}$ , we can rewrite the derivatives of the partial losses of $\widehat{\varrho}$ as

[TABLE]

Taking second derivatives we have

[TABLE]

A way to guarantee both expressions are positive is as follows. Assume w.l.o.g. that $(\psi^{-1})^{\prime}>0$ . Since we are assuming $w>0$ , $\widehat{\varrho}_{2}$ is increasing and $\widehat{\varrho}_{1}$ is decreasing (also we have $\Phi_{1}$ is increasing and $\Phi_{2}$ is decreasing). We readily see that imposing

[TABLE]

for all $v\in\mathbb{R}$ , is enough to guarantee both second derivatives to be strictly positive.

Definition 2.23.

Given $\varrho\in\mathcal{L}$ strictly proper, we define the canonical link $\psi$ as the link defined by

[TABLE]

for $v\in\mathcal{V}$ , where $w$ is defined in \[email protected].

The differential equation \[email protected] can be seen as a separable ordinary differential equation, which is solvable for loss functions in $\mathcal{L}$ .

To give a geometric meaning, we look at the norm of the velocity of the loss curve $\alpha(v)=\widehat{\varrho}(v)$ .

[TABLE]

By assuming $w(s)(\psi^{-1})^{\prime}(s)=1$ and $\Phi=\Psi$ , we have

[TABLE]

Thus the canonical link gives a parametrization of $\Delta^{2}$ such that $\widehat{\varrho}$ is a curve whose velocity vector at $v$ has norm equal to the length of the vector $\Phi(\psi^{-1}(v))$ . In other words, it is a parametrization of the loss curve $\varrho(\textnormal{int}(\Delta^{2}))$ such that for $\varrho(p)=\widehat{\varrho}(v)\in\varrho(\textnormal{int}(\Delta^{2}))$ , the tangent vector at the point has length $|p|$ . We close this discussion with a characterization of the canonical link.

Theorem 2.24.

Let $\varrho\in\mathcal{L}$ be a strictly proper loss function and $\psi$ its canonical link. The reparametrization of $\varrho$ determined by its canonical link is a parametrization of $\varrho(\textnormal{int}(\Delta^{2}))$ with weight equal to 1.

Proof.

Let $\widehat{\varrho}=\varrho\circ(\Phi_{\textnormal{std}}\circ\psi^{-1})=\widetilde{\varrho}\circ\psi^{-1}$ be the reparametrization of $\varrho(\textnormal{int}(\Delta^{2}))$ determined by the canonical link. Since

[TABLE]

for all $v\in\mathcal{V}$ , and from Definition 2.23

[TABLE]

for all $v\in\mathcal{V}$ , we have

[TABLE]

Thus $w_{\widehat{\varrho}}(v)=\left|\frac{\widehat{\varrho}_{2}^{\prime}(v)}{\Phi_{1}(v)}\right|=1$ . ∎

3. Mixability for Multi-Class Classification

Now we focus our attention on multi-class classification loss functions, that is, maps $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ given by the partial losses

$$\ell(p)=(\ell_{1}(p),...,\ell_{n}(p)).$$

Our main goal is to interpret mixability as a geometric comparison of a given loss function $\ell$ to the log loss, as we did for the binary case. As suggested by the comments after Remark 2.11, the extra work of characterizing properness and mixability in a geometric way (coordinate independent) will pay off since to carry out the comparison we will look at the scalar second fundamental forms of $\ell(\textnormal{int}(\Delta^{n}))$ and $\ell_{\log}(\textnormal{int}(\Delta^{n}))$ . The scalar second fundamental form measures how a Riemannian manifold curves inside an “ambient space”, in this case how $\ell(\textnormal{int}(\Delta^{n}))$ curves inside $\mathbb{R}^{n}$ (see Appendix A for details).

The definition of $\mathcal{L}$ (Definition 2.1) can be extended to higher dimensions.

Definition 3.1.

An admissible loss function is a map $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ such that

(i) $\ell(\textnormal{int}(\Delta^{n}))\subset\mathbb{R}^{n}_{\geq 0}$ is an $(n-1)$ -manifold of class $C^{2}$ ,

(ii) there exists a differentiable map $\mathbf{n}:\ell(\textnormal{int}(\Delta^{n}))\to N\ell(\textnormal{int}(\Delta^{n}))$ , $\mathbf{n}(\ell(p))=\mathbf{n}_{\ell(p)}$ , where $N\ell(\textnormal{int}(\Delta^{n}))$ is the normal space of $\ell(\textnormal{int}(\Delta^{n}))$ , and

(iii) $\mathbf{n}(p)$ or $-\mathbf{n}(p)$ belongs to $\mathbb{R}^{n}_{>0}$ for all $p\in\textnormal{int}(\Delta^{n})$ .

We denote the set of admissible loss functions as $\mathcal{L}_{n}$ , or simply $\mathcal{L}$ when the dimension is clear from context.

We fix the log loss and denote it for convenience by $\lambda\coloneqq\ell_{\log}\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ , as the map

$$\lambda(p)=(-\ln p_{1},...,-\ln p_{n}),$$

for $p=(p_{1},...,p_{n})\in\Delta^{n}$ .
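As a quick numerical illustration of the log loss and of properness, the following minimal sketch evaluates the conditional risk $L(p,q)=\langle p,\lambda(q)\rangle$ on a coarse grid of the simplex for $n=3$ and checks that it is minimized at $q=p$ . The helper names `conditional_risk` and `simplex_grid` are hypothetical and introduced only for this illustration.

```python
import math

def log_loss(p):
    """Partial losses of the log loss: lambda(p) = (-ln p_1, ..., -ln p_n)."""
    return [-math.log(pi) for pi in p]

def conditional_risk(p, q, loss=log_loss):
    """L(p, q) = <p, loss(q)>: expected loss of predicting q when outcomes follow p."""
    return sum(pi * li for pi, li in zip(p, loss(q)))

def simplex_grid(steps):
    """Interior grid points of the 3-outcome simplex (hypothetical helper)."""
    pts = []
    for i in range(1, steps):
        for j in range(1, steps - i):
            k = steps - i - j
            pts.append((i / steps, j / steps, k / steps))
    return pts

p = (0.2, 0.5, 0.3)
grid = simplex_grid(10)

# Properness: among all grid predictions q, the risk L(p, q) is smallest at q = p.
best = min(grid, key=lambda q: conditional_risk(p, q))
print(best)
```

A grid search only probes finitely many predictions, so this is a sanity check of properness at one point $p$ rather than a proof.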

Let $\ell\in\mathcal{L}_{n}$ and consider a parametrization $\Phi\colon D\subset\mathbb{R}^{n-1}\longrightarrow\Delta^{n}$ of $\Delta^{n}$ around $p\in\textnormal{int}(\Delta^{n})$ . The local expression of the conditional risk (using the parametrization $\Phi\times\Phi$ of $\Delta^{n}\times\Delta^{n}$ around $(p,p)$ ) is given by

[TABLE]

where $t=(t_{1},...,t_{n-1}),s=(s_{1},...,s_{n-1})\in D$ and $\widetilde{\ell}=\ell\circ\Phi$ .

Requiring $\ell$ to be proper implies that when fixing $t$ , $s=t$ is a critical point of $\widetilde{L}(t,\cdot)$ , that is,

[TABLE]

for all $i\in\{1,...,n-1\}$ . Note that since the tangent space of $M_{\ell}$ at $\widetilde{\ell}(t)$ , $T_{\widetilde{\ell}(t)}\widetilde{\ell}(U)$ , is generated by $\{\partial_{s_{1}}\widetilde{\ell}(t),...,\partial_{s_{n-1}}\widetilde{\ell}(t)\}$ , we conclude that $\Phi(t)$ is a normal vector. In other words, as before, we have

[TABLE]

for all $p\in\textnormal{int}(\Delta^{n})$ .
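The orthogonality between $p$ and the tangent space of the loss surface can be checked numerically. The sketch below does so for the log loss with $n=3$ , using the standard parametrization and central finite differences; the function names and the step size `h` are assumptions made for this illustration.

```python
import math

def log_loss(p):
    """Log loss: lambda(p) = (-ln p_1, ..., -ln p_n)."""
    return [-math.log(pi) for pi in p]

def phi_std(t):
    """Standard parametrization: (t_1, ..., t_{n-1}) -> (t_1, ..., t_{n-1}, 1 - sum t_i)."""
    return list(t) + [1.0 - sum(t)]

def tangent_vectors(loss, t, h=1e-6):
    """Central finite differences of loss(phi_std(t)) along each coordinate of t."""
    vecs = []
    for i in range(len(t)):
        up = list(t)
        up[i] += h
        dn = list(t)
        dn[i] -= h
        lu, ld = loss(phi_std(up)), loss(phi_std(dn))
        vecs.append([(a - b) / (2 * h) for a, b in zip(lu, ld)])
    return vecs

t = (0.2, 0.5)
p = phi_std(t)

# <p, d/ds_i (loss o Phi)(t)> should vanish for every i, so p is normal to the surface.
dots = [sum(pi * vi for pi, vi in zip(p, v)) for v in tangent_vectors(log_loss, t)]
print(dots)
```

Both inner products vanish up to finite-difference error, which is consistent with $p/|p|$ being a unit normal at $\lambda(p)$ .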

The fact that $\widetilde{L}(t,\cdot)$ achieves a minimum at $s=t$ (at interior points) is equivalent to requiring that the Hessian, $D^{2}\widetilde{L}$ , is positive definite at $s=t$ . The Hessian of $\widetilde{L}(t,\cdot)$ at $s=t$ is given by

[TABLE]

The next step is to relate $[D^{2}\widetilde{L}]_{ij}(t)$ to the scalar second fundamental form $h$ of $M_{\ell}=\ell(\Delta^{n})$ (see Appendix A for its definition). More precisely, we compute $h$ with respect to a local parametrization $\Phi$ of $\Delta^{n}$ , i.e., we obtain the matrix $[h_{ij}]$ representing $h$ . To do this we need to compute the second derivatives of its parametrization $\widetilde{\ell}=\ell\circ\Phi$ (Appendix A). Since

[TABLE]

we have

[TABLE]

The scalar second fundamental form (with respect to the normal vector pointing towards $\mathbb{R}^{n}_{\geq 0}$ ) is then given by

[TABLE]

for $i,j=1,...,n-1$ , thus if $[D^{2}\widetilde{L}]_{ij}(s)$ is positive definite, then the matrix $[h_{ij}](s)$ is positive definite. In this case its eigenvalues are strictly positive and hence, the principal curvatures of $M_{\ell}$ at $\ell(s)$ (see Appendix A), $\kappa_{i}^{+}(s)$ (with respect to the unit normal pointing towards $\mathbb{R}^{n}_{\geq 0}$ ) are all positive. Therefore, using a similar reasoning as we did in the case $n=2$ , we have obtained the following geometric characterization of properness (by following the same arguments as in Section 2).

Lemma 3.2.

Let $\ell\in\mathcal{L}_{n}$ . Then $\ell$ is strictly proper if and only if $\mathbf{n}_{\ell}(p)=\pm p/|p|$ and the principal curvatures of $M_{\ell}$ at $\ell(p)$ , $\kappa_{i}^{+}(p)$ ( $i=1,...,n-1$ ), are strictly positive for all $p\in\textnormal{int}(\Delta^{n})$ .

We briefly explain how the comparison of scalar second fundamental forms will be performed. We follow a similar procedure as the one described in Section 2.4 for the case $n=2$ .

(1) We establish that given a proper loss function $\ell\in\mathcal{L}_{n}$ , around every $p^{*}\in\textnormal{int}(\Delta^{n})$ , $\ell(\textnormal{int}(\Delta^{n}))$ can be parametrized as a graph of a function $f$ defined on a neighborhood around some $x^{*}\in\mathbb{R}^{n-1}$ such that $(x^{*},f(x^{*}))=\ell(p^{*})$ . We do this explicitly for the log loss $\lambda$ .

(2) Since $\lambda$ and $\ell$ are proper, the normal vector to $\lambda(\textnormal{int}(\Delta^{n}))$ and $\ell(\textnormal{int}(\Delta^{n}))$ at $\lambda(p^{*})$ and $\ell(p^{*})$ , respectively, is $p^{*}/|p^{*}|$ . Hence we can identify their tangent spaces at these points. We do so and fix the parametrizations given in step (1).

(3) By assuming $\eta$ -mixability of $\ell$ , we look at the principal curvatures of $E_{\eta}(\ell(\textnormal{int}(\Delta^{n})))$ and prove an equivalent condition for them to be non-negative with respect to the normal vector field pointing towards $E_{\eta}(\textnormal{spr}(\ell))$ (i.e., convexity). The condition to be satisfied turns out to be a comparison of the scalar second fundamental forms of $\lambda$ and $\ell$ , which we can recognize by step (1).

(4) We interpret this comparison as follows. The tangent spaces at $\ell(p^{*})$ (and $\eta\ell(p^{*})$ ) and $\lambda(p^{*})$ coincide for the chosen point $p^{*}$ , so we translate $\ell$ to coincide with $\lambda$ at $p^{*}$ and call this common tangent space $H$ (note that it can be identified with the supporting plane of the loss functions at the given point). Then if we express (locally) $\eta\ell(\textnormal{int}(\Delta^{n}))$ and $\lambda(\textnormal{int}(\Delta^{n}))$ over $H$ , the graph of $\eta\ell(\textnormal{int}(\Delta^{n}))$ lies above the graph of $\lambda(\textnormal{int}(\Delta^{n}))$ . See Figure 6.

3.1. Representing proper loss functions as graphs over Euclidean spaces

When restricting to the set of admissible loss functions $\mathcal{L}_{n}$ ( $n\geq 2$ ), we can represent losses as functions over $\mathbb{R}^{n-1}$ (a similar approach was taken in [vERW12]; the difference lies in the fact that here we are after the comparison of second fundamental forms), which allows us to represent geometric quantities in a simple way. This will be useful to recognize these quantities when comparing a proper loss function $\ell$ to the log loss $\lambda$ , as we did for the binary case in Section 2. Let $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ be a proper loss in $\mathcal{L}_{n}$ given by

[TABLE]

Let $\Phi\colon\Delta^{n-1}\subset\mathbb{R}^{n-1}\longrightarrow\Delta^{n}$ be the standard parametrization of $\Delta^{n}$ given by

$$\Phi(s_{1},...,s_{n-1})=\left(s_{1},...,s_{n-1},1-\sum_{i=1}^{n-1}s_{i}\right),$$

where $s=(s_{1},...,s_{n-1})\in\Delta^{n-1}$ . The local expression of $\widetilde{\ell}$ in these coordinates is then given by $\widetilde{\ell}(s)=(\ell\circ\Phi)(s)$ , so that $\widetilde{\ell}_{i}(s)=(\ell_{i}\circ\Phi)(s)$ . Also, we define the projection $\Pi\colon\mathbb{R}^{n}_{\geq 0}\longrightarrow\mathbb{R}^{n-1}_{\geq 0}$ as $\Pi(y_{1},...,y_{n})=(y_{1},...,y_{n-1})$ .

Recall that properness implies that the normal vector of $M_{\ell}=\ell(\Delta^{n})$ at $\ell(p)$ can be chosen to be $|p|^{-1}p$ , for $p\in\textnormal{int}(\Delta^{n})$ . As a consequence, the normal vector is never parallel to the hyperplane $\{(x_{1},...,x_{n})\in\mathbb{R}^{n}\,|\,x_{n}=0\}$ , so that around any point $\ell(p)$ with $p\in\textnormal{int}(\Delta^{n})$ , $M_{\ell}$ can be written as a graph over $\mathbb{R}^{n-1}_{\geq 0}\times\{0\}$ (as regular as $M_{\ell}$ is). In general, the existence of this function is guaranteed by the implicit function theorem; in our case, however, we can give an explicit description of it as follows. The function $\Pi|_{M_{\ell}}$ is a map with injective derivative, say around $\ell(q)$ for a fixed $q\in\textnormal{int}(\Delta^{n})$ ; therefore, the inverse function theorem ensures the existence (and differentiability) of a local inverse, which we denote by $\Pi|_{M_{\ell}}^{-1}$ . This inverse map can be seen as a local parametrization of $M_{\ell}$ . Thus, the local expression of $\ell$ (viewed as a map from $\Delta^{n}$ to $M_{\ell}$ ), $\overline{\ell}\colon D_{q}\subset\mathbb{R}^{n-1}\longrightarrow U_{\ell(q)}\subset\mathbb{R}^{n-1}$ (where $D_{q}$ and $U_{\ell(q)}$ are small neighborhoods around $\Phi^{-1}(q)$ and $\Pi(\ell(q))$ respectively) is given by

[TABLE]

This map is a diffeomorphism and its inverse $\overline{\ell}^{-1}\colon U_{\ell(q)}\longrightarrow D_{q}$ , will be denoted by

[TABLE]

We warn the reader about this abuse of notation, $\widetilde{\ell}^{-1}_{i}(x)$ is not the inverse of $\widetilde{\ell}_{i}(s)$ , it is a map satisfying

[TABLE]

We want to define $f\colon U_{\ell(q)}\longrightarrow\mathbb{R}$ such that $\textnormal{graph}(f)\subset M_{\ell}$ . We see that setting $U_{\ell(q)}\subset\Pi(M_{\ell})$ , so that it contains $\Pi(\ell(q))$ , we arrive at

[TABLE]

We have obtained the following result.

Lemma 3.3.

Let $\ell\in\mathcal{L}_{n}$ be a strictly proper loss. Let $q\in\textnormal{int}(\Delta^{n})$ . Then there exists an open set $U\subset\mathbb{R}^{n-1}_{\geq 0}\times\{0\}$ and a function $f\colon U\longrightarrow\mathbb{R}_{\geq 0}$ such that $M_{\ell}$ admits the parametrization

$$\Phi^{f}(x)=(x,f(x)),\qquad x\in U,$$

around $\ell(q)$ .

Let $\ell$ and $f$ be as in Lemma 3.3. The unit normal vector field (pointing towards $\mathbb{R}^{n}_{\geq 0}$ ) is then given by

[TABLE]

We proceed to calculate the scalar second fundamental form. The first and second derivatives of $\Phi^{f}$ are given by

[TABLE]

for $k,m=1,...,n-1$ , where $e_{k}$ denotes the canonical basis of $\mathbb{R}^{n-1}$ and $\mathbf{0}$ is the 0 vector of $\mathbb{R}^{n-1}$ . Denote by $h^{\ell}$ the scalar second fundamental form of $M_{\ell}$ . Thus with respect to these coordinates we have

[TABLE]

for $k,m=1,...,n-1$ .

3.1.1. $M_{\lambda}$ as a graph

Fix an arbitrary point $q^{*}\in\textnormal{int}(\Delta^{n})$ . The local expression of $\lambda$ (with respect to the standard parametrization $\Phi=\Phi_{\textnormal{std}}$ around $q^{*}$ and $\Pi$ around $\lambda(q^{*})$ ) is given by

[TABLE]

thus, we have

[TABLE]

Fix $s^{*}=\Phi^{-1}(q^{*})$ . Thus, using Lemma 3.3 around $x^{*}=\Pi(\lambda(q^{*}))$ , $M_{\lambda}$ can be described around $\lambda(q^{*})$ as

[TABLE]

Moreover, in this case we have the explicit expression $g(x)=-\ln(1-\sum_{i=1}^{n-1}e^{-x_{i}})$ . Notice that $\overline{\lambda}^{-1}(x^{*})=s^{*}$ . We now compute the scalar second fundamental form $h^{\lambda}$ of $\lambda$ at $x^{*}$ .

[TABLE]

for $k,m=1,...,n-1$ (here $\delta_{km}$ denotes the Kronecker delta). In particular,

[TABLE]

for $k,m=1,...,n-1$ , and since $\mathbf{n}(x^{*},g(x^{*}))=\frac{1}{\sqrt{\sum_{i=1}^{n-1}(s^{*}_{i})^{2}+(1-\sum_{i=1}^{n-1}s^{*}_{i})^{2}}}(s^{*},1-\sum_{i=1}^{n-1}s^{*}_{i})$ we have

[TABLE]

for $k,m=1,...,n-1$ .
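The graph description of $M_{\lambda}$ can be checked numerically. The following sketch, for $n=3$ , verifies that $g(x^{*})$ recovers the last partial loss $-\ln(1-\sum_{i}s^{*}_{i})$ and compares a finite-difference Hessian of $g$ at $x^{*}$ with the closed form $\partial_{k}\partial_{m}g(x^{*})=\delta_{km}s^{*}_{k}/s^{*}_{n}+s^{*}_{k}s^{*}_{m}/(s^{*}_{n})^{2}$ , where $s^{*}_{n}=1-\sum_{i}s^{*}_{i}$ . This closed form is obtained here by differentiating $g$ directly; only the Hessian is examined, and the positive normalizing factor coming from the unit normal is omitted. The helper `hessian_fd` and the chosen point are assumptions of this illustration.

```python
import math

def g(x):
    """Graph function of the log loss: g(x) = -ln(1 - sum_i exp(-x_i))."""
    return -math.log(1.0 - sum(math.exp(-xi) for xi in x))

def hessian_fd(f, x, h=1e-4):
    """Central finite-difference Hessian of f at x (hypothetical helper)."""
    n = len(x)
    H = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for m in range(n):
            def shifted(dk, dm):
                y = list(x)
                y[k] += dk
                y[m] += dm
                return f(y)
            H[k][m] = (shifted(h, h) - shifted(h, -h)
                       - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)
    return H

s = (0.2, 0.5)                           # s* with q* = (0.2, 0.5, 0.3)
s_n = 1.0 - sum(s)
x_star = [-math.log(si) for si in s]     # x* = Pi(lambda(q*))

H = hessian_fd(g, x_star)
# Closed form of the Hessian of g at x*, using exp(-x*_k) = s*_k.
H_exact = [[(s[k] / s_n if k == m else 0.0) + s[k] * s[m] / s_n ** 2
            for m in range(2)] for k in range(2)]
print(H, H_exact)
```

In this example the Hessian is positive definite, in agreement with the strictly positive principal curvatures required by Lemma 3.2.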

Remark 3.4.

Note that if instead of $\lambda$ we had used a translation of it, that is, for $c\in\mathbb{R}^{n}$ , a loss function $\varphi\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ defined by

[TABLE]

we can repeat the previous computation. The only difference is that we would have a different point $x_{*}^{c}$ instead of $x_{*}$ .

3.2. Geometric interpretation of mixability

Mixability is defined as a property of the superprediction set of a proper loss $\ell\in\mathcal{L}_{n}$ . More precisely, $\ell$ is mixable if and only if $E_{\eta}(\textnormal{spr}(\ell))$ is convex for some $\eta>0$ . As before, we can determine whether $E_{\eta}(\textnormal{spr}(\ell))$ is convex by looking at its boundary $\partial E_{\eta}(\textnormal{spr}(\ell))=E_{\eta}(\ell(\Delta^{n}))$ . The set $E_{\eta}(\textnormal{spr}(\ell))$ is convex if the principal curvatures of $E_{\eta}(\ell(\Delta^{n}))$ are non-negative (when defined with respect to the inward-pointing normal vector) at all points. Since convexity is a global property that can be tested “locally everywhere”, it makes sense to make the following definition.

Definition 3.5 ( $\eta$ -Mixability at $p\in\Delta^{n}$ ).

We say that $\ell\in\mathcal{L}_{n}$ is $\eta$ -mixable at $p\in\textnormal{int}(\Delta^{n})$ if $E_{\eta}(M_{\ell})$ has non-negative principal curvatures with respect to the unit normal vector pointing towards $E_{\eta}(\textnormal{spr}(\ell))$ at $E_{\eta}(\ell(p))$ .

Clearly, $\ell\in\mathcal{L}_{n}$ is $\eta$ -mixable at all $p\in\textnormal{int}(\Delta^{n})$ if and only if it is $\eta$ -mixable.
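To make the convexity condition concrete, the sketch below tests it for the log loss with $n=3$ . It assumes the usual form of the exponential projection, $E_{\eta}(y)=(e^{-\eta y_{1}},...,e^{-\eta y_{n}})$ , and uses the observation that a point $z$ with positive entries lies in $E_{\eta}(\textnormal{spr}(\lambda))$ exactly when $\sum_{i}z_{i}^{1/\eta}\leq 1$ . The function names and the two sample points are hypothetical. A midpoint test of this kind can only refute convexity; passing it at one pair of points does not prove convexity.

```python
import math

def exp_projection(y, eta):
    """Assumed form: E_eta(y) = (exp(-eta*y_1), ..., exp(-eta*y_n))."""
    return [math.exp(-eta * yi) for yi in y]

def log_loss(p):
    """Log loss: lambda(p) = (-ln p_1, ..., -ln p_n)."""
    return [-math.log(pi) for pi in p]

def in_projected_spr_log(z, eta, tol=1e-12):
    """z is in E_eta(spr(log loss)) iff some r in the simplex has r_i**eta >= z_i,
    i.e. iff sum_i z_i**(1/eta) <= 1 (all z_i > 0 assumed)."""
    return sum(zi ** (1.0 / eta) for zi in z) <= 1.0 + tol

def midpoint_stays_inside(p, q, eta):
    """Check whether the midpoint of two boundary points stays in the projected set."""
    a = exp_projection(log_loss(p), eta)
    b = exp_projection(log_loss(q), eta)
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return in_projected_spr_log(mid, eta)

p, q = (0.7, 0.2, 0.1), (0.1, 0.3, 0.6)
for eta in (0.5, 1.0, 2.0):
    print(eta, midpoint_stays_inside(p, q, eta))
```

For $\eta=2$ the midpoint leaves the set, so $E_{2}(\textnormal{spr}(\lambda))$ is not convex, while the test passes for $\eta=0.5$ and $\eta=1$ , consistent with the log loss being 1-mixable.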

Let $\ell,\varrho\in\mathcal{L}_{n}$ be strictly proper. First, we note that properness implies that the second fundamental forms of $\ell$ and $\varrho$ can be compared in the following sense. Given $q^{*}\in\Delta^{n}$ , note that the normal vector to $M_{\ell}$ and $M_{\varrho}$ can be chosen to be $q^{*}/|q^{*}|$ . A translation does not affect the geometric properties of $M_{\varrho}$ (since it is an isometry of $\mathbb{R}^{n}$ ), thus we consider the translated loss $\varrho^{\ell(q^{*})}\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}$ , given by

[TABLE]

i.e., we translate $\varrho$ by the vector $c^{q^{*}}=\varrho(q^{*})-\ell(q^{*})$ so that both $\varrho^{\ell(q^{*})}$ and $\ell$ coincide when evaluated at $q^{*}$ . Doing so allows us to identify the tangent spaces to $M_{\varrho^{\ell(q^{*})}}$ and $M_{\ell}$ at $\varrho^{\ell(q^{*})}(q^{*})=\ell(q^{*})$ . We will call $\varrho^{\ell(q^{*})}$ the translation of $\varrho$ to $\ell(q^{*})$ .

Lemma 3.6.

Let $\ell\in\mathcal{L}_{n}$ be strictly proper. Let $h^{\ell}$ and $h^{\lambda}$ denote the scalar second fundamental form of $M_{\ell}$ and $M_{\lambda}$ (the log loss), respectively. Then, $\ell$ is $\eta$ -mixable at $p\in\textnormal{int}(\Delta^{n})$ if and only if

[TABLE]

is positive semi-definite, where $h^{\ell}$ and $h^{\lambda}$ are expressed in the graphical coordinates described in Lemma 3.3. Therefore, $\ell$ is $\eta$ -mixable if and only if \[email protected] holds for all $p\in\textnormal{int}(\Delta^{n})$ .

Proof.

Let $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ be an admissible proper loss

[TABLE]

The $\eta$ -exponential projection map $E_{\eta}\colon\mathbb{R}^{n}\longrightarrow\mathbb{R}^{n}$ is given by

$$E_{\eta}(y_{1},...,y_{n})=(e^{-\eta y_{1}},...,e^{-\eta y_{n}}).$$

Let $q^{*}\in\textnormal{int}(\Delta^{n})$ and write $M_{\ell}$ around $\ell(q^{*})$ as the graph of a function $f$ over $\mathbb{R}^{n-1}$ , defined on an open set $U^{f}_{x^{*}}$ containing $x^{*}$ , such that $(x^{*},f(x^{*}))=\ell(q^{*})$ . We can directly give a parametrization of $E_{\eta}(M_{\ell})$ around $E_{\eta}(\ell(q^{*}))=E_{\eta}((x^{*},f(x^{*})))$ by

[TABLE]

We proceed to compute the second fundamental form of $E_{\eta}(M_{\ell})$ around $E_{\eta}(\ell(q^{*}))$ (with respect to the inward pointing unit normal vector). The first and second derivatives of $\Psi$ are given by

[TABLE]

and the (inward pointing) unit normal vector field is given by

[TABLE]

Therefore, letting $\mathcal{E}^{\eta}\coloneqq\mathcal{E}^{\eta}(U_{f})=E_{\eta}(f(U_{f}))$ , the second fundamental form of $\mathcal{E}^{\eta}$ at $E_{\eta}((x^{*},f(x^{*})))$ is given by

[TABLE]

Thus, since the convexity of $E_{\eta}(\textnormal{spr}(\ell))$ is equivalent to the principal curvatures of $E_{\eta}(M_{\ell})$ being non-negative at $q^{*}$ for all $q^{*}\in\textnormal{int}(\Delta^{n})$ (with respect to the inner pointing normal vector), we see this will be the case if and only if the matrix

[TABLE]

is positive semi-definite for all $x^{*}$ corresponding to $q^{*}\in\textnormal{int}(\Delta^{n})$ .

Note that since we have a graphical parametrization $\Phi^{f}$ of $M_{\ell}$ around $x^{*}\in U$ , we have

[TABLE]

and by \[email protected],

[TABLE]

On the other hand, since the normal vector to $\Phi^{f}(U)$ at $(x^{*},f(x^{*}))$ is $\frac{q^{*}}{|q^{*}|}$ , we have

[TABLE]

for $s^{*}\in\mathbb{R}^{n-1}$ such that $\Phi(s^{*})=q^{*}$ .

By properness we know that

[TABLE]

thus

[TABLE]

and also

[TABLE]

Using \[email protected] and the previous observations, we can rewrite the terms of $A_{km}$ as

[TABLE]

and

[TABLE]

Now, consider the log loss $\lambda$ and its translation to $\ell(q^{*})$ which we denote by $\lambda^{*}$ to simplify the notation. That is, we have

[TABLE]

As discussed in Remark 3.4, we can write $M_{\lambda^{*}}$ as a graph around $x^{*}$ (since $\lambda^{*}(q^{*})=\ell(q^{*})$ ). The scalar second fundamental form of $M_{\lambda^{*}}$ at $\lambda^{*}(q^{*})$ is then given by

[TABLE]

This readily implies that

[TABLE]

Therefore, $\ell$ is $\eta$ -mixable at $q^{*}$ if and only if we have that

[TABLE]

is positive semi-definite. Since $q^{*}$ was arbitrary the result follows. ∎

Remark 3.7.

The previous comparison of second fundamental forms is possible because properness forces the metrics induced by $\ell$ and $\lambda$ to coincide at $\ell(q^{*})=\lambda^{*}(q^{*})$ , that is, $[g^{\ell}_{ij}](x^{*})=[g^{\lambda^{*}}_{ij}](x^{*})$ (see Appendix A and Remark A.3). The conclusion of Lemma 3.6 does not necessarily hold if one takes a different coordinate system.

In order to get a geometric interpretation (i.e., independent of coordinates) we note the following:

[TABLE]

The matrices $[h^{\ell}_{ij}](x^{*})[g^{\ell}]^{-1}(x^{*})$ and $[h^{\lambda^{*}}_{ij}](x^{*})[g^{\lambda^{*}}]^{-1}(x^{*})$ are the local expression of the Weingarten map (see [Lee18] for its definition and properties) of $\ell$ and $\lambda$ respectively. The eigenvalues of these matrices are the principal curvatures of $M_{\ell}$ and $M_{\lambda}$ (and they are independent of coordinates), and the determinants are their Gaussian curvatures. From here it also follows that

[TABLE]

that is,

[TABLE]

where $W^{\ell}$ denotes the Weingarten map of the loss function $\ell$ . Then once a system of coordinates around $p\in\Delta^{n}$ is chosen the relation \[email protected] holds. A priori, the relation obtained between the Weingarten maps of $\ell$ and $\lambda$ does not provide much information, but it does suggest looking at the loss function $\eta\ell$ . With this in mind, Lemma 3.6 does give a direct geometric interpretation as follows. Let $\ell\colon\Delta^{n}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ in $\mathcal{L}$ be a proper loss. Given a point $q\in\Delta^{n}$ we know that around $\ell(q)$ , $M_{\ell}$ can be parametrized with $\Phi^{f}(x)=(x,f(x))$ for some function $f$ around the point $\Pi(\ell(q))$ . Let $x^{*}=\Pi(\ell(q))$ . Consider now the proper loss $\varrho=\eta\ell$ , for some $\eta>0$ . We readily see that $M_{\varrho}$ can be parametrized as $\Phi^{g}(y)=(y,g(y))$ with

[TABLE]

with $g$ defined around $y_{q}=\eta x^{*}$ . Now we compute the second fundamental form of $\varrho$ at $y_{q}$ . Notice that

[TABLE]

and hence,

[TABLE]

Then assuming the hypothesis of Lemma 3.6, we obtain

[TABLE]
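The effect of scaling on second derivatives can be illustrated numerically. Since $M_{\eta\ell}$ is the image of $M_{\ell}$ under multiplication by $\eta$ , the sketch below assumes the graph function of $\eta\ell$ has the form $g(y)=\eta f(y/\eta)$ , and checks that $g^{\prime\prime}(\eta x^{*})=f^{\prime\prime}(x^{*})/\eta$ for the binary log loss, whose graph function is $f(x)=-\ln(1-e^{-x})$ . The values of `eta` and `x_star` and the helper `second_derivative` are arbitrary choices for this illustration.

```python
import math

def f(x):
    """Binary log loss as a graph: f(x) = -ln(1 - exp(-x)), x > 0."""
    return -math.log(1.0 - math.exp(-x))

def second_derivative(func, x, h=1e-4):
    """Central finite-difference approximation of func''(x)."""
    return (func(x + h) - 2.0 * func(x) + func(x - h)) / (h * h)

eta = 2.5

def g(y):
    """Graph function of eta * loss, assuming g(y) = eta * f(y / eta)."""
    return eta * f(y / eta)

x_star = 0.8
y_q = eta * x_star

lhs = second_derivative(g, y_q)            # curvature term of the scaled loss
rhs = second_derivative(f, x_star) / eta   # curvature term of the original loss, divided by eta
print(lhs, rhs)
```

Scaling the loss by $\eta$ thus divides the second derivatives of the graph function by $\eta$ , which is the mechanism by which the factor $\eta$ enters the comparison with the log loss.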

The supporting planes at $\eta\ell(p)$ and $\lambda(p)$ of $M_{\eta\ell}$ (or more precisely, of its translation to $\lambda(p)$ ) and $M_{\lambda}$ , respectively, coincide (since the normal vectors are the same); we denote this common plane by $H_{p}$ . By looking at $M_{\eta\ell}$ and $M_{\lambda}$ locally as graphs over $H_{p}$ , Lemma 3.6 gives the following comparison of graphs, which in turn can be regarded as local embeddability in the sense of convex geometry (see Definition 4.16 below).

Theorem 3.8.

A proper loss $\ell\in\mathcal{L}_{n}$ is $\eta$ -mixable if and only if for all $p\in\textnormal{int}(\Delta^{n})$ the local graph of the translation of $\eta\ell$ to $\lambda(p)$ over the supporting plane to both $M_{\ell_{p}}$ and $M_{\lambda}$ at $\lambda(p)$ , $H_{p}$ , lies above the graph of $\lambda$ over $H_{p}$ .

Remark 3.9.

We would like to point out the resemblance of Lemma 3.6 to Theorem 10 in [vERW12]. To recover the latter from our point of view we will first reinterpret Lemma 3.6 and Theorem 3.8 from a convex geometry point of view which will lead to a transparent bridge between Lemma 3.6 and [vERW12, Theorem 10].

4. Connections to convex geometry

In this part we reinterpret our results from the point of view of convex geometry. With this interpretation we can relate Theorem 3.8 to results in [vERW12] and [WC22]. We first provide some background and state relevant results from convex geometry which are well-known and can be found in [Sch14] and can be adapted to our setting.

Let $K\subset\mathbb{R}^{n}$ be a convex set, that is

$$\lambda x+(1-\lambda)y\in K$$

for all $x,y\in K$ and $\lambda\in[0,1]$ .

We define the recession cone of $K$ as the set

$$\textnormal{rec}(K)=\{x\in\mathbb{R}^{n}\,|\,K+x\subset K\}.$$

The boundary of $K$ is denoted by $\partial K$ , and since we will assume that $\partial K$ is a differentiable manifold, we denote the interior (as a manifold) of $\partial K$ by $\textnormal{int}(\partial K)$ . As usual, the scaling of $K$ by $\eta>0$ and the Minkowski sum of $K$ and $L$ are defined as

$$\eta K=\{\eta x\,|\,x\in K\},\qquad K+L=\{x+y\,|\,x\in K,\ y\in L\}.$$

Definition 4.1.

Let $K$ be a closed convex set in $\mathbb{R}^{n}$ . The support function of $K$ , $\sigma(K,u)\colon\mathbb{R}^{n}\longrightarrow\overline{\mathbb{R}}$ , is defined as

$$\sigma(K,u)=\sup_{x\in K}\langle x,u\rangle.$$

We sometimes denote it as $\sigma_{K}(u)\coloneqq\sigma(K,u)$ .

From the definition we know that

[TABLE]

From [Sch14, Section 1.7] we have the following.

Lemma 4.2 (Properties of $\sigma$ ).

Let $L,K\subset\mathbb{R}^{n}$ be closed convex sets.

(1) $\sigma_{L}\leq\sigma_{K}$ if and only if $L\subset K$ .

(2) $\sigma(K+t,u)=\sigma(K,u)+\langle t,u\rangle$ for all $t\in\mathbb{R}^{n}$ .

(3) $\sigma(K+L,u)=\sigma(K,u)+\sigma(L,u)$ .
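Properties (2) and (3) are easy to check on polytopes, where the supremum defining $\sigma$ is a maximum over finitely many vertices. The sketch below is a simplification: it uses compact polygons in $\mathbb{R}^{2}$ rather than the unbounded sets considered later, and the names `support`, `minkowski_sum` and `translate` are hypothetical helpers.

```python
def dot(x, u):
    """Euclidean inner product <x, u>."""
    return sum(xi * ui for xi, ui in zip(x, u))

def support(points, u):
    """Support function of the convex hull of a finite point set: max_x <x, u>."""
    return max(dot(x, u) for x in points)

def minkowski_sum(K, L):
    """Pairwise sums; their convex hull is the Minkowski sum of the two hulls."""
    return [tuple(a + b for a, b in zip(x, y)) for x in K for y in L]

def translate(K, t):
    """The translated set K + t."""
    return [tuple(a + b for a, b in zip(x, t)) for x in K]

K = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]                    # triangle
L = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]    # square
t = (3.0, -2.0)
directions = [(1.0, 0.0), (0.0, 1.0), (-1.0, 2.0), (-3.0, -1.0)]

# Property (2): sigma(K + t, u) = sigma(K, u) + <t, u>.
translation_ok = all(
    abs(support(translate(K, t), u) - (support(K, u) + dot(t, u))) < 1e-12
    for u in directions)

# Property (3): sigma(K + L, u) = sigma(K, u) + sigma(L, u).
additivity_ok = all(
    abs(support(minkowski_sum(K, L), u) - (support(K, u) + support(L, u))) < 1e-12
    for u in directions)

print(translation_ok, additivity_ok)
```

Additivity under Minkowski sums is what later allows a summand to be recovered from a difference of support functions.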

Definition 4.3.

A function $f\colon D\subset\mathbb{R}^{n}\longrightarrow\overline{\mathbb{R}}$ is convex if its extension to $\mathbb{R}^{n}$ given by

[TABLE]

is convex.

The following lemma is a well-known result (see [Sch14, Theorem 1.7.1] for example).

Lemma 4.4.

Let $f\colon\mathbb{R}^{n}\to\mathbb{R}$ be convex, closed and positively homogeneous. Then $f$ is the support function of the convex, closed set

$$\{x\in\mathbb{R}^{n}\,|\,\langle x,u\rangle\leq f(u)\ \textnormal{for all}\ u\in\mathbb{R}^{n}\}.$$

Definition 4.5.

Let $L,K\subset\mathbb{R}^{n}$ be closed and convex. We say that $L$ is a summand of $K$ if there exists a convex, closed set $M\subset\mathbb{R}^{n}$ such that $K=M+L$ .

We will be mainly interested in sets $K$ whose recession cone is $\mathbb{R}^{n}_{\geq 0}$ , hence we denote by $\mathcal{K}^{n}_{*}$ the set of closed, convex sets whose recession cone is $\mathbb{R}^{n}_{\geq 0}$ . In the following we extend some common results in convex geometry which are usually stated for compact convex sets in $\mathbb{R}^{n}$ (see [Sch14]); some of them are easily extended to $\mathcal{K}_{*}^{n}$ [Shv01].

Lemma 4.6 (Basic properties of sets in $\mathcal{K}_{*}^{n}$ ).

Let $K,L\in\mathcal{K}_{*}^{n}$ and $\eta>0$ . Then, the following holds:

(1) $\eta K\in\mathcal{K}_{*}^{n}$ ,

(2) $\textnormal{rec}(K+L)=\mathbb{R}^{n}_{\geq 0}$ ,

(3) $K+L$ is closed, and

(4) $K+L\in\mathcal{K}_{*}^{n}$ .

Proof.

In order to show (1), we need to show that $\eta K$ is closed, convex and $\textnormal{rec}(\eta K)=\mathbb{R}^{n}_{\geq 0}$ . Let $x,y\in\eta K$ and $\lambda\in[0,1]$ , then we have

[TABLE]

where $x=\eta k_{x}$ and $y=\eta k_{y}$ for some $k_{x},k_{y}\in K$ . Since $K$ is convex, $\lambda k_{x}+(1-\lambda)k_{y}\in K$ and hence $\eta K$ is convex. Let $x_{n}\in\eta K$ be a sequence that converges to $x$ . Then, there exist $k_{x_{n}}\in K$ such that $x_{n}=\eta k_{x_{n}}$ . Since $\eta$ is a constant, $\{k_{x_{n}}\}$ converges to some $k_{x_{\infty}}\in K$ (since $K$ is closed). By the uniqueness of the limit, $x=\eta k_{x_{\infty}}\in\eta K$ , so $\eta K$ is closed. Now, let $x\in\mathbb{R}^{n}_{\geq 0}$ ; we want to show that $\eta K+x\subset\eta K$ . Take any $k\in K$ ,

[TABLE]

since $\frac{1}{\eta}x\in\mathbb{R}^{n}_{\geq 0}$ . Conversely, if $x\in\textnormal{rec}(\eta K)$ , then for any $k_{1}\in K$ , we have

[TABLE]

then there exists $k_{2}\in K$ , such that $\eta k_{1}+x=\eta k_{2}$ . Hence

[TABLE]

thus $\frac{1}{\eta}x\in\textnormal{rec}(K)=\mathbb{R}^{n}_{\geq 0}$ , thus $x\in\mathbb{R}^{n}_{\geq 0}$ .

To show (2), let $x\in\mathbb{R}^{n}_{\geq 0}$ . We want to show that $K+L+x\subset K+L$ . Let $k\in K$ and $l\in L$ , then

[TABLE]

since $l+x\in L$ . Thus $\mathbb{R}^{n}_{\geq 0}\subset\textnormal{rec}(K+L)$ . Now, suppose that there is $x\in\textnormal{rec}(K+L)$ such that $x\notin\mathbb{R}^{n}_{\geq 0}$ . Since $\textnormal{rec}(K+L)$ is a cone, for all $\lambda>0$ , we have $\lambda x\in\textnormal{rec}(K+L)$ . Let $k\in K$ and $l\in L$ . Then

[TABLE]

Thus $l+\lambda x\in\textnormal{rec}(K)=\mathbb{R}^{n}_{\geq 0}$ for all $\lambda>0$ , but notice that this is a contradiction since by picking $\lambda$ sufficiently large, $l+\lambda x\notin\mathbb{R}^{n}_{\geq 0}$ . Thus $\textnormal{rec}(K+L)=\mathbb{R}^{n}_{\geq 0}$ .

For (3), see Rockafellar [Roc70] Thm. 8.2 and [Shv01] Thm. 3.1. (4) is simply the combination of (2) and (3) (and the fact that $K+L$ is convex). ∎

We now specialize the discussion to a particular type of sets $K\in\mathcal{K}_{*}^{n}$ . First, suppose that the boundary $\partial K$ is of class $C^{2}$ , then at each point $x\in\textnormal{int}(\partial K)$ there is an outward pointing normal vector $\mathbf{u}_{K}(x)$ . Thus, clearly, we can define a map $\mathbf{u}_{K}\colon\textnormal{int}(\partial K)\longrightarrow\mathbb{S}^{n-1}$ assigning $\mathbf{u}_{K}(x)$ to $x\in\textnormal{int}(\partial K)$ . We define

[TABLE]

so that

[TABLE]

Definition 4.7.

Define $C^{2}_{+}(\mathcal{K}_{*}^{n})$ as the collection of sets $K\in\mathcal{K}_{*}^{n}$ with boundary $\partial K$ of class $C^{2}$ , and such that the map $\mathbf{u}_{K}$ is a $C^{1}$ -diffeomorphism from $\textnormal{int}(\partial K)$ to $\mathbb{S}^{n-1}_{-}\coloneqq\mathbb{S}^{n-1}\cap\mathbb{R}^{n}_{<0}$ .

We now specialize some properties of the support function to $C^{2}_{+}(\mathcal{K}_{*}^{n})$ .

Lemma 4.8.

If $K\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ , then $\textnormal{dom}(\sigma_{K})=\textnormal{int}(\mathbb{R}^{n}_{\leq 0})\cup\{0\}$ .

Proof.

Take $u\neq 0$ in $\textnormal{dom}(\sigma_{K})$ , then it must be an outward normal vector to $\textnormal{int}(\partial K)$ , hence it is in $\mathbb{S}^{n-1}_{-}$ . Then $\textnormal{dom}(\sigma_{K})\subset\textnormal{int}(\mathbb{R}^{n}_{\leq 0})\cup\{0\}$ . Now, for $u\in\mathbb{R}^{n}_{<0}$ , normalize it to a unit vector by letting $v=u/|u|$ , then $v\in\mathbb{S}^{n-1}_{-}$ and thus it must be a normal vector for some $x\in\textnormal{int}(\partial K)$ , hence the support function evaluated at $v$ is finite, and in consequence $\sigma_{K}(u)$ is finite too. ∎

Remark 4.9.

Following Schneider [Sch14, Section 2.5] the condition $K\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ is equivalent to assuming the principal curvatures of $\partial K$ to be non-zero. It also follows that

[TABLE]

and moreover, $\sigma_{K}$ is of class $C^{2}$ .

Remark 4.10.

Let $\ell\in\mathcal{L}_{n}$ be a proper loss function. By definition we see that Remark 4.9 implies $\textnormal{spr}(\ell)\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ (since $M_{\ell}=\partial(\textnormal{spr}(\ell))$ ).

Definition 4.11.

Let $K,L\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ . We say that $L$ slides freely inside $K$ if for each boundary point $x$ of $K$ , there exists a translation vector $t\in\mathbb{R}^{n}$ such that $x\in L+t\subset K$ .

Theorem 4.12.

Let $K,L\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ . If $L$ is a summand of $K$ , then $L$ slides freely inside $K$ .

Proof.

Suppose that there exists $M\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ such that $K=L+M$ . Let $x\in\partial K$ . Then there are $l\in L$ and $m\in M$ such that

$$x=l+m.$$

Thus, $x\in L+m\subset L+M=K$ . ∎
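The argument in this proof can be visualized with a deliberately simple example. The sketch below uses compact disks in $\mathbb{R}^{2}$ , which are not elements of $C^{2}_{+}(\mathcal{K}_{*}^{n})$ but exhibit the same mechanism: with $K$ a disk of radius $R$ and $L$ a disk of radius $r\leq R$ , the disk $M$ of radius $R-r$ satisfies $K=L+M$ , and the translation $t=m$ from the decomposition $x=l+m$ places $L$ inside $K$ touching $x$ . The radii, sample counts and function names are assumptions of this illustration.

```python
import math

R, r = 3.0, 1.0   # K: disk of radius R, L: disk of radius r, M: disk of radius R - r

def in_disk(pt, center, radius, tol=1e-12):
    """Membership test for a closed disk, with a small numerical tolerance."""
    return math.hypot(pt[0] - center[0], pt[1] - center[1]) <= radius + tol

def sliding_translation(x):
    """For x on the boundary of K, write x = l + m with l in L, m in M; return t = m."""
    norm = math.hypot(*x)
    return ((R - r) * x[0] / norm, (R - r) * x[1] / norm)

def check(angle, samples=360):
    """Verify x in L + t and L + t inside K for the boundary point of K at this angle."""
    x = (R * math.cos(angle), R * math.sin(angle))
    t = sliding_translation(x)
    touches = in_disk(x, t, r)                                   # x belongs to L + t
    boundary = [(t[0] + r * math.cos(2 * math.pi * k / samples),
                 t[1] + r * math.sin(2 * math.pi * k / samples))
                for k in range(samples)]
    contained = all(in_disk(y, (0.0, 0.0), R) for y in boundary)  # L + t stays in K
    return touches and contained

print(all(check(2 * math.pi * k / 12) for k in range(12)))
```

Checking the sampled boundary of $L+t$ suffices here because a disk is the convex hull of its boundary circle.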

Remark 4.13.

For a general convex set $L$ , if $L$ is a summand of $K\in C^{2}_{+}(\mathcal{K}_{*}^{n})$ we see that the previous proof holds and we conclude that $L$ slides freely inside $K$ ; note however that this imposes restrictions on possible sets $L$ . One of these consequences is that the principal curvatures of $\partial L$ must be positive as can be seen from a second fundamental form comparison and Theorem 3.8.

Lemma 4.14.

Let $K,L\in C^{2}_{-}(\mathcal{K}^{n}_{*})$ and suppose that $f(\cdot)=\sigma_{K}(\cdot)-\sigma_{L}(\cdot)$ is convex. Then the set

[TABLE]

is in $C^{2}_{+}(\mathcal{K}^{n}_{*})$ , and it is such that $K=M+L$ , that is, $L$ and $M$ are summands of $K$ .

Proof.

From Lemma 4.8, the domain of $f$ is $\mathbb{R}^{n}_{<0}\cup\{0\}$ , i.e., $f\colon\mathbb{R}^{n}_{<0}\cup\{0\}\longrightarrow\mathbb{R}$ is convex. Thus it is the support function of $M$ (by Lemma 4.4). That is, $f(\cdot)=\sigma_{M}(\cdot)$ .

Therefore we have $\sigma_{M}=\sigma_{K}-\sigma_{L}$ , and hence $K=M+L$ . Since $M$ is a summand of $K$ , Theorem 4.12 shows that $M$ slides freely inside $K$ , and since $\partial K$ has positive principal curvatures, $\partial M$ does too (Remark 4.13). Since $\sigma_{M}$ is of class $C^{2}$ , $M$ has to be in $C^{2}_{-}(\mathcal{K}^{n}_{*})$ . ∎

Theorem 4.15.

([Sch14, Theorem 1.5.2]) Let $D\subset\mathbb{R}^{n}$ be convex and let $f\colon D\longrightarrow\mathbb{R}$ be a continuous function. Suppose that for each point $x_{0}\in D$ there are an affine function $g$ on $\mathbb{R}^{n}$ and a neighborhood $U$ of $x_{0}$ such that $f(x_{0})=g(x_{0})$ and $f\geq g$ in $U\cap D$ . Then $f$ is convex.

Definition 4.16.

We say that $L$ is locally embeddable in $K$ if for all $x\in\partial K$ , there is a $y\in L$ and a neighborhood $U$ of $y$ , such that $(L\cap U)+x-y\subset K$ .

Theorem 4.17.

Let $K,L\in C^{2}_{-}(\mathcal{K}_{*}^{n})$ and $L$ strictly convex. If $L$ is locally embeddable in $K$ , then $L$ is a summand of $K$ .

Proof.

Let $u_{0}\in\mathbb{S}^{n-1}_{-}$ and $x_{0}\in\partial K$ be a point such that $\mathbf{u}_{K}(x_{0})=u_{0}$ . Since $L$ is locally embeddable in $K$ there are $y_{0}\in L$ and a neighborhood $U_{0}$ of $y_{0}$ such that $(L\cap U_{0})+x_{0}-y_{0}\subset K$ . Since $\mathbf{u}^{-1}_{L}\colon\mathbb{S}^{n-1}_{-}\longrightarrow\partial L$ is continuous, there exists a neighborhood $V_{0}$ of $u_{0}$ such that $\mathbf{u}_{L}^{-1}(V_{0})\subset U_{0}$ . Then it follows that $\sigma(L+x_{0}-y_{0},u_{0})=\sigma(K,u_{0})$ and $\sigma(L+x_{0}-y_{0},u)\leq\sigma(K,u)$ for all $u\in V_{0}$ by Lemma 4.2.

Let $f(\cdot)=\sigma(K,\cdot)-\sigma(L,\cdot)$ (this is defined on $\mathbb{R}^{n}_{<0}\cup\{0\}$ and is positively homogeneous), and $g(\cdot)=\langle x_{0}-y_{0},\cdot\rangle$ . Then, clearly, we have

(i) $f(u_{0})=g(u_{0})$ , since

[TABLE]

(ii) $f\geq g$ on $V_{0}$ ,

[TABLE]

It follows by Theorem 4.15 that $f$ is convex, and by Lemma 4.14 we conclude that $L$ is a summand of $K$ . ∎

The following lemma is a direct consequence of the characterization of mixability in Theorem 3.8 and Definition 4.16.

Lemma 4.18.

Let $\ell\in\mathcal{L}_{n}$ be a proper loss. For $\eta>0$ , if $\ell$ is $\eta$ -mixable then $\textnormal{spr}(\eta\ell)$ is locally embeddable in $\textnormal{spr}(\lambda)$ .

Lemma 4.19.

If $\ell$ is $\eta$ -mixable then $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\lambda)$ .

Proof.

Let $\ell$ be $\eta$ -mixable. Then Lemma 4.18 implies that $\textnormal{spr}(\eta\ell)$ is locally embeddable in $\textnormal{spr}(\lambda)$ . Theorem 4.17 then implies that it is a summand of $\textnormal{spr}(\lambda)$ , and Theorem 4.12 implies that it slides freely inside $\textnormal{spr}(\lambda)$ . ∎

Corollary 4.20.

Let $\ell$ be an $\eta$ -mixable proper loss. Then $\textnormal{spr}(\eta\ell)\in C^{2}_{-}(\mathcal{K}_{*}^{n})$ and it slides freely inside $\textnormal{spr}(\lambda)$ ( $\lambda$ is the log loss). Additionally, there exists $M\in C^{2}_{-}(\mathcal{K}_{*}^{n})$ such that

$$\textnormal{spr}(\lambda)=\textnormal{spr}(\eta\ell)+M.$$

Moreover, $\partial M$ can be regarded as $\varrho(\Delta^{n})$ for a 1-mixable proper loss $\varrho$ .

Proof.

Since $\ell$ is an $\eta$ -mixable proper loss function, $\eta\ell$ is also a proper loss function and hence $\textnormal{spr}(\eta\ell)\in C^{2}_{-}(\mathcal{K}_{*}^{n})$ (Remark 4.10). Theorem 3.8 implies that $\textnormal{spr}(\eta\ell)$ is locally embeddable in $\textnormal{spr}(\lambda)$ . From Theorem 4.17 we know that $\textnormal{spr}(\eta\ell)$ is a summand of $\textnormal{spr}(\lambda)$ , which proves the existence of $M$ . As a consequence, $M$ is a convex set with recession cone $\mathbb{R}^{n}_{\geq 0}$ (Lemma 4.14). By applying [WC22, Proposition 21] we can regard $\partial M$ as the image of a proper loss function $\varrho$ . Since $\textnormal{spr}(\varrho)$ is a summand of $\textnormal{spr}(\lambda)$ , the loss $\varrho$ is 1-mixable (Lemma 4.14). ∎

We now state [Sch14, Theorem 2.5.4], adapted to our setting, which will be helpful in relating our work to [vERW12].

Theorem 4.21.

Let $K,L\in C^{2}_{-}(\mathcal{K}_{*}^{n})$ . Let $h^{M}(x)$ denote the second fundamental form of $M$ at $x$ with respect to $\mathbf{u}$ (see Appendix A.2). The following are equivalent:

(i) $h^{\partial L}(x)\geq h^{\partial K}(y)$ for all pairs of points $x$ and $y$ at which $\mathbf{u}(x)=\mathbf{u}(y)$ .

(ii) $\sigma_{K}-\sigma_{L}$ is a support function.

Since $\Delta^{n}$ is an affine manifold, the geodesics in $\Delta^{n}$ are simply straight lines. This allows us to define convexity of functions on $\Delta^{n}$ in the same way as for functions on $\mathbb{R}^{n}$ . The following theorem connects and reconciles our results with those in [vERW12]. More precisely, it creates a bridge between our results and [vERW12, Theorem 10].

Theorem 4.22.

Let $\ell\in\mathcal{L}_{n}$ be a proper loss and let $\eta>0$ . Then $\ell$ is $\eta$ -mixable if and only if $\eta\underline{L}^{\ell}(\cdot)-\underline{L}^{\lambda}(\cdot)$ is convex on $\textnormal{int}(\Delta^{n})$ , where $\underline{L}^{\varrho}(\cdot)$ denotes the Bayes risk of the loss function $\varrho$ (Definition 1.3) and $\lambda$ denotes the log loss.

Proof.

Suppose that $\ell$ is a proper loss in $\mathcal{L}_{n}$ which is $\eta$ -mixable. By Lemma 4.19 $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\lambda)$ and in particular $h^{\eta\ell}(\ell(p))\geq h^{\lambda}(\lambda(p))$ . By Theorem 4.21 it follows that $\sigma_{\textnormal{spr}(\lambda)}-\sigma_{\textnormal{spr}(\eta\ell)}$ is a support function with domain $\mathbb{R}^{n}_{<0}\cup\{0\}$ , in particular it is convex on its interior. Let $u\in\mathbb{R}^{n}_{<0}$ , such that the outward normal vector of $\ell(\Delta^{n})$ and $\lambda(\Delta^{n})$ at $\ell(p)$ and $\lambda(p)$ , respectively, is $u$ . Then we have for $x=-p\in\Delta^{n}$ ,

[TABLE]

which proves the claim. ∎
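The convexity criterion of Theorem 4.22 can be tested numerically in the binary case. The sketch below is an illustration under stated assumptions, not part of the paper: it takes the binary square loss $\ell_{1}(q)=(1-q)^{2}$ , $\ell_{0}(q)=q^{2}$ (assumed here to fall within the class of admissible losses), whose Bayes risk is $p(1-p)$ , and the log loss with natural logarithm, whose Bayes risk is the Shannon entropy. The second derivative of $\eta\,p(1-p)+p\log p+(1-p)\log(1-p)$ is $-2\eta+\frac{1}{p(1-p)}\geq 4-2\eta$ , so the criterion holds exactly when $\eta\leq 2$ , which agrees with the classical mixability constant of the square loss. The function names are hypothetical.

```python
import math


def bayes_risk_log(p: float) -> float:
    """Bayes risk of the binary log loss (Shannon entropy, natural log)."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def bayes_risk_square(p: float) -> float:
    """Bayes risk of the binary square loss l_1(q)=(1-q)^2, l_0(q)=q^2."""
    return p * (1.0 - p)


def min_second_difference(eta: float, steps: int = 199, h: float = 1e-3) -> float:
    """Smallest central second difference of eta*L_square - L_log on a grid.

    A non-negative value (up to rounding) is consistent with convexity on
    the sampled part of the interior of the simplex.
    """
    def g(p: float) -> float:
        return eta * bayes_risk_square(p) - bayes_risk_log(p)

    grid = [(k + 1) / (steps + 1) for k in range(steps)]  # 0.005, ..., 0.995
    return min((g(p + h) - 2.0 * g(p) + g(p - h)) / (h * h) for p in grid)


convex_at_2 = min_second_difference(2.0) > -1e-6      # eta = 2: convex
not_convex_at_2_5 = min_second_difference(2.5) < 0.0  # eta = 2.5: fails near p = 1/2
```

The tolerance `-1e-6` only absorbs floating-point error at $p=1/2$ , where the exact second derivative vanishes for $\eta=2$ .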

Suppose now that for a given proper $\ell\in\mathcal{L}_{n}$ there exists an $\eta>0$ such that $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\lambda)$ . Note that in particular this implies that $\textnormal{spr}(\eta\ell)$ is locally embeddable in $\textnormal{spr}(\lambda)$ , and hence for each $p\in\textnormal{int}(\Delta^{n})$ we have

[TABLE]

which, by Lemma 3.6, implies that $\ell$ is $\eta$ -mixable. Thus, combining this with Lemma 4.19, we obtain the following characterization of mixability of proper (sufficiently differentiable) loss functions.

Theorem 4.23.

Let $\ell\in\mathcal{L}_{n}$ be proper. $\ell$ is $\eta$ -mixable if and only if $\textnormal{spr}(\eta\ell)$ slides freely inside $\textnormal{spr}(\lambda)$ , where $\lambda$ denotes the log loss.

In general, the set $\mathcal{L}$ provides a family of loss functions with appealing properties. Arguably, one of the most important is that if $\ell\in\mathcal{L}$ is proper, then its principal curvatures are strictly positive. This is a strong and useful geometric feature. For example, in [WC22] the notion of an “inverse loss” called the anti-polar loss was investigated. Given a proper loss $\ell$ (in the sense of [WC22], where losses are not necessarily smooth), they consider the 0-homogeneous extension of $\ell$ (see Remark 26 in [WC22]), defined on $\mathbb{R}^{n}_{>0}$ and given by

$$\ell^{\textnormal{ext}}(p)=\ell\left(\frac{p}{\|p\|_{1}}\right),$$

where $\|p\|_{1}=p_{1}+...+p_{n}$ . In what follows we simply denote $\ell^{\textnormal{ext}}$ by $\ell$ . In [WC22, Proposition 29] it is shown that there exists a map $\ell^{\diamond}\colon\mathbb{R}^{n}_{>0}\longrightarrow\mathbb{R}^{n}_{\geq 0}$ such that

[TABLE]

for all $x,p\in\mathbb{R}^{n}_{>0}$ . The map $\ell^{\diamond}$ is called the anti-polar loss of $\ell$ . For the family of admissible loss functions $\mathcal{L}$ considered in this work, we exploit the differentiability conditions to obtain in a straightforward way an inverse loss defined on $\ell(\textnormal{int}(\Delta^{n}))$ . To see this, suppose that $\ell\in\mathcal{L}$ is proper. This is equivalent to saying that $\textnormal{spr}(\ell)$ is in $C^{2}_{+}(\mathcal{K}_{*}^{n})$ , meaning that the map $\mathbf{u}_{\textnormal{spr}(\ell)}$ is a $C^{1}$ diffeomorphism. Then we can define the map $\ell^{-1}\colon\ell(\textnormal{int}(\Delta^{n}))\longrightarrow\textnormal{int}(\Delta^{n})$ by

[TABLE]

which is the inverse of the map $\ell\colon\textnormal{int}(\Delta^{n})\longrightarrow\ell(\textnormal{int}(\Delta^{n}))$ . Recall that $\mathbf{u}_{\partial\textnormal{spr}(\ell)}(x)$ is nothing else than the unit normal vector (pointing towards $\mathbb{R}^{n}_{\geq 0}$ ) at $x\in\ell(\textnormal{int}(\Delta^{n}))$ .
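The idea of recovering $p$ from the normal vector at $\ell(p)$ can be illustrated for the binary log loss. The sketch below is a hedged example, not the paper's construction verbatim: it assumes that, for a proper loss, the unit normal to the loss curve at $\ell(p)$ pointing towards $\mathbb{R}^{2}_{\geq 0}$ is $p/|p|$ , so that rescaling the normal to have entries summing to one returns $p$ . The curve is parametrized by $t=p_{1}$ , derivatives are approximated by central differences, and all function names are hypothetical.

```python
import math
from typing import Callable, Tuple

Vec2 = Tuple[float, float]


def log_loss(p1: float) -> Vec2:
    """Binary log loss curve: (loss on class 1, loss on class 2) at p = (p1, 1 - p1)."""
    return (-math.log(p1), -math.log(1.0 - p1))


def normal_towards_positive_orthant(curve: Callable[[float], Vec2],
                                    t: float, h: float = 1e-6) -> Vec2:
    """Unit normal of a planar curve at parameter t, oriented into the positive quadrant."""
    a, b = curve(t - h), curve(t + h)
    tx, ty = (b[0] - a[0]) / (2.0 * h), (b[1] - a[1]) / (2.0 * h)
    nx, ny = -ty, tx                 # rotate the tangent by 90 degrees
    if nx < 0.0 or ny < 0.0:         # flip so that the normal has positive entries
        nx, ny = -nx, -ny
    norm = math.hypot(nx, ny)
    return (nx / norm, ny / norm)


def recover_p(curve: Callable[[float], Vec2], t: float) -> Vec2:
    """Rescale the normal so that its entries sum to one (a point of the simplex)."""
    n = normal_towards_positive_orthant(curve, t)
    s = n[0] + n[1]
    return (n[0] / s, n[1] / s)


p_recovered = recover_p(log_loss, 0.3)   # approximately (0.3, 0.7)
```

For the log loss the tangent at $t$ is $(-1/t,\,1/(1-t))$ , so the positively oriented normal is proportional to $(t,1-t)$ , which is what the code recovers numerically.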

It is of interest to find parametrizations (or links) that simplify the expression of a given proper loss $\ell$ . At a theoretical level there are potentially many ways to do this. Notably, we have at hand the notion of canonical link in [WVR16] (or see Section 2.7 above for $n=2$ ). As an example of other ways to obtain nice links we have Lemma 3.3 above, which gives a nice expression of $\ell$ in coordinates (in the form of a graph). Unfortunately, to obtain that result one makes use of the inverse function theorem, which does not provide an explicit inverse but only its existence.

5. Conclusions

We summarize the main messages of this work.

• Since mixable loss functions are of great importance in prediction games, it is desirable to understand them from different perspectives. Inspired by the work of Vovk [Vov15], in Section 2 we studied binary loss functions from the point of view of differential geometry, hence restricting to loss functions in $\mathcal{L}$ (Definition 2.1). To do this, we re-interpret properness as a geometric property, namely, a loss function $\ell\in\mathcal{L}$ is proper if and only if

– the normal vector (belonging to $\mathbb{R}^{2}_{\geq 0}$ ) to $M_{\ell}=\ell(\textnormal{int}(\Delta^{2}))$ at $\ell(p)$ is $\frac{p}{|p|}$ , for any $p\in\textnormal{int}(\Delta^{2})$ , and

– the loss curve $\ell(\textnormal{int}(\Delta^{2}))$ has positive curvature (with respect to $\frac{p}{|p|}$ ).

Having this framework at hand, we characterized mixability and fundamentality of a proper loss $\ell\in\mathcal{L}$ , as a curvature comparison to the log loss $\ell_{\log}$ (cf. [Vov15]).

• In Section 3, we extended the geometric characterization of proper loss functions to higher dimensions, and obtained the corresponding interpretation of mixability as a geometric comparison (now in terms of the principal curvatures of the “loss surface”). This comparison is done by using the second fundamental forms of the “loss surfaces”.

• The main goal of Section 4 is to re-interpret the geometric results in Section 3 from the point of view of convex geometry. The main result in this part is a new characterization of $\eta$ -mixability of a proper loss function $\ell\in\mathcal{L}$ , as $\textnormal{spr}(\eta\ell)$ sliding freely inside $\textnormal{spr}(\ell_{\log})$ (in general dimension). This provides an intuitive and geometric way to interpret mixability.

• Since the results obtained in this work are in terms of curvature, it was necessary to re-interpret well known properties of loss functions in the language of differential geometry. Although this task might seem tedious at first, it is well worth it since it reconciles the results obtained by Vovk [Vov15] for $n=2$ and by van Erven, Reid and Williamson [vERW12] for $n\geq 2$ .

• It is worth pointing out the relation of this work with [vERW12]. Specifically, the bridge between these two works established by Theorem 4.22 connects our results to Theorem 10 in [vERW12] in the following way. In [vERW12, Theorem 10] the following statements are proven to be equivalent:

(i) a proper loss $\ell\in L$ is $\eta$ -mixable,

(ii) $\eta H\widetilde{\underline{L}}(t)-H\widetilde{\underline{L}}_{\log}(t)$ is positive semi-definite for all $t\in\Phi^{-1}_{\textnormal{std}}(\textnormal{int}(\Delta^{n}))$ , where $HF(t)$ denotes the Hessian of $F$ at $t$ ,

(iii) $\eta\underline{L}(p)-\underline{L}_{\log}(p)$ is convex on $\textnormal{int}(\Delta^{n})$ , and

(iv) $\eta\widetilde{\underline{L}}(p)-\widetilde{\underline{L}}_{\log}(p)$ is convex on $\Phi_{\textnormal{std}}^{-1}(\textnormal{int}(\Delta^{n}))$ .

There, they first proved the equivalence of (i) and (ii), which is the result of a long direct computation done very carefully. The equivalence between (iii) and (iv) is straightforward. To connect these two sets of equivalences, standard convex geometry is used to prove the equivalence of (ii) and (iii). Note that the statements (ii) and (iv) make reference to a precise choice of parametrization of $\Delta^{n}$ (i.e., the standard parametrization $\Phi_{\textnormal{std}}$ ), therefore, the work presented here is naturally not related to these statements but rather to (i) and (iii), whose equivalence can be considered to be the content of Sections 3 and 4. Determining whether this new approach provides a simplification of the computations in [vERW12] or not, strongly depends on the differential geometry and convex geometry background of the reader. This work should be considered as complementing the understanding of mixable loss functions and providing a new geometric insight into them.

Appendix A Differential Geometry

In this part we provide a brief summary of the concepts of differential geometry that are used in this work (we assume the reader has some familiarity with the topic although we try to put emphasis on the intuition). We do not intend to give a comprehensive introduction to the topic. Most of the material can be found in almost any differential geometry book, however, we recommend (and when possible use the notation of) [dC16] and [Lee18].

A.1. Curvature of Curves

A parametrized curve is a differentiable map $\alpha\colon(a,b)\to\mathbb{R}^{n}$ , ( $a<b$ ). We are interested in studying the geometry of parametrized curves. For this it is useful to restrict our discussion to curves with a well defined tangent line at every point $\alpha(t)$ for $t\in(a,b)$ (i.e., with non-vanishing $\alpha^{\prime}(t)$ ). These curves are called regular. Let $\varphi\colon(c,d)\to(a,b)$ be a diffeomorphism; the curve $\beta(s)=\alpha(\varphi(s))$ , $s\in(c,d)$ , is a reparametrization of $\alpha$ . Note that in this case $\alpha((a,b))=\beta((c,d))$ . The image $M=\alpha((a,b))$ is a 1-dimensional differentiable manifold in $\mathbb{R}^{n}$ (for this it is essential to restrict to regular curves). The study of curves is of particular importance since some aspects carry over to the study of the geometry of general hypersurfaces in $\mathbb{R}^{n}$ .

Typically, curvature is defined for curves parametrized by arc-length, meaning that $|\beta^{\prime}(s)|=1$ for all $s\in(c,d)$ (a regular curve can always be parametrized this way). For these types of curves, the curvature of $\beta$ at $\beta(s)$ is defined as the length of $\beta^{\prime\prime}(s)$ , which measures “how much” a curve “curves”. However, this notion does not give information about the direction in which a curve is “curving”. We start by looking at the case $n=2$ . We define the signed curvature of a general regular curve $\alpha(t)=(x_{1}(t),x_{2}(t))$ by

$$\kappa(t)=\frac{x_{1}^{\prime}(t)x_{2}^{\prime\prime}(t)-x_{1}^{\prime\prime}(t)x_{2}^{\prime}(t)}{\left(x_{1}^{\prime}(t)^{2}+x_{2}^{\prime}(t)^{2}\right)^{3/2}}.$$

It can be checked that $|\kappa(t)|$ coincides with the curvature of $\alpha$ when parametrized by arc-length (at the corresponding point). The signed curvature is well defined only up to a sign: the sign changes if we consider a reparametrization that reverses the orientation of $(a,b)$ , for example the curve defined on $(-b,-a)$ given by $\beta(s)=\alpha(-s)$ . This motivates the discussion in Section 1.5.

For example, suppose that a planar curve is defined by a function $f\colon(a,b)\longrightarrow\mathbb{R}$ in the following way:

$$\alpha(t)=(t,f(t)),$$

for $t\in(a,b)$ . A quick computation gives

$$\kappa(t)=\frac{f^{\prime\prime}(t)}{\left(1+f^{\prime}(t)^{2}\right)^{3/2}}.$$

Given a regular curve $\alpha\colon(a,b)\longrightarrow\mathbb{R}^{2}$ as above and a real number $\eta\neq 0$ , it is straightforward to see that the curve $\beta(t)=\eta\alpha(t)$ is also a regular curve and its signed curvature is given by

[TABLE]

The notion of signed curvature can be extended to curves in manifolds sitting inside $\mathbb{R}^{n}$ (see for example [Lee18, Chapter 8]). For $\alpha\colon(-\varepsilon,\varepsilon)\longrightarrow\mathbb{R}^{n}$ parametrized by arc-length, the signed curvature (with respect to $\mathbf{n}$ ) $\kappa^{+}_{\alpha}$ of $\alpha$ at $p=\alpha(0)$ is given by $\kappa^{+}_{\alpha}(0)=\langle\mathbf{n},\alpha^{\prime\prime}(0)\rangle$ . It can be shown that this definition agrees with the one we gave for $n=2$ .
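The behaviour of the signed curvature under the operations discussed in this subsection can be checked on a concrete curve. The following is a small sketch, assuming the standard planar formula $\kappa=(x_{1}^{\prime}x_{2}^{\prime\prime}-x_{1}^{\prime\prime}x_{2}^{\prime})/(x_{1}^{\prime 2}+x_{2}^{\prime 2})^{3/2}$ and using the graph of $f(t)=t^{2}$ as a hypothetical example; only the case $\eta>0$ of the scaling is illustrated.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def signed_curvature(velocity: Vec2, acceleration: Vec2) -> float:
    """Signed curvature of a planar curve from alpha'(t) and alpha''(t)."""
    (x1p, x2p), (x1pp, x2pp) = velocity, acceleration
    return (x1p * x2pp - x1pp * x2p) / math.hypot(x1p, x2p) ** 3


# Graph of f(t) = t^2, alpha(t) = (t, t^2): alpha' = (1, 2t), alpha'' = (0, 2).
kappa_at_0 = signed_curvature((1.0, 0.0), (0.0, 2.0))

# Same curve at t = 1/2, compared with f'' / (1 + f'^2)^(3/2).
kappa_at_half = signed_curvature((1.0, 1.0), (0.0, 2.0))
graph_formula_at_half = 2.0 / (1.0 + 1.0 ** 2) ** 1.5

# Scaling beta = eta * alpha with eta > 0 scales both derivatives by eta.
eta = 4.0
kappa_scaled = signed_curvature((eta * 1.0, 0.0), (0.0, eta * 2.0))

# Orientation reversal beta(s) = alpha(-s): beta'(0) = -alpha'(0), beta''(0) = alpha''(0).
kappa_reversed = signed_curvature((-1.0, 0.0), (0.0, 2.0))
```

With these inputs the scaled curve has curvature $\kappa/\eta$ at the corresponding point, and reversing the orientation flips the sign, which is the sign ambiguity mentioned above.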

A.2. Geometry of hypersurfaces in $\mathbb{R}^{n}$

Let $M$ be a differentiable hypersurface inside $\mathbb{R}^{n}$ of class $C^{k}$ (i.e., an $(n-1)$ -dimensional $C^{k}$ manifold). By this we mean that for each $p\in M$ there is an open set $U\subset\mathbb{R}^{n-1}$ and a $C^{k}$ injective map $\Phi\colon U\longrightarrow M$ (called a parametrization of $M$ around $p$ ). For each $x\in U$ , $\{\partial_{1}\Phi(x),...,\partial_{n-1}\Phi(x)\}$ forms a basis for the tangent space $T_{q}M$ to $M$ at $q=\Phi(x)$ . Since $\Phi(U)\subset\mathbb{R}^{n}$ we can consider the metric induced on $M$ by the Euclidean metric in $\mathbb{R}^{n}$ (denoted by $\langle\cdot,\cdot\rangle$ ). This is a Riemannian metric on $M$ , expressed in the coordinates given by $\Phi$ by the matrix

$$g_{ij}(x)=\langle\partial_{i}\Phi(x),\partial_{j}\Phi(x)\rangle,\qquad i,j=1,...,n-1,$$

for $x\in U$ . The metric $g$ allows us to define the length of curves in $M$ .

In general, if a manifold $M$ of dimension $n-1$ is sitting inside an $n$ -dimensional Riemannian manifold $\overline{M}$ (and $M$ is endowed with the induced metric from $\overline{M}$ ) the second fundamental form carries the information on how $M$ is “curved” inside $\overline{M}$ . Let $\overline{g}$ be the metric on $\overline{M}$ and $g$ the induced metric on $M$ by $\overline{g}$ . Let $\overline{\nabla}$ denote the Levi–Civita connection of $\overline{g}$ . Let $\mathbf{n}$ be a smooth unit normal vector field to $M$ (that is, $\mathbf{n}(p)$ is perpendicular to $T_{p}M$ for each $p\in M$ ). The scalar second fundamental form of $M$ with respect to $\mathbf{n}$ is the covariant 2-tensor $h$ on $M$ defined as

$$h(X,Y)=\overline{g}\left(\overline{\nabla}_{X}Y,\mathbf{n}\right),$$

for $X,Y$ tangent vectors to $M$ . Note that for a hypersurface, at each point $p$ we have exactly two unit normal vectors to $M$ , thus the scalar second fundamental form is well-defined up to a sign. Fixing a point $p\in M$ and an orthonormal basis $\{E_{1},...,E_{n-1}\}$ for the tangent space $T_{p}M$ at $p$ , the eigenvalues of the matrix given by $h_{ij}=h(E_{i},E_{j})$ for $i,j=1,...,n-1$ are called the principal curvatures of $M$ at $p$ and the corresponding eigenspaces are called the principal directions. For details of the above see Chapter 8 in [Lee18].

When $\overline{M}=\mathbb{R}^{n}$ and $M$ is parametrized by $\Phi\colon U\subset\mathbb{R}^{n-1}\longrightarrow M\subset\mathbb{R}^{n}$ , with respect to the local frame $\{\partial_{1}\Phi,...,\partial_{n-1}\Phi\}$ of $\Phi(U)$ , the scalar second fundamental form with respect to a normal unit vector field $\mathbf{n}$ is given by ([Lee18, Proposition 8.23])

$$h_{ij}=\left\langle\partial_{i}\partial_{j}\Phi,\mathbf{n}\right\rangle,$$

for $i,j=1,...,n-1$ .

Given any $p\in M$ and $v\in T_{p}M$ , there is a geodesic $\gamma_{v}\colon(a,b)\longrightarrow M$ of $M$ passing through $p$ with velocity $v$ at $p$ . Let $M_{1}$ and $M_{2}$ be two hypersurfaces in $\mathbb{R}^{n+1}$ tangent at a point $p\in M_{1}\cap M_{2}$ . Choose a normal vector $\mathbf{n}$ and suppose that $M_{1}$ lies above $M_{2}$ (with respect to $\mathbf{n}$ ). We have the following lemma from [Lee18].

With the previous lemma we can obtain a comparison result for manifolds with positive principal curvatures.

Lemma A.1.

Suppose that $M_{1}$ and $M_{2}$ are tangent at $p\in M_{1}\cap M_{2}$ and fix a normal vector $\mathbf{n}$ at $p$ . Suppose that $M_{1}$ and $M_{2}$ have positive principal curvatures at $p$ . Then $h_{1}(v,v)\geq h_{2}(v,v)$ for all $v\in T_{p}M$ if and only if $M_{1}$ lies above $M_{2}$ (with respect to $\mathbf{n}$ ) locally around $p$ .

Proof.

First we make the following observation. Suppose that $M$ is a smooth hypersurface in $\mathbb{R}^{n}$ and we have a regular curve $\alpha\colon(-\varepsilon,\varepsilon)\longrightarrow M$ such that $\alpha(0)=p$ and $\alpha^{\prime}(0)=v$ for some $p\in M$ and $v\in T_{p}M$ . Then, letting $h$ denote the scalar second fundamental form of $M$ defined above, we have

[TABLE]

Thus, if $\alpha$ is parametrized by arc-length, $h(v,v)=\langle\mathbf{n},\alpha^{\prime\prime}(0)\rangle=\kappa_{\alpha}^{+}(0)$ .

Suppose $M_{1}$ and $M_{2}$ are tangent at $p$ and let $v\in T_{p}M_{1}=T_{p}M_{2}$ with $|v|=1$ . Then we can intersect $M_{1}$ and $M_{2}$ with the plane generated by $v$ and $\mathbf{n}$ . We obtain two curves $\alpha_{1}$ and $\alpha_{2}$ on $M_{1}$ and $M_{2}$ , respectively, such that $\alpha_{i}(0)=p$ and $\alpha_{i}^{\prime}(0)=v$ for $i=1,2$ . Moreover, we can assume that these curves are parametrized by arc-length so that their Euclidean curvature is given by $\langle\alpha^{\prime\prime}_{i}(0),\mathbf{n}\rangle$ . Since we can regard these curves as planar curves, there are functions $f_{1}$ and $f_{2}$ such that the curves $\alpha_{1}$ and $\alpha_{2}$ are represented in the plane $\langle v,\mathbf{n}\rangle$ by the curves

$$\gamma_{i}(t)=(t,f_{i}(t)),$$

with $f_{i}(0)=0$ , $f_{i}^{\prime}(0)=0$ , $f_{i}^{\prime\prime}(0)>0$ (since $M_{1}$ and $M_{2}$ have positive principal curvatures at $p$ ) for $i=1,2$ . By construction $\kappa_{\gamma_{i}}^{+}(0)=f_{i}^{\prime\prime}(0)$ and by definition $\kappa_{\gamma_{i}}^{+}(0)=\langle\alpha_{i}^{\prime\prime}(0),\mathbf{n}\rangle$ , for $i=1,2$ .

If $M_{1}$ lies above $M_{2}$ at $p$ , then $f_{1}\geq f_{2}$ near $0$ , so $f_{1}^{\prime\prime}(0)\geq f_{2}^{\prime\prime}(0)$ and hence $\kappa_{\gamma_{1}}^{+}(0)\geq\kappa_{\gamma_{2}}^{+}(0)$ , which is equivalent to $h_{1}(v,v)\geq h_{2}(v,v)$ for any $v\in T_{p}M$ with $|v|=1$ . Let $w\neq 0\in T_{p}M$ be arbitrary, then

$$h_{1}(w,w)=|w|^{2}\,h_{1}\left(\frac{w}{|w|},\frac{w}{|w|}\right)\geq|w|^{2}\,h_{2}\left(\frac{w}{|w|},\frac{w}{|w|}\right)=h_{2}(w,w),$$

as claimed.

Conversely, if $h_{1}(v,v)\geq h_{2}(v,v)$ holds for all $v\in T_{p}M$ , then in particular it holds for unit vectors $v$ , which means that $f_{1}^{\prime\prime}(0)\geq f_{2}^{\prime\prime}(0)$ for all unit $v\in T_{p}M$ . This implies that $M_{1}$ lies above $M_{2}$ . ∎

We present the following instructive example.

Example A.2.

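Consider the differentiable function $f_{\kappa}(x,y)=\kappa(x^{2}+y^{2})$ with $\kappa>0$ , and let $M_{\kappa}=\{(x,y,f_{\kappa}(x,y))\,|\,(x,y)\in B_{1}(0)\}$ . We choose the parametrization $\Phi_{\kappa}(x,y)=(x,y,f_{\kappa}(x,y))$ of $M_{\kappa}$ and compute the scalar second fundamental form of $M_{\kappa}$ at $p=(0,0,0)$ in these coordinates. We have

$$\partial_{x}\Phi_{\kappa}=(1,0,2\kappa x),\quad\partial_{y}\Phi_{\kappa}=(0,1,2\kappa y),\quad\partial_{x}\partial_{x}\Phi_{\kappa}=\partial_{y}\partial_{y}\Phi_{\kappa}=(0,0,2\kappa),\quad\partial_{x}\partial_{y}\Phi_{\kappa}=(0,0,0),$$

thus, from the coordinate expression of the scalar second fundamental form ([Lee18, Proposition 8.23]), at the point $\Phi_{\kappa}(0,0)=(0,0,0)$ the scalar second fundamental form of $M_{\kappa}$ with respect to $\mathbf{n}=(0,0,1)$ is given by

$$h^{\kappa}=\begin{pmatrix}2\kappa&0\\0&2\kappa\end{pmatrix},$$

and in particular for $\kappa=1$ we have

$$h^{1}=\begin{pmatrix}2&0\\0&2\end{pmatrix}.$$

Thus, clearly we have

$$h^{\kappa}-h^{1}=\begin{pmatrix}2(\kappa-1)&0\\0&2(\kappa-1)\end{pmatrix},$$

which is positive definite if and only if $\kappa>1$ (when $M_{\kappa}$ lies inside $M_{1}$ and they are tangent at $p$ ).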
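The computation in this example can be reproduced numerically. The sketch below is an illustration under stated assumptions: it uses the coordinate expression $h_{ij}=\langle\partial_{i}\partial_{j}\Phi,\mathbf{n}\rangle$ from [Lee18, Proposition 8.23], approximates the second partial derivatives of $\Phi_{\kappa}$ at the origin by central differences, and takes $\mathbf{n}=(0,0,1)$ . All function names are hypothetical.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def phi(kappa: float, x: float, y: float) -> Vec3:
    """Parametrization of the paraboloid M_kappa as a graph over the (x, y)-plane."""
    return (x, y, kappa * (x * x + y * y))


def second_partial(surface: Callable[[float, float], Vec3], i: int, j: int,
                   point: Tuple[float, float] = (0.0, 0.0), h: float = 1e-3) -> Vec3:
    """Central-difference approximation of the second partial d_i d_j of the surface."""
    def shifted(si: int, sj: int) -> Vec3:
        q = list(point)
        q[i] += si * h
        q[j] += sj * h
        return surface(*q)

    pp, pm, mp, mm = shifted(1, 1), shifted(1, -1), shifted(-1, 1), shifted(-1, -1)
    return tuple((a - b - c + d) / (4.0 * h * h) for a, b, c, d in zip(pp, pm, mp, mm))


def second_fundamental_form(kappa: float,
                            normal: Vec3 = (0.0, 0.0, 1.0)) -> List[List[float]]:
    """Matrix h_ij = <d_i d_j Phi_kappa, n> at the origin."""
    def surface(x: float, y: float) -> Vec3:
        return phi(kappa, x, y)

    return [[sum(c * n for c, n in zip(second_partial(surface, i, j), normal))
             for j in range(2)] for i in range(2)]


def difference(kappa: float) -> List[List[float]]:
    """The matrix h^kappa - h^1 at the common tangent point."""
    hk, h1 = second_fundamental_form(kappa), second_fundamental_form(1.0)
    return [[hk[i][j] - h1[i][j] for j in range(2)] for i in range(2)]


def is_positive_definite(m: List[List[float]], tol: float = 1e-9) -> bool:
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    return m[0][0] > tol and m[0][0] * m[1][1] - m[0][1] * m[1][0] > tol


h3 = second_fundamental_form(3.0)   # approximately [[6, 0], [0, 6]]
```

Because $\Phi_{\kappa}$ is quadratic, the central differences are exact up to rounding, so `is_positive_definite(difference(kappa))` holds for $\kappa>1$ and fails for $\kappa\leq 1$ on the sample values tried.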

Remark A.3.

We stress a technical observation. The comparison of second fundamental forms in Example A.2 is valid since, regardless of the value of $\kappa$ , $\partial_{x}\Phi_{\kappa}(0,0)$ and $\partial_{y}\Phi_{\kappa}(0,0)$ are the same, meaning that we can identify the tangent spaces to $M_{\kappa}$ and $M_{1}$ at $p$ for all $\kappa$ , and the basis for them is given by $\{\partial_{x}\Phi_{1}(0,0),\partial_{y}\Phi_{1}(0,0)\}$ . In general this is not necessarily the case, so one should perform a change of basis before comparing the second fundamental forms.

Bibliography


[BSS05] Andreas Buja, Werner Stuetzle, and Yi Shen. Loss functions for binary class probability estimation and classification: Structure and applications. Technical report, University of Pennsylvania, 2005.
[dC16] Manfredo P. do Carmo. Differential geometry of curves & surfaces. Dover Publications, Inc., Mineola, NY, 2016. Revised & updated second edition of [MR0394451].
[HKW95] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Tight worst-case loss bounds for predicting with expert advice. In Computational learning theory (Barcelona, 1995), volume 904 of Lecture Notes in Comput. Sci., pages 69–83. Springer, Berlin, 1995.
[HKW98] David Haussler, Jyrki Kivinen, and Manfred K. Warmuth. Sequential prediction of individual sequences under general loss functions. IEEE Transactions on Information Theory, 44(5):1906–1925, 1998.
[Lee18] John M. Lee. Introduction to Riemannian manifolds, volume 176 of Graduate Texts in Mathematics. Springer, Cham, 2018. Second edition of [MR1468735].
[MW18] Zakaria Mhammedi and Robert C. Williamson. Constant regret, generalized mixability, and mirror descent. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
[RFWM15] Mark D. Reid, Rafael M. Frongillo, Robert C. Williamson, and Nishant Mehta. Generalized mixability via entropic duality. In Peter Grünwald, Elad Hazan, and Satyen Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1501–1522, Paris, France, 03–06 Jul 2015. PMLR.
[Roc70] R. Tyrrell Rockafellar. Convex analysis. Princeton Mathematical Series, No. 28. Princeton University Press, Princeton, N.J., 1970.
