Information Losses in Neural Classifiers from Sampling

Brandon Foggo; Nanpeng Yu; Jie Shi; Yuanqi Gao

arXiv:1902.05991·cs.LG·January 9, 2020

TL;DR

This paper investigates how finite training datasets cause information loss in neural classifiers, providing bounds that are less sensitive to input compression and align well with experimental observations.

Contribution

It establishes a relationship between information loss and total variation of neural models, deriving dataset size bounds that improve upon previous bounds without relying on model complexity.

Findings

01. Bounds on information loss are smaller and less sensitive to input compression.

02. The bounds align well with experimental results on neural network information compression.

03. Theoretical insights explain recent experimental observations of information compression.

Equations

$\hat{I}(X;Y):=\mathbb{E}_{\hat{\mathbb{P}}_{XY}}\log\frac{d\hat{\mathbb{P}}_{XY}}{d(\mathbb{P}_{X}\otimes\hat{\mathbb{P}}_{Y})}$

$h_{2}(P_{e})+P_{e}\log_{2}(|\mathcal{Y}|-1)\geq H(Y)-I(Y;Z)$

$I_{Loss}^{(1)}\triangleq|I(Y;Z)-\hat{I}(Y;Z)|$

$I_{Loss}^{(1)}\leq O\left(\frac{|\mathcal{Y}|}{2m}2^{I(X;Z)}\right)$

$\sup_{p(z|x)}$ subject to $I(X;Z)=I$

$I_{Loss,\epsilon}^{(2)}(I)\triangleq I(Y;Z_{\epsilon}^{*}(I))-I(Y;\hat{Z}_{\epsilon}(I))$

$I_{Loss,\epsilon}^{(2)}(I)\leq 2K(\cdot)+\epsilon$

$m_{l}(a,b):=\min\{p_{Y|X}(b|a),\hat{p}_{Y|X}(b|a)\}$

$\rho:=\int\left(\sum_{y}m_{l}(x,y)\right)d\mathbb{P}_{X}$

$p_{U_{1},U_{2}}(u_{1},u_{2})$

$p_{V_{1},V_{2}}(v_{1},v_{2})$

$p_{W_{1},W_{2}}(w_{1},w_{2})$

$\begin{cases}(\tilde{X},\tilde{Y})=(\hat{X},\hat{Y})=(U_{1},U_{2}),&\text{if }J=1\\(\tilde{X},\tilde{Y})=(V_{1},V_{2}),\ (\hat{X},\hat{Y})=(W_{1},W_{2}),&\text{if }J=0\end{cases}$

$\gamma_{\hat{Z}|\hat{X}}=\gamma_{\tilde{Z}|\tilde{X}}=p_{Z|X}$

$1-\rho=\gamma(\tilde{Y}\neq\hat{Y}\,|\,\tilde{X}=\hat{X})=\mathbb{E}_{\mathbb{P}_{X}}\left[\frac{1}{2}\sum_{y}|p(y|x)-\hat{p}(y|x)|\right]$

$\left|I(Y;Z)-\hat{I}(Y;Z)\right|\leq\bar{\delta}(\hat{\mathbb{P}})I(X;Z)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$

$I(Y;Z)-\hat{I}(Y;Z)=I(\tilde{Y};\tilde{Z})-I(\hat{Y};\hat{Z})$

$I(\tilde{Y};\tilde{Z})=I(\tilde{Y};\tilde{Z}|\tilde{X})+I(\tilde{X};\tilde{Z})-I(\tilde{X};\tilde{Z}|\tilde{Y})$

$I(\hat{Y};\hat{Z})=I(\hat{Y};\hat{Z}|\hat{X})+I(\hat{X};\hat{Z})-I(\hat{X};\hat{Z}|\hat{Y})$

$I(\tilde{Y};\tilde{Z})-I(\hat{Y};\hat{Z})=I(\hat{X};\hat{Z}|\hat{Y})-I(\tilde{X};\tilde{Z}|\tilde{Y})$

$I(\hat{X};\hat{Z}|\hat{Y})=I(\hat{Z};\hat{X}|J,\hat{Y})+I(\hat{Z};J|\hat{Y})-I(\hat{Z};J|\hat{X},\hat{Y})$

$I(\tilde{X};\tilde{Z}|\tilde{Y})=I(\tilde{Z};\tilde{X}|J,\tilde{Y})+I(\tilde{Z};J|\tilde{Y})-I(\tilde{Z};J|\tilde{X},\tilde{Y})$

$I(\hat{Z};\hat{X}|J,\hat{Y})=\rho I(\hat{Z};\hat{X}|J=1,\hat{Y})+(1-\rho)I(\hat{Z};\hat{X}|J=0,\hat{Y})=\rho I(\hat{Z};U_{1}|U_{2})+\bar{\delta}(\hat{\mathbb{P}})I(\hat{Z};W_{1}|W_{2})$

$I(\tilde{Z};\tilde{X}|J,\tilde{Y})=\rho I(\tilde{Z};U_{1}|U_{2})+\bar{\delta}(\hat{\mathbb{P}})I(\tilde{Z};V_{1}|V_{2})$

$\left|\bar{\delta}(\hat{\mathbb{P}})\left(I(\hat{Z};W_{1}|W_{2})-I(\tilde{Z};V_{1}|V_{2})\right)+I(\hat{Z};J|\hat{Y})-I(\tilde{Z};J|\tilde{Y})\right|$

$I(\hat{Z};W_{1}|W_{2})\leq I(\hat{Z};\hat{X}|W_{2})\leq I(\hat{Z};\hat{X})=I(X;Z)$

$I(\tilde{Z};V_{1}|V_{2})\leq I(\tilde{Z};\tilde{X}|V_{2})\leq I(\tilde{Z};\tilde{X})=I(X;Z)$

$\left|I(\hat{Z};W_{1}|W_{2})-I(\tilde{Z};V_{1}|V_{2})\right|\leq I(X;Z)$

$\left|I(\hat{Z};J|\hat{Y})-I(\tilde{Z};J|\tilde{Y})\right|\leq H(J)=h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$

$\left|I(X;Y)-\hat{I}(X;Y)\right|\leq\bar{\delta}(\hat{\mathbb{P}})H(X)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$

Abstract

This paper considers the subject of information losses arising from the finite datasets used in the training of neural classifiers. It proves a relationship between such losses as the product of the expected total variation of the estimated neural model with the information about the feature space contained in the hidden representation of that model. It then bounds this expected total variation as a function of the size of randomly sampled datasets in a fairly general setting, and without bringing in any additional dependence on model complexity. It ultimately obtains bounds on information losses that are less sensitive to input compression and in general much smaller than existing bounds. The paper then uses these bounds to explain some recent experimental findings of information compression in neural networks which cannot be explained by previous work. Finally, the paper shows that not only are these bounds much smaller than existing ones, but that they also correspond well with experiments.

I Introduction

An estimator is limited to the information that it has about the variable it’s estimating. But this information is limited to what the estimator has seen from the samples training it. The full information of a random variable cannot be transferred to an estimator by finite samples - some information is lost. This paper analyzes such losses for neural network classifiers. Analyzing these losses can lead to improved architecture designs and training data selection strategies, and provide explanations for empirical results in machine learning theory.

The study of these losses as a tool for deep learning theory arose from attempts to understand neural network behavior through the concept of an information bottleneck [1, 2]. This theory was later investigated both analytically [3] and experimentally [4, 5]. Information losses are used, primarily, as an explanatory tool which can act as a supplement to classical statistical learning theory (CSLT), which typically fails to explain the success of deep learning models (for example, deep networks tend to perform better when they have higher VC dimension, while CSLT would predict the opposite). We will further discuss the utility of these losses in section III, and we will denote this newly arising field of deep learning theory as information theoretic deep learning theory (ITDLT).

But this theory is still somewhat incomplete. The reader will find that reference [5] above actually contradicts the others - giving experimental evidence against some of the claims established in the earlier works. In particular, ITDLT, as it previously stood, would claim that neural networks should always act as a lossy compressor of the input data - a claim which arises from bounds on information losses that are exponential in the information content of the final hidden layer of the network (while still being smaller than CSLT bounds for larger networks). But experiments show that this is only sometimes true. While compression does seem to always occur when using saturating activation functions, like sigmoid and tanh, compression in networks using linear and relu activation functions seems to be more nuanced.

But instead of abandoning ITDLT, we believe that the theory can be improved in such a way that it explains all of these experiments. Since most contrary evidence to the theory can be traced to those exponential bounds, we hypothesize that these bounds, while tighter than those of CSLT, are still not quite tight enough to account for every experiment. In this paper, we aim to derive bounds which are much tighter than the existing ones. This will make up the bulk of this paper, and can be found in section IV.

With these new bounds, we will be able to explain the experimental discrepancy found in the above literature, giving detail into why some situations yield neural network compression, even with relu activation functions, and others do not. For example, in the case of low entropy feature spaces, our bounds show that there is simply not enough information to lose such that compression is beneficial. We will illustrate this concept further in section V-A.

This will lead to a better understanding of the information relationships found in neural networks, and to a better understanding of neural networks in general. This better understanding will allow guided development of network architectures and other algorithms which are theoretically sound.

In one critical step to achieving these bounds (Theorem 1), we decompose information losses as a product of a term that mostly depends on network architecture and a term that mostly depends on the training dataset used to train that architecture. This decomposition can thus be applied to network architecture design and training data selection strategies independently. These aspects of applying this theory will be the subject of future work.

Finally, while these new bounds are much tighter than both CSLT bounds and the old ITDLT bounds, and while they are capable of explaining all experiments in the literature, we will see experimentally in section V-B that these bounds are somewhat tighter than they needed to be to achieve our goals.

Section II will address some notations and assumptions that we will use throughout the paper. Section III will provide more details into the literary background and motivation of this work. We conclude in section VI.

II Notation and Assumptions

Capital letters denote random variables. Lower case letters describe instances of the corresponding random variable. Figure 1 depicts the classification model used in this paper. A class variable $y$ generates a feature vector $x$ according to a fixed (unknown) distribution $\mathbb{P}_{X|Y}$ . This feature vector is then fed through a learned distribution $\mathbb{P}_{Z|X}$ , which acts as a lossy compressor of $x$ . This should be thought of as the hidden layers of a neural network. $z$ is then used to form an estimator of $y$ , denoted $\tilde{y}$ . We will drop the subscripts on probability distributions when the context is clear. The calligraphic symbols $\mathcal{X}$ and $\mathcal{Y}$ refer to the set of values that $X$ and $Y$ can take on. We assume that $\mathcal{X}$ is a Polish space such as $\mathbb{R}^{d}$ and that $\mathcal{Y}$ is a finite set with the discrete topology.

This model has three variables of interest, $X,Y$ and $Z$ which satisfy the Markov chain $Y-X-Z$ . We denote the true model as ${\mathbb{P}_{XYZ}=\mathbb{P}_{X}\mathbb{P}_{Z|X}\mathbb{P}_{Y|X}}$ and consider the case of estimating the conditional probability distribution $\mathbb{P}_{Y|X}$ . We denote this estimate as $\hat{\mathbb{P}}_{Y|X}$ and denote the estimated full model as ${\hat{\mathbb{P}}_{XYZ}=\mathbb{P}_{X}\mathbb{P}_{Z|X}\hat{\mathbb{P}}_{Y|X}}$ . We will use the hat notation for all information theoretic quantities referring to the estimated model. For example:

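$$\hat{I}(X;Y):=\mathbb{E}_{\hat{\mathbb{P}}_{XY}}\log\frac{d\hat{\mathbb{P}}_{XY}}{d(\mathbb{P}_{X}\otimes\hat{\mathbb{P}}_{Y})}$$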

Finally, we assume that all distributions can be written as density functions such as $p_{XY}(x,y)$ . We will occasionally drop the variable-specifying subscript when the context is clear. We will assume that the support of $p(x)$ is all of $\mathcal{X}$ .

III Background

III-A The Information Bottleneck Principle

The use of the compressor $p_{Z|X}$ comes from the Information Bottleneck Problem [1] which attempts to find a variable $Z$ that is minimally sufficient for the input pair of variables $(X,Y)$ . The minimal sufficiency of $Z$ refers to the following two properties. First, $X$ and $Y$ must be conditionally independent given $Z$ , or, put in a more enlightening way, ${I(Z;Y)=I(X;Y)}$ . And second, for any other sufficient statistic $T$ , ${I(X;T)\geq I(X;Z)}$ . Intuitively, a minimally sufficient statistic is the most efficient description of $X$ which retains all of the available information about the class variable $Y$ . Further reasons that we wish to find a minimally sufficient statistic will become clear in the following sections.

III-B Information and Generalization

We now focus on the reason for caring about the first aspect of finding a minimally sufficient statistic. That is, on finding a variable such that ${I(Z;Y)=I(X;Y)}$ , or, in a more relaxed form, on finding one such that $I(Z;Y)$ is relatively large. Pursuing this goal is backed by information theory as well as standard estimation theory. On the estimation theory side, this property just amounts to ensuring that $Z$ be a sufficient statistic for $X$ and $Y$ . It thus has importance in finding optimal estimators, for example, through the Rao-Blackwell theorem [6]. On the information theoretic side, if $I(Y;Z)=H(Y)$ , then having an instance $z$ would completely determine the corresponding instance $y$ , and so there exists an estimator of $Y$ that takes $Z$ as input and has zero probability of error. This notion can be expanded to $I(Y;Z)<H(Y)$ by Fano’s inequality and its generalizations [7] [8]. Fano’s inequality provides the following bound on estimation error for any estimator of $Y$ defined as a function of $Z$ :

$$h_{2}(P_{e})+P_{e}\log_{2}(|\mathcal{Y}|-1)\geq H(Y)-I(Y;Z)\qquad(1)$$

where $P_{e}$ is the error rate of the estimator and $h_{2}$ denotes the binary entropy function ${h_{2}(t)=-t\log_{2}(t)-(1-t)\log_{2}(1-t)}$ . This inequality has a left hand side (LHS) that is strictly increasing in $P_{e}$ for $P_{e}\leq\frac{1}{2}$ . Thus the restriction of the LHS to $[0,\frac{1}{2}]$ is invertible, and since $H(Y)$ is fixed, we can say that $P_{e}$ is lower bounded by a monotonically decreasing function of $I(Y;Z)$ . In some cases we do achieve near equality in (1) - particularly when 1.) the estimator performs (nearly) equally well on each class and 2.) the estimator ${Z\to\hat{Y}}$ incurs relatively low levels of compression when compared to that which was incurred in the map ${X\to Z}$ .
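The inversion described above can be sketched numerically. The following is a minimal illustration, not part of the paper: it assumes entropies are measured in bits and uses bisection on $[0,\frac{1}{2}]$ to find the smallest $P_{e}$ compatible with Fano's inequality. The function names (`h2`, `fano_lhs`, `fano_error_lower_bound`) are hypothetical.

```python
import math

def h2(t):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log2(t) - (1.0 - t) * math.log2(1.0 - t)

def fano_lhs(pe, num_classes):
    """Left hand side of Fano's inequality: h2(Pe) + Pe * log2(|Y| - 1)."""
    return h2(pe) + pe * math.log2(num_classes - 1)

def fano_error_lower_bound(h_y, i_yz, num_classes, tol=1e-12):
    """Smallest Pe in [0, 1/2] with fano_lhs(Pe) >= H(Y) - I(Y;Z).

    Assumes num_classes >= 2. If even Pe = 1/2 does not satisfy the
    inequality, 1/2 is returned as a (still valid) lower bound.
    """
    target = h_y - i_yz
    if target <= 0.0:
        return 0.0
    lo, hi = 0.0, 0.5
    if fano_lhs(hi, num_classes) < target:
        return hi
    # The LHS is increasing on [0, 1/2], so bisection finds the crossing point.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano_lhs(mid, num_classes) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, with ten equiprobable classes one can call `fano_error_lower_bound(math.log2(10), 2.5, 10)` and observe that the returned bound shrinks as the second argument, $I(Y;Z)$, grows.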

III-C Information Losses

We now turn to the reason for caring about the second aspect of finding a minimally sufficient statistic - the minimality. This is where the role of our sampled data comes into play, and with it, the concept of information losses.

When we train on a finite sample of data, achieving the first aspect of a minimally sufficient statistic - the sufficiency - becomes difficult. This is because, no matter what representation we choose, we always have an information loss of the form:

$$I_{Loss}^{(1)}\triangleq|I(Y;Z)-\hat{I}(Y;Z)|\qquad(2)$$

(The superscript (1) here is to distinguish between this form of information loss and another form which will appear later. We will call the current form type one information losses.) In choosing our representation, we will only be able to control the latter term in this expression, as that term corresponds to the model we have estimated from our training data. Thus, if this loss is large, then, no matter what we do, we will have trouble in making $I(Y;Z)$ as large as possible.

Throughout this paper, we will find that this term, $I_{Loss}^{(1)}$ , depends on $I(X;Z)$ . In the old bounds (i.e. previous to this paper), its dependence is exponential [9]:

$$I_{Loss}^{(1)}\leq O\left(\frac{|\mathcal{Y}|}{2m}2^{I(X;Z)}\right)\qquad(3)$$

where $m$ is the number of training samples. And so we see that, at least in this form, keeping $I(X;Z)$ low is pertinent.

In this paper, we will find that the dependence on $I(X;Z)$ is relaxed to a linear one. Thus it may not always be so clear that we should minimize $I(X;Z)$ . A perhaps more illuminating perspective can be found if we transfer instead to what we call type two information losses. These relate the best possible representation (in terms of achieving sufficiency) to the one that we would obtain by optimizing $Z$ jointly with our estimated probability distribution. Before describing this new type of information loss, we will need to rigorously define the representations that we qualitatively described in the previous sentence.

Definition 1.

Let $\epsilon>0$ . We denote as $Z_{\epsilon}^{*}(I)$ and $\hat{Z}_{\epsilon}(I)$ any random variables that are at most ${\epsilon\text{-suboptimal}}$ for the following information bottleneck problems respectively:

$$\sup_{p(z|x):\,I(X;Z)=I}I(Y;Z)\qquad\text{and}\qquad\sup_{p(z|x):\,I(X;Z)=I}\hat{I}(Y;Z)$$

We will then define type two information losses as

$$I_{Loss,\epsilon}^{(2)}(I)\triangleq I(Y;Z_{\epsilon}^{*}(I))-I(Y;\hat{Z}_{\epsilon}(I))$$

which is, in general, a function of ${I\triangleq I(X;Z)}$ . Then, rearranging, we see that the quantity we care about, $I(Y;\hat{Z}_{\epsilon}(I))$ , is given by ${I(Y;Z_{\epsilon}^{*}(I))-I_{Loss,\epsilon}^{(2)}(I)}$ , and so picking an $I(X;Z)$ that maximizes this expression is critical, though it may not always result in a direct minimization of $I(X;Z)$ .

In any case, it is easy to convert bounds on type one information losses into corresponding bounds on type two information losses, as we will see in the next lemma.

Lemma 1.

Suppose that we have a bound of the form ${I_{Loss}^{(1)}\leq K(\cdot)}$ , where $K(\cdot)$ can be any function of any number of arguments. Then:

$$I_{Loss,\epsilon}^{(2)}(I)\leq 2K(\cdot)+\epsilon$$

III-D Automatic Implementation via Neural Networks

There is evidence [4][3] that neural networks automatically solve the information bottleneck problem. The first set of evidence is experimental. Authors of [4] found that a wide range of neural networks undergo training in two phases. In the first phase, the neural networks memorized the inputs. This corresponded to an increase of $I(X;Z)$ and $I(Y;Z)$ simultaneously. During this phase, the average magnitude of back-propagated gradients surpassed the variance. In the second phase, this dynamic swapped and the variance surpassed the average. During this phase, $I(Y;Z)$ increased, but $I(X;Z)$ dropped - the neural networks were compressing the input to learn more about $Y$ .

The second set of evidence is theoretical. The authors of [3] show that $I(X;Z)$ is tightly related to the information between the weights and the data $I(W;\mathcal{D}^{l})$ . This relationship holds with only a few assumptions on the corresponding neural network. They then show that $I(W;\mathcal{D}^{l})$ is small when the network converges to a wide local minimum of the cross entropy loss function. Finally, they argue that stochastic gradient descent tends to converge to such minima.

Some more recent experimental evidence [5] counters these two arguments. This new evidence shows that some networks can achieve high $I(Y;Z)$ without compression. Thus some networks can significantly outperform the lower bound of inequality (3). This paper presents new lower bounds which are much tighter and less sensitive to $I(X;Z)$ than (3). These bounds, while useful in their own right, help to explain this counter evidence.

IV New Bounds on Information Losses

We will now move on to deriving the new bounds on information losses.

IV-A Product Form Decomposition - Intuition and Setup

Our first major step is a decomposition of information losses into a product of two terms, one being $I(X;Z)$ , and the other being a term related to a statistical distance between $\mathbb{P}$ and $\hat{\mathbb{P}}$ . The proof of this decomposition takes some setting up. The setup is performed by generalizing the well studied maximal coupling [10] from statistics to our purposes. We will call our generalization the conditional maximal coupling, and will begin its construction by quickly reviewing couplings in general [11].

Definition 2 (Coupling).

Given two probability models $\mathbb{P}_{\tilde{S}}$ and $\mathbb{Q}_{S}$ on a list of variables $S$ , a coupling of these models is a pair of random variables $(\tilde{S},\hat{S})$ with joint distribution $\gamma_{\tilde{S},\hat{S}}$ such that the marginal distributions satisfy $\gamma_{\tilde{S}}=\mathbb{P}_{\tilde{S}}$ and $\gamma_{\hat{S}}=\mathbb{Q}_{S}$ .

Construction 2 (Conditional Maximal Coupling).

We set our coupling $\left((\tilde{X},\tilde{Y},\tilde{Z}),(\hat{X},\hat{Y},\hat{Z})\right)$ as follows. First, define the function ${m_{l}:\mathcal{X}\times\mathcal{Y}\to[0,1]}$ through

$$m_{l}(a,b):=\min\{p_{Y|X}(b|a),\hat{p}_{Y|X}(b|a)\}$$

Next, define a real number $\rho$ as

$$\rho:=\int\left(\sum_{y}m_{l}(x,y)\right)d\mathbb{P}_{X}$$

and define $J$ as a Bernoulli random variable with success probability $\rho$ . Then define variables ${U=(U_{1},U_{2}),V=(V_{1},V_{2})}$ and ${W=({W_{1},W_{2}})}$ through

[TABLE]

Next define $(\tilde{X},\tilde{Y},\hat{X},\hat{Y})$ as functions of the above random variables as follows:

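$$\begin{cases}(\tilde{X},\tilde{Y})=(\hat{X},\hat{Y})=(U_{1},U_{2}),&\text{if }J=1\\(\tilde{X},\tilde{Y})=(V_{1},V_{2}),\ (\hat{X},\hat{Y})=(W_{1},W_{2}),&\text{if }J=0\end{cases}$$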

Finally, we define $\tilde{Z}$ and $\hat{Z}$ through

$$\gamma_{\hat{Z}|\hat{X}}=\gamma_{\tilde{Z}|\tilde{X}}=p_{Z|X}$$

Lemma 2.

Construction 2 yields a valid coupling.

Lemma 3.

The definitions of Construction 2 satisfy the following relationship:

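$$1-\rho=\gamma(\tilde{Y}\neq\hat{Y}\,|\,\tilde{X}=\hat{X})=\mathbb{E}_{\mathbb{P}_{X}}\left[\frac{1}{2}\sum_{y}|p(y|x)-\hat{p}(y|x)|\right]$$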

Motivated by Lemma 3, we will denote $1-\rho$ as $\bar{\delta}(\hat{\mathbb{P}})$ . This notation emphasizes its role as an average total variation distance. This finishes our setup for the decomposition, which we will now move on to prove.
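As a small numerical illustration of this quantity, the sketch below computes $\rho$ as the $\mathbb{P}_{X}$-average of $\sum_{y}\min\{p(y|x),\hat{p}(y|x)\}$ and compares $1-\rho$ with the average total variation $\mathbb{E}_{\mathbb{P}_{X}}[\frac{1}{2}\sum_{y}|p(y|x)-\hat{p}(y|x)|]$. It assumes a finite feature space so that the integral becomes a sum; the function name and the toy numbers are hypothetical and are not taken from the paper.

```python
def conditional_tv(p_x, p_y_given_x, q_y_given_x):
    """Return (rho, delta_bar) for a finite feature space.

    p_x[i]            : P(X = i)
    p_y_given_x[i][j] : true conditional p(y = j | x = i)
    q_y_given_x[i][j] : estimated conditional p_hat(y = j | x = i)
    """
    rho = 0.0
    delta_bar = 0.0
    for px, p_row, q_row in zip(p_x, p_y_given_x, q_y_given_x):
        # Overlap mass: sum over y of min{p, p_hat}, weighted by P(X = x).
        rho += px * sum(min(p, q) for p, q in zip(p_row, q_row))
        # Total variation between p(.|x) and p_hat(.|x), weighted by P(X = x).
        delta_bar += px * 0.5 * sum(abs(p - q) for p, q in zip(p_row, q_row))
    return rho, delta_bar

# Toy example (illustrative values only).
P_X = [0.5, 0.3, 0.2]
P_TRUE = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
P_HAT = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]
rho, delta_bar = conditional_tv(P_X, P_TRUE, P_HAT)
```

The two quantities agree because $\min\{a,b\}=\frac{1}{2}(a+b-|a-b|)$ and each conditional distribution sums to one.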

IV-B Product Form Decomposition - Theorem and Proof

Theorem 1.

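$$\left|I(Y;Z)-\hat{I}(Y;Z)\right|\leq\bar{\delta}(\hat{\mathbb{P}})I(X;Z)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$$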

Proof.

We will use several Markov chains in this proof. All of them follow from the Bayesian network shown in figure 2, which describes the generative process of all relevant random variables. Each Markov chain that we use comes from the fact that the $X$ variables d-separate the $Z$ variables from the rest of the network.

First, via coupling, we have

$$I(Y;Z)-\hat{I}(Y;Z)=I(\tilde{Y};\tilde{Z})-I(\hat{Y};\hat{Z})$$

We decompose the above terms as follows:

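$$I(\tilde{Y};\tilde{Z})=I(\tilde{Y};\tilde{Z}|\tilde{X})+I(\tilde{X};\tilde{Z})-I(\tilde{X};\tilde{Z}|\tilde{Y})$$

$$I(\hat{Y};\hat{Z})=I(\hat{Y};\hat{Z}|\hat{X})+I(\hat{X};\hat{Z})-I(\hat{X};\hat{Z}|\hat{Y})$$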

But, due to the Markov chains ${\tilde{Z}-\tilde{X}-\tilde{Y}}$ and ${\hat{Z}-\hat{X}-\hat{Y}}$ , we have ${I(\tilde{Y};\tilde{Z}|\tilde{X})=I(\hat{Y};\hat{Z}|\hat{X})=0}$ . Furthermore, ${I(\tilde{X};\tilde{Z})=I(\hat{X};\hat{Z})=I(X;Z)}$ , so:

$$I(\tilde{Y};\tilde{Z})-I(\hat{Y};\hat{Z})=I(\hat{X};\hat{Z}|\hat{Y})-I(\tilde{X};\tilde{Z}|\tilde{Y})$$

We can further decompose each of these terms as:

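$$I(\hat{X};\hat{Z}|\hat{Y})=I(\hat{Z};\hat{X}|J,\hat{Y})+I(\hat{Z};J|\hat{Y})-I(\hat{Z};J|\hat{X},\hat{Y})$$

$$I(\tilde{X};\tilde{Z}|\tilde{Y})=I(\tilde{Z};\tilde{X}|J,\tilde{Y})+I(\tilde{Z};J|\tilde{Y})-I(\tilde{Z};J|\tilde{X},\tilde{Y})$$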

But we have from the Markov chains ${\hat{Z}-\hat{X}-J}$ and ${\tilde{Z}-\tilde{X}-J}$ that ${I(\hat{Z};J|\hat{X},\hat{Y})=I(\tilde{Z};J|\tilde{X},\tilde{Y})=0}$ , so these terms will disappear from the decomposition. Next, we can break down the term $I(\hat{Z};\hat{X}|J,\hat{Y})$ to:

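$$I(\hat{Z};\hat{X}|J,\hat{Y})=\rho I(\hat{Z};\hat{X}|J=1,\hat{Y})+(1-\rho)I(\hat{Z};\hat{X}|J=0,\hat{Y})=\rho I(\hat{Z};U_{1}|U_{2})+\bar{\delta}(\hat{\mathbb{P}})I(\hat{Z};W_{1}|W_{2})$$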

and similarly, we can break down:

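$$I(\tilde{Z};\tilde{X}|J,\tilde{Y})=\rho I(\tilde{Z};U_{1}|U_{2})+\bar{\delta}(\hat{\mathbb{P}})I(\tilde{Z};V_{1}|V_{2})$$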

But when ${\tilde{X}=\hat{X}=U_{1}}$ , ${I(\hat{Z};U_{1}|U_{2})=I(\tilde{Z};U_{1}|U_{2})}$ . Thus, in total, ${\left|I(Y;Z)-\hat{I}(Y;Z)\right|}$ is given by:

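$$\left|\bar{\delta}(\hat{\mathbb{P}})\left(I(\hat{Z};W_{1}|W_{2})-I(\tilde{Z};V_{1}|V_{2})\right)+I(\hat{Z};J|\hat{Y})-I(\tilde{Z};J|\tilde{Y})\right|$$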

which can be bounded by the triangle inequality on each inner term.

Now, from the Markov chains ${\hat{Z}-\hat{X}-W_{1}}$ , ${\hat{Z}-\hat{X}-W_{2}}$ , ${\tilde{Z}-\tilde{X}-V_{1}}$ , and ${\tilde{Z}-\tilde{X}-V_{2}}$ , we have (via applications of the data processing inequality and its corollaries [7]):

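$$I(\hat{Z};W_{1}|W_{2})\leq I(\hat{Z};\hat{X}|W_{2})\leq I(\hat{Z};\hat{X})=I(X;Z)$$

$$I(\tilde{Z};V_{1}|V_{2})\leq I(\tilde{Z};\tilde{X}|V_{2})\leq I(\tilde{Z};\tilde{X})=I(X;Z)$$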

Further, $I(\hat{Z};J|\hat{Y})\leq H(J)$ and $I(\tilde{Z};J|\tilde{Y})\leq H(J)$ . Then as ${0\leq a\leq c~{}\wedge~{}0\leq b\leq c\implies|a-b|\leq c}$ , we have

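$$\left|I(\hat{Z};W_{1}|W_{2})-I(\tilde{Z};V_{1}|V_{2})\right|\leq I(X;Z)$$

$$\left|I(\hat{Z};J|\hat{Y})-I(\tilde{Z};J|\tilde{Y})\right|\leq H(J)=h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$$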

And so, in total, we have

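$$\left|I(Y;Z)-\hat{I}(Y;Z)\right|\leq\bar{\delta}(\hat{\mathbb{P}})I(X;Z)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$$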

which completes the proof. ∎
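The bound of Theorem 1, read here as ${|I(Y;Z)-\hat{I}(Y;Z)|\leq\bar{\delta}(\hat{\mathbb{P}})I(X;Z)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))}$ , can be checked on a small discrete model. The sketch below is only an illustration under stated assumptions: finite $\mathcal{X}$ , $\mathcal{Y}$ and $\mathcal{Z}$ , information measured in bits, and joint models built as ${\mathbb{P}_{X}\mathbb{P}_{Z|X}\mathbb{P}_{Y|X}}$ and ${\mathbb{P}_{X}\mathbb{P}_{Z|X}\hat{\mathbb{P}}_{Y|X}}$ . All function names and numbers are hypothetical.

```python
import math

def h2(t):
    """Binary entropy in bits."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    return -t * math.log2(t) - (1.0 - t) * math.log2(1.0 - t)

def mutual_information(joint):
    """Mutual information (bits) of a joint pmf stored as joint[a][b]."""
    rows = [sum(r) for r in joint]
    cols = [sum(r[b] for r in joint) for b in range(len(joint[0]))]
    return sum(
        p * math.log2(p / (rows[a] * cols[b]))
        for a, r in enumerate(joint)
        for b, p in enumerate(r)
        if p > 0.0
    )

def joint_yz(p_x, z_given_x, y_given_x):
    """Joint pmf of (Y, Z) under the Markov chain Y - X - Z."""
    n_y, n_z = len(y_given_x[0]), len(z_given_x[0])
    out = [[0.0] * n_z for _ in range(n_y)]
    for px, y_row, z_row in zip(p_x, y_given_x, z_given_x):
        for y, py in enumerate(y_row):
            for z, pz in enumerate(z_row):
                out[y][z] += px * py * pz
    return out

def theorem1_check(p_x, z_given_x, p_y_given_x, q_y_given_x):
    """Return (type one loss, right hand side of the bound, delta_bar)."""
    i_true = mutual_information(joint_yz(p_x, z_given_x, p_y_given_x))
    i_hat = mutual_information(joint_yz(p_x, z_given_x, q_y_given_x))
    i_xz = mutual_information(
        [[px * pz for pz in row] for px, row in zip(p_x, z_given_x)]
    )
    delta_bar = sum(
        px * 0.5 * sum(abs(p - q) for p, q in zip(p_row, q_row))
        for px, p_row, q_row in zip(p_x, p_y_given_x, q_y_given_x)
    )
    return abs(i_true - i_hat), delta_bar * i_xz + h2(delta_bar), delta_bar

# Toy model (illustrative values only): four feature values, two classes,
# and a two-valued hidden representation.
P_X = [0.25, 0.25, 0.25, 0.25]
Z_GIVEN_X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
P_Y_GIVEN_X = [[0.95, 0.05], [0.9, 0.1], [0.1, 0.9], [0.05, 0.95]]
Q_Y_GIVEN_X = [[0.8, 0.2], [0.7, 0.3], [0.3, 0.7], [0.2, 0.8]]
loss, bound, delta_bar = theorem1_check(P_X, Z_GIVEN_X, P_Y_GIVEN_X, Q_Y_GIVEN_X)
```

In this toy model the right hand side is dominated by the $h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$ term, which is one way to see how much of the bound is controlled by the estimate $\hat{p}(y|x)$ rather than by $I(X;Z)$ .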

A potentially useful special case of this bound occurs when we set $Z=X$ :

Corollary 1.

If $X$ is discrete,

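$$\left|I(X;Y)-\hat{I}(X;Y)\right|\leq\bar{\delta}(\hat{\mathbb{P}})H(X)+h_{2}(\bar{\delta}(\hat{\mathbb{P}}))$$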

But we won’t be using this corollary in the rest of the paper.

IV-C Understanding $\bar{\delta}(\hat{\mathbb{P}})$

The above relationship looks linear in $I(X;Z)$ . However, $\hat{p}(y|x)$ is typically learned jointly with $Z$ and therefore $\bar{\delta}(\hat{\mathbb{P}})$ may itself depend on $I(X;Z)$ . Thus we cannot yet say that this relationship is truly linear, and we certainly cannot yet say that it is tight. Before we can make those claims, we will need to study $\bar{\delta}(\hat{\mathbb{P}})$ explicitly. We will begin with a ‘sanity-check’ lemma. This lemma shows us that $\bar{\delta}(\hat{\mathbb{P}})$ does at least converge with the convergence of a typical neural classifier loss function. It arises from an application of Pinsker’s inequality [12].

Lemma 4.

Suppose that $H(Y|X)=0$ . Then:

[TABLE]

where $H_{\mathbb{P},\hat{\mathbb{P}}}(Y|X)$ is the conditional cross entropy between $\mathbb{P}$ and $\hat{\mathbb{P}}$ , i.e. the usual cross entropy loss function.

This lemma is particularly applicable when we are estimating our cross entropy error on a validation set, as we can then take $\mathbb{P}$ in this lemma to be the empirical measure corresponding to the validation or training sample, in which we are almost certain to have ${H(Y|X)=0}$ . In this sense Lemma 4 can bound such empirical estimates of ${\bar{\delta}(\hat{\mathbb{P}})}$ .
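The role of Pinsker's inequality can be illustrated with a short numerical sketch. This is not the lemma's exact statement, whose display is not reproduced here. It checks the standard form of Pinsker's inequality (total variation at most $\sqrt{\mathrm{KL}/2}$ with the divergence in nats), combined with Jensen's inequality, in the deterministic-label case ${H(Y|X)=0}$ . In that case the total variation at $x$ equals one minus the estimated probability of the true label, and the cross entropy equals the KL divergence. The function name and numbers are hypothetical, and a finite feature space is assumed.

```python
import math

def pinsker_check(p_x, q_correct):
    """Compare average total variation with a Pinsker-type bound.

    p_x[i]       : probability of feature value i
    q_correct[i] : estimated probability p_hat(y_i | x_i) of the true label,
                   where labels are deterministic given x (H(Y|X) = 0)
    Returns (delta_bar, cross_entropy_in_nats, sqrt(cross_entropy / 2)).
    """
    # With a deterministic label, TV(p(.|x), p_hat(.|x)) = 1 - p_hat(y_x | x).
    delta_bar = sum(px * (1.0 - q) for px, q in zip(p_x, q_correct))
    # Conditional cross entropy in nats; equals the KL divergence here.
    cross_entropy = sum(-px * math.log(q) for px, q in zip(p_x, q_correct))
    return delta_bar, cross_entropy, math.sqrt(0.5 * cross_entropy)

# Toy example (illustrative values only).
delta_bar, ce, bound = pinsker_check([0.5, 0.3, 0.2], [0.9, 0.7, 0.6])
```

As the cross entropy loss is driven toward zero, the bound, and hence $\bar{\delta}(\hat{\mathbb{P}})$ , is driven toward zero as well.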

IV-D Bounding $\bar{\delta}(\hat{\mathbb{P}})$ - Setting

Finally, we will derive a rate of decrease for $\bar{\delta}(\hat{\mathbb{P}})$ in a general continuous learning algorithm. Our setup will involve defining a learning algorithm as a continuous map from a special topology on input probability measures on $\mathcal{X}\times\mathcal{Y}$ to conditional probability functions. This is basically to say that, given a training dataset (i.e. an empirical measure on $\mathcal{X}\times\mathcal{Y}$ ), we have a well-behaved way of obtaining the corresponding $\hat{p}_{\nu}(y|x)$ . This is just slightly generalized so that we can consider any input measure (empirical or not) as a ‘training dataset’. We begin by reviewing that special topology, and then we will construct the topology that we will place on our output conditional probability distributions.

Definition 3.

Let $M_{1}$ denote the set of Borel probability measures on $\mathcal{X}\times\mathcal{Y}$ . Then the $\tau$ -topology [13] (page 263) is the topology generated by the sets ${W_{f,r,c}=\{\nu:|\int fd\nu-r|<c\}}$ for all bounded Borel measurable functions ${f:\mathcal{X}\times\mathcal{Y}\to\mathbb{R}}$ , all $r\in\mathbb{R}$ and all $c>0$ . If we restrict $f$ to bounded continuous functions, we get the weak topology $\mathcal{W}$ , which is strictly coarser than the $\tau$ -topology.

Definition 4.

Let $\Sigma_{|\mathcal{Y}|}$ be the probability simplex in $|\mathcal{Y}|$ dimensions. Let $L^{1}(\mathcal{X})$ denote the space of absolutely integrable functions from $\mathcal{X}$ to $\mathbb{R}$ with norm ${\|f\|_{L_{1}}=\int|f|d\mathbb{P}_{X}}$ . Let $L^{1}(\mathcal{X})^{|\mathcal{Y}|}$ denote the product space on $L^{1}(\mathcal{X})$ , consisting of functions from $\mathcal{X}$ to $\mathbb{R}^{|\mathcal{Y}|}$ which are absolutely integrable in each output dimension, and with norm ${\|f\|_{L^{|\mathcal{Y}|}_{1}}=\frac{1}{2}\int\sum_{y}|f(x,y)|d\mathbb{P}_{X}}$ . Finally, let $L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})$ denote the subspace of $L^{1}(\mathcal{X})^{|\mathcal{Y}|}$ consisting of the functions whose co-domain is $\Sigma_{|\mathcal{Y}|}$ .

The topology we’ve placed on $L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})$ is metrized by the conditional total variation function that we’ve been working with. With these topologies defined, we will restrict ourselves to the study of algorithms which act as continuous maps between these topologies. This essentially requires that, when our training datasets are very similar (e.g. moving one training point to a point within a distance $\epsilon$ from the original), our algorithm will return very similar output functions in terms of conditional total variation. Thus this condition is somewhat related to algorithmic stability [14], though not completely equivalent.

We will obtain two bounds on $\bar{\delta}(\hat{\mathbb{P}})$ in the remainder of this paper. The first is asymptotic, and applies when we have continuity from the $\tau$ -topology. The second is non-asymptotic, and applies when we further have continuity from the weak topology. We will next show that gradient descent algorithms, under mild conditions, achieve these continuities.

Theorem 2.

Let $\Theta$ denote a normed parameter space and let ${\mathcal{L}:\mathcal{X}\times\mathcal{Y}\times\Theta\to\mathbb{R}}$ denote a loss function which is integrable in $\mathcal{X}\times\mathcal{Y}$ for each $\theta\in\Theta$ , which is differentiable with respect to $\theta$ for all ${(x,y)\in\mathcal{X}\times\mathcal{Y}}$ , and whose $\theta$ -gradients yield bounded continuous functions on $\mathcal{X}\times\mathcal{Y}$ when evaluated at each point ${\theta\in\Theta}$ . Suppose further that our parameter space admits Lipschitz-continuous outputs for each $(x,y)$ . That is, ${|p_{\theta_{1}}(y|x)-p_{\theta_{2}}(y|x)|<L\|\theta_{1}-\theta_{2}\|~{}\forall(x,y)\in\mathcal{X}\times\mathcal{Y}}$ . Then gradient descent applied to the empirical risk minimization of $\mathcal{L}$ , with a fixed initialization $\theta^{(0)}$ and which proceeds for a fixed number of iterations, is continuous from $(M_{1},\mathcal{W})$ to $L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})$ .

If we relax the condition that the $\theta$ gradients of $\mathcal{L}$ be bounded continuous functions on $\mathcal{X}\times\mathcal{Y}$ when evaluated at each point ${\theta\in\Theta}$ to just bounded measurable functions, then this algorithm is still continuous from $(M_{1},\tau)$ to $L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})$ .

Proof.

The assumptions on $\mathcal{L}$ allow us to differentiate (with respect to $\theta$ ) under the integral sign. Let $\alpha_{k}$ denote the step size of the $k^{th}$ iteration. Let $\nu^{*}\in M_{1}$ . We proceed by induction on the number of iterations.

Let $\epsilon>0$ . Let ${\delta_{1}=\frac{2\epsilon}{L\alpha_{1}|\mathcal{Y}|}}$ . Let $\nu^{*}\in M_{1}$ and let $\nu$ be contained in the open set of the weak topology given by ${\{\nu:|\int\nabla_{\theta^{(0)}}d\nu-\int\nabla_{\theta^{(0)}}d\nu^{*}|<\delta_{1}\}}$ (which clearly contains $\nu^{*}$ ). Let $\theta_{*}^{(1)}$ denote the parameter chosen after one gradient update when training on $\nu^{*}$ , and let $\theta^{(1)}$ denote the parameter chosen after one gradient update when training on $\nu$ . Then:

[TABLE]

so

[TABLE]

and so the hypothesis is true if our algorithm consists of one iteration.

Suppose that the hypothesis holds when we use $(k-1)$ iterations. Let $\epsilon>0$ . Let ${\delta_{k-1}=\frac{\epsilon}{L|\mathcal{Y}|}}$ and let ${\delta_{k}=\frac{\epsilon}{L|\mathcal{Y}|\alpha_{k}}}$ . Choose an open set $U$ of the weak topology such that ${\|\theta_{*}^{(k-1)}-\theta_{c}^{(k-1)}\|\leq\delta_{k-1}}$ when ${\nu_{c}\in U}$ , which is possible by the induction hypothesis, and where $\theta_{*}^{(k-1)}$ and $\theta_{c}^{(k-1)}$ denote the chosen parameters after iteration ${k-1}$ of the gradient descent when trained on $\nu^{*}$ and $\nu_{c}$ . Let ${\nu\in U\cap\{\nu:|\int\nabla_{\theta^{(k-1)}}d\nu-\int\nabla_{\theta^{(k-1)}}d\nu^{*}|<\delta_{k}\}}$ . Then by the triangle inequality:

[TABLE]

so the conditional total variation between $p_{\theta_{*}^{(k)}}(y|x)$ and $p_{\theta^{(k)}}(y|x)$ is less than or equal to ${\frac{L|\mathcal{Y}|(\delta_{k-1}+\alpha_{k}\delta_{k})}{2}}$ which is equal to $\epsilon$ .

For the final statement, note that all of the above open sets in the $\mathcal{W}$ -topology used in this proof remain open sets in the $\tau$ -topology when we relax the conditions of $\mathcal{L}$ . This completes the proof. ∎
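The continuity claimed in Theorem 2 can be illustrated numerically. The following is a minimal sketch, not the authors' experiment: it assumes a softmax-regression model (a hypothetical stand-in for $p_{\theta}(y|x)$ ), a fixed initialization, a fixed number of gradient steps, and synthetic data. It moves one training point by a shrinking distance and measures the resulting conditional total variation between the two trained models.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(X, y, n_classes, steps=50, lr=0.1):
    """Gradient descent on cross-entropy with fixed theta^(0) and fixed step count."""
    theta = np.zeros((X.shape[1], n_classes))  # fixed initialization
    Y = np.eye(n_classes)[y]                   # one-hot labels
    for _ in range(steps):
        grad = X.T @ (softmax(X @ theta) - Y) / len(X)
        theta -= lr * grad
    return theta

def cond_tv(theta_a, theta_b, X_eval):
    """Average over x of 0.5 * sum_y |p_a(y|x) - p_b(y|x)|."""
    pa, pb = softmax(X_eval @ theta_a), softmax(X_eval @ theta_b)
    return 0.5 * np.abs(pa - pb).sum(axis=1).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic features (assumption)
y = rng.integers(0, 4, size=200)       # synthetic labels (assumption)
X_eval = rng.normal(size=(500, 3))     # points used to average the TV
theta_star = train(X, y, 4)

tvs = []
for eps in (1e-1, 1e-2, 1e-3):
    X_pert = X.copy()
    X_pert[0] += eps                   # move one training point by eps
    tvs.append(cond_tv(theta_star, train(X_pert, y, 4), X_eval))
```

Under these assumptions the entries of `tvs` shrink as the perturbation shrinks, which is the behavior the theorem describes for small moves of the training measure.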

IV-E Bounding $\bar{\delta}(\hat{\mathbb{P}})$ - The Asymptotic Case

We now wish to bound the conditional total variation of an estimated model against the true model when we use such a general learning algorithm in our setting. We will re-label $\bar{\delta}(\hat{\mathbb{P}})$ to $\bar{\delta}(\mathbb{P}_{f})$ to emphasize that our estimated model is coming from such an algorithm. We then have the following asymptotic theorem on the rate of decay for $\bar{\delta}(\mathbb{P}_{f})$ . This will apply whenever we have continuity from the $\tau$ -topology in our algorithm, and will be used in our non-asymptotic specialization that follows. We will use two final lemmas in both of those proofs.

Lemma 5.

Let $(\Omega,\mathcal{F},\mu)$ be a probability space and let ${h:\Omega\to\mathbb{R}}$ be bounded and measurable. Let $\mathcal{G}$ denote the set of non-negative measurable functions with expectation $1$ . Then ${\underset{g\in\mathcal{G}}{\inf}\,\mathbb{E}\left[g\cdot(h+\log g)\right]=-\log\mathbb{E}\left[e^{-h(\omega)}\right]}$ .

Lemma 6.

Let $(\Omega,\mathcal{F},\mu)$ be a probability space and let ${f:\Omega\to\mathbb{R}}$ be bounded and measurable with ${Range(f)\subseteq[0,1]}$ . Then ${log~{}\left(\mathbb{E}\left[e^{-2f^{2}}\right]\right)\leq-2\mathbb{E}\left[f\right]^{2}}$ .

Theorem 3.

Let ${\epsilon\in(0,1)}$ , and let ${0<\zeta<1}$ . Let $\mathcal{F}$ be a continuous learning algorithm from $(M_{1},\tau)$ to ${L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})}$ such that, for any $\nu\in M_{1}$ , the total variation between $\mathcal{F}\nu$ and $\nu_{y|x}$ is smaller than the total variation between $\mathcal{F}\nu$ and $p_{y|x}$ at any point in the support of $\nu$ . Suppose further that the ‘training’ total variation, ${\mathbb{E}_{\nu}\left[\frac{1}{2}\sum_{y}|\nu_{y|x}-\mathcal{F}\nu|\right]}$ , is bounded above by $\zeta$ . Then:

[TABLE]

where ${\mathbb{P}^{m}}$ is the probability measure on $M_{1}$ induced by the sampling of $m$ data-points on $\mathcal{X}\times\mathcal{Y}$ .

Proof.

For notational convenience, we will denote as $\delta_{\nu}(x)$ the conditional total variation between $p(y|x)$ and $(\mathcal{F}\nu)(y|x)$ for a fixed $x$ .

We will first need to show that the map ${\bar{\delta}:M_{1}\to\mathbb{R}}$ , given by ${\nu\mapsto\mathbb{E}_{\mathbb{P}_{X}}\left[\delta_{\nu}\right]}$ is continuous from the $\tau$ -topology to the Euclidean topology. This is trivial since $\mathbb{E}_{\mathbb{P}_{X}}\left[\delta_{\nu}\right]$ is just the composition of $\mathcal{F}$ , which was assumed continuous, with the fixed-point distance function $d(\cdot,p_{y|x}(y|x))$ defined over ${L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})}$ .

Now, let ${\Gamma=\{\nu\in M_{1}:\mathbb{E}_{\mathbb{P}_{X}}\left[\delta_{\nu}\right]\geq\epsilon\}}$ . By the above continuity and by the fact that $[\epsilon,1]$ is closed in $\mathbb{R}$ , we have that $\Gamma$ is closed. Then, by Sanov’s Theorem [13]:

[TABLE]

We thus wish to lower bound $\mathcal{D}_{KL}(\nu||p(x,y))$ over $\Gamma$ . We begin by decomposing $\frac{d\nu}{d\mathbb{P}}$ into ${\frac{d\nu_{x}}{d\mathbb{P}_{x}}\frac{\nu_{y|x}}{p_{y|x}}}$ , where $\nu_{x}$ and $\mathbb{P}_{x}$ are the marginal distributions of $\nu$ and $p(x,y)$ on $\mathcal{X}$ . We are guaranteed that the functions $p_{y|x}$ and $\nu_{y|x}$ exist on the support of $\nu_{x}$ since $y$ is discrete. The KL-divergence then becomes: ${\mathcal{D}_{KL}(\nu||p(x,y))=\mathbb{E}_{\mathbb{P}_{X}}\left[\frac{d\nu_{x}}{d\mathbb{P}_{x}}(\tilde{h}+\log\frac{d\nu_{x}}{d\mathbb{P}_{x}})\right]}$ where ${\tilde{h}\triangleq\sum_{y}\nu_{y|x}\log\frac{\nu_{y|x}}{p_{y|x}}}$ is bounded below (via Pinsker’s inequality) by the function $2\left(\sum_{y}|p_{y|x}-\nu_{y|x}|\right)^{2}$ , which itself is bounded below by $2\left(\sum_{y}|p_{y|x}-\mathcal{F}\nu|-\sum_{y}|\nu_{y|x}-\mathcal{F}\nu|\right)^{2}$ because the absolute value of the second term in this expression is smaller than that of the first term for each point in the support of $\nu$ . The first term is just the function $\delta_{\nu}$ defined at the start of this proof. We will call the second term ${\delta_{\nu}^{t}}$ . We can lower bound this expression one more time with ${2\delta_{\nu}^{2}-4\delta_{\nu}^{t}}$ . We are left with:

[TABLE]

We will bound these two remaining terms separately. The second is taken care of in this theorem’s hypothesis, being bounded below by $-4\zeta$ . For the first, we can combine Lemmas 5 and 6 to obtain a lower bound of $2\epsilon^{2}$ (since $\nu\in\Gamma$ ).

Since neither of these two bounds depends on $\nu$ , negating their sum yields the result. ∎
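Two elementary steps used in the proof above can be checked numerically. The sketch below is illustrative only: it uses the standard statement of Pinsker's inequality with total variation written as $\frac{1}{2}\sum_{y}|q_{y}-p_{y}|$ (a normalization chosen here, as an assumption, to match the conditional total variation $\delta_{\nu}$ ), and it checks the algebraic relaxation ${2(\delta-\delta^{t})^{2}\geq 2\delta^{2}-4\delta^{t}}$ on a grid of values in $[0,1]$ .

```python
import numpy as np

def kl(q, p):
    """KL divergence D(q || p) in nats for strictly positive vectors."""
    return float(np.sum(q * np.log(q / p)))

def tv(q, p):
    """Total variation distance, 0.5 * L1 distance."""
    return 0.5 * float(np.abs(q - p).sum())

rng = np.random.default_rng(1)

# Pinsker: D(q || p) >= 2 * TV(q, p)^2 for random pairs of distributions.
pinsker_gap = np.inf
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    pinsker_gap = min(pinsker_gap, kl(q, p) - 2 * tv(q, p) ** 2)

# Relaxation: 2 (d - t)^2 >= 2 d^2 - 4 t whenever d and t lie in [0, 1].
grid = np.linspace(0.0, 1.0, 101)
d, t = np.meshgrid(grid, grid)
relax_gap = float((2 * (d - t) ** 2 - (2 * d ** 2 - 4 * t)).min())
```

Both `pinsker_gap` and `relax_gap` stay non-negative, the second because ${-4\delta\delta^{t}\geq-4\delta^{t}}$ when ${\delta\leq 1}$ .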

IV-F Bounding $\bar{\delta}(\hat{\mathbb{P}})$ - The Non-Asymptotic Case

The previous theorem gives us:

[TABLE]

where $o(m)$ refers to any terms such that ${\underset{m\to\infty}{\text{lim }}{\frac{o(m)}{m}}=0}$ . We will need to study $o(m)$ since it’s somewhat of an unknown here, and may be large for small $m$ . The next theorem, which is non-asymptotic, will take care of this when $\mathcal{F}$ is continuous from the weak topology.
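To get a feel for the scale of this bound, the following sketch inverts it for a required dataset size. It assumes the exponential rate ${2\epsilon^{2}-4\zeta}$ read off from the proof of Theorem 3, and it simply drops the unknown $o(m)$ term, so the numbers are only indicative. The function names are hypothetical.

```python
import math

def asymptotic_tail(m, eps, zeta=0.0):
    """exp(-m * (2 eps^2 - 4 zeta)); the o(m) term is ignored (assumption)."""
    rate = 2 * eps ** 2 - 4 * zeta
    if rate <= 0:
        raise ValueError("bound is vacuous unless 2*eps**2 > 4*zeta")
    return math.exp(-m * rate)

def samples_needed(eps, nu, zeta=0.0):
    """Smallest m for which the idealized tail above is at most nu."""
    rate = 2 * eps ** 2 - 4 * zeta
    if rate <= 0:
        raise ValueError("bound is vacuous unless 2*eps**2 > 4*zeta")
    return math.ceil(math.log(1.0 / nu) / rate)

# Example: total variation at most 0.05 with failure probability 0.05.
m_required = samples_needed(eps=0.05, nu=0.05)
```

With ${\zeta=0}$ the required size scales as ${\log(1/\nu)/(2\epsilon^{2})}$ , with no model complexity term, which is the qualitative point of the section.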

Theorem 4.

Take all assumptions from Theorem 3, but remove the assumption that $\mathcal{F}$ be a continuous map from $(M_{1},\tau)$ to ${L^{1}(\mathcal{X},\Sigma_{|\mathcal{Y}|})}$ and assume it is instead continuous linear from ${(M_{1},\mathcal{W})}$ . Suppose further that $\mathcal{X}$ is compact, and that $p(x)$ has full support with density ${p(x,y)>0}$ everywhere. Then there exists a function ${k(m^{\prime}):\mathbb{Z}^{+}\to\mathbb{R}}$ with ${k(m^{\prime})\leq\sqrt{m^{\prime}}}$ such that:

[TABLE]

(A more detailed description of $k(m^{\prime})$ , from which we can discover more of its properties, is contained in the proof).

Proof.

Let the notations $\delta_{\nu}$ and $\Gamma$ be defined as they were in the proof of Theorem 3.

Let $E(S_{m^{\prime}},k(m^{\prime}))$ denote a family of conditions, indexed first by samples of $m^{\prime}$ points of $\mathcal{X}$ and second by functions ${\mathbb{Z}^{+}\to\mathbb{R}}$ , each requiring that ${|\mathbb{E}_{p(x)}\left[\delta_{\nu}\right]-\mathbb{E}_{S_{m^{\prime}}}\left[\delta_{\nu}\right]|\leq\frac{k(m^{\prime})}{\sqrt{m^{\prime}}}}$ , where the second expectation is the Monte Carlo estimate over the indexed sample.

Let the sets $\Gamma(S_{m^{\prime}},i)$ , indexed first over samples of $\mathcal{X}$ consisting of $m^{\prime}$ points and second over the set ${\{1,2,\cdots,2^{m^{\prime}|\mathcal{Y}|}\}}$ , be given by ${\Gamma(S_{m^{\prime}},i)=\{h:\mathbb{E}_{p(x)}\left[\delta_{h}\right]\geq\epsilon,\;\mathcal{F}h(y|x_{j})\geq/\leq p_{y|x}(y|x_{j})\}}$ (where the $x_{j}$ run over the sampled points in ${S_{m^{\prime}}}$ and $i$ runs over the possible choices of ${\geq/\leq}$ ). Let ${F(S_{m^{\prime}},i,k(m^{\prime}))}$ denote the family of conditions ${\{\nu:\mathbb{E}_{S_{m^{\prime}}}\left[\delta_{\nu}\right]\geq\epsilon-\frac{k(m^{\prime})}{\sqrt{m^{\prime}}},\;\mathcal{F}\nu(y|x_{j})\geq/\leq p_{y|x}(y|x_{j})\}}$ where the $x_{j}$ run over the sampled points and the choices of $\geq$ and $\leq$ correspond to those of $\Gamma(S_{m^{\prime}},i)$ . Let $G(S_{m^{\prime}},i)$ denote the condition on measures ${\mu\in M_{1}}$ such that there exists a measure ${\mu^{\prime}\in\Gamma(S_{m^{\prime}},i)}$ with ${\mu^{\prime}_{y|x}=\mu_{y|x}}$ . Note that ${E(S_{m^{\prime}},k(m^{\prime}))\cap G(S_{m^{\prime}},i)\subseteq F(S_{m^{\prime}},i,k(m^{\prime}))}$ .

Let $M$ denote the vector space of finite signed measures on ${\mathcal{X}\times\mathcal{Y}}$ endowed with the weak topology. For any probability measure ${\nu^{\prime}_{x}\in M_{1}(\mathcal{X})}$ , let $R^{\nu^{\prime}_{x}}$ be the subspace of measures with marginal distribution $\nu^{\prime}_{x}$ . Let ${R_{1}^{\nu^{\prime}_{x}}}$ be the subset of $R^{\nu^{\prime}_{x}}$ consisting of probability measures. Define a linear map on ${R_{1}^{\nu^{\prime}_{x}}}$ , denoted $\mathcal{C}_{\nu^{\prime}_{x}}$ , which takes $\nu^{\prime}$ to its disintegration $\nu^{\prime}_{y|x}$ .

Let $f_{\nu^{\prime}_{x}}$ denote the family of real-valued functions on ${M_{1}\times\mathcal{C}_{\nu^{\prime}_{x}}R_{1}^{\nu^{\prime}_{x}}}$ (indexed by $M_{1}(\mathcal{X})$ ) taking ${(\nu,\nu_{y|x}^{\prime})}$ to the value ${\mathbb{E}_{\nu_{x}}\left[\sum_{y}\nu_{y|x}\log\frac{\nu^{\prime}_{y|x}}{p_{y|x}}+\log\frac{d\nu_{x}}{d\mathbb{P}_{X}}\right]}$ , which is to be taken as infinite when the support of $\nu_{x}^{\prime}$ is not a superset of the support of $\nu_{x}$ , and is further infinite when $\nu_{x}$ is not absolutely continuous with respect to $p(x)$ . Note that each ${f_{\nu^{\prime}_{x}}(\cdot,a)}$ is convex and continuous in the weak topology for each fixed $a$ (as ${p(x)>0}$ and ${p_{y|x}>0}$ everywhere by the theorem’s hypothesis), and each ${f_{\nu^{\prime}_{x}}(b,\cdot)}$ is concave and continuous for each fixed $b$ .

Now, since ${\mathcal{X}\times\mathcal{Y}}$ is compact, $M_{1}$ is compact in the weak topology. Then for any $\nu^{\prime}_{x}$ , $R_{1}^{\nu^{\prime}_{x}}$ is compact (being a closed subset of a compact space). Then ${\mathcal{C}_{\nu^{\prime}_{x}}R_{1}^{\nu^{\prime}_{x}}}$ is compact and convex. We also have that the subsets ${G(S_{m^{\prime}},i)}$ , ${E(S_{m^{\prime}},k(m^{\prime}))}$ , and ${F(S_{m^{\prime}},i,k(m^{\prime}))}$ are all closed, and therefore compact. We also have convexity in ${F(S_{m^{\prime}},i,k(m^{\prime}))}$ , but not in the other two.

Arbitrarily pick some ${\nu^{\prime\prime}_{x}\in M_{1}}$ with full support and denote $f_{\nu^{\prime\prime}_{x}}$ as $f$ . Let ${r(S_{m^{\prime}},i,k(m^{\prime}))}$ denote the minimum of the expression ${f(a,a_{y|x})}$ over ${K(S_{m^{\prime}},i)\cap E(k(m^{\prime}))\cap F(S_{m^{\prime}},i,k(m^{\prime}))}$ and denote the minimizer as ${a(S_{m^{\prime}},i,k(m^{\prime}))}$ . The image of the map ${f(\cdot,a(S_{m^{\prime}},i,k(m^{\prime})))}$ is a compact subset of $\mathbb{R}$ , i.e. a closed and bounded interval ${\mathcal{I}(S_{m^{\prime}},i,k(m^{\prime}))}$ . Let ${\tilde{\mathcal{I}}(S_{m^{\prime}},k(m^{\prime}))}$ denote the union of these intervals over the finite indices $i$ . Cover this union with a family of subintervals ${\tilde{\mathcal{I}}(S_{m^{\prime}},k(m^{\prime}),j)}$ of size $\frac{k(m^{\prime})}{\sqrt{m^{\prime}}}$ .

We will now fix $k(m^{\prime})$ to be the smallest number such that there exists a sample $S^{*}_{m^{\prime}}$ in which both ${G(S^{*}_{m^{\prime}},i)\cap E(S^{*}_{m^{\prime}},k(m^{\prime}))\neq\emptyset}$ for all $i$ in which ${G(S^{*}_{m^{\prime}},i)\neq\emptyset}$ and ${\mathcal{I}(S^{*}_{m^{\prime}},k(m^{\prime}),j)\cap E(S^{*}_{m^{\prime}},k(m^{\prime}))\neq\emptyset}$ for all $j$ in which ${\tilde{\mathcal{I}}(S^{*}_{m^{\prime}},k(m^{\prime}),j)\neq\emptyset}$ . Such a $k(m^{\prime})$ exists, and is less than or equal to $\sqrt{m^{\prime}}$ since $E(S_{m^{\prime}},\sqrt{m^{\prime}})$ is all of $M_{1}$ . Fix $S_{m^{\prime}}$ to any of the samples that we just established the existence of. We will drop the notations $S_{m^{\prime}}$ and $k(m^{\prime})$ from the notation for any conditions referring to them from now on.

Now, denote as $C_{b}(\mathcal{X})$ the set of bounded continuous functions from $\mathcal{X}$ to $\mathbb{R}$ and construct a family of maps ${\mathcal{G}_{\lambda,\nu^{\prime}}:M_{1}\to\mathbb{R}}$ indexed over ${\lambda\in C_{b}(\mathcal{X})}$ and ${\nu^{\prime}\in M_{1}}$ which takes ${\nu\in M_{1}}$ to ${\mathbb{E}_{\nu}\left[mlog\frac{\nu^{\prime}_{y|x}}{p_{y|x}}+m\lambda\right]}$ . Then for any empirical ${L_{m}\in\Gamma(i)}$ corresponding to a sample of $m$ points, we have that ${\mathcal{G}_{\lambda,\nu^{\prime}}L_{m}\geq\underset{\nu\in\Gamma(i)}{\inf}\mathcal{G}_{\lambda,\nu^{\prime}}\nu}$ for all $\lambda,\nu^{\prime}$ . Thus the probability that $L_{m}$ is in $\Gamma(i)$ is bounded above by the probability that ${\mathcal{G}_{\lambda,\nu^{\prime}}L_{m}-\underset{\nu\in\Gamma(i)}{\inf}\mathcal{G}_{\lambda,\nu^{\prime}}\nu\geq 0}$ . Then by Chernoff’s inequality, we have that ${\frac{1}{m}log~{}\mathbb{P}^{m}\left({L_{m}\in\Gamma(i)}\right)}$ is bounded above by:

[TABLE]

where the first expectation is taken over $\mathbb{P}^{m}$ .

The first term can be reduced to ${log~{}\mathbb{E}_{p(x)}\left[e^{\lambda}\right]}$ . Optimizing over $\lambda$ yields a bound of

[TABLE]

We will denote as $\Gamma^{i}_{y|x}$ the set of conditional probability functions ${\nu_{y|x}}$ such that there exists $\nu\in\Gamma(i)$ with disintegration given by $\nu_{y|x}$ . We will also denote a function $g_{\nu^{\prime}}(\nu_{y|x},\mu_{x})$ defined on ${\Gamma^{i}_{y|x}\times M_{1}(\mathcal{X})}$ which yields ${\mathbb{E}_{\mu_{x}\nu_{y|x}}\left[log\frac{\nu^{\prime}_{y|x}}{p_{y|x}}\right]}$ when the support of the latter argument is equal to the domain of the former, and is infinite otherwise. Note that $g$ is convex and lower-semicontinuous in $\mu_{x}$ for fixed $\nu_{y|x}$ since it is linear in the convex subset ${\{\mu_{x}\in M_{1}(\mathcal{X}):supp(\mu_{x})=Dom(\nu_{y|x})\}}$ and infinite outside of this subset. Finally, we will define the function ${h:M_{1}(\mathcal{X})\times C_{b}(\mathcal{X})\to\mathbb{R}}$ given by ${h(\mu_{x},\lambda)=\mathbb{E}_{\mu_{x}}\left[\lambda\right]-log(\mathbb{E}_{p(x)}\left[e^{\lambda}\right])}$ . This function is concave in $\lambda$ , convex in $\mu_{x}$ , and lower semicontinuous in $\mu_{x}$ [13]. Then (39) is upper bounded by:

[TABLE]

Note also that the objective function of this expression is decoupled for $\nu_{y|x}$ and $\lambda$ . We can thus swap the supremum with the first infimum. But then inside the first infimum, we are left with an objective function in which a minimax theorem applies [15] because $M_{1}(\mathcal{X})$ is compact and convex in the weak topology when $\mathcal{X}$ is compact, and so we can swap the supremum with the second infimum as well. Since the first term does not depend on $\lambda$ , we can then consider for each fixed $\mu_{x}$ the expression ${\underset{\lambda}{\sup}\,h(\mu_{x},\lambda)}$ . But the supremum of this function over ${\lambda\in C_{b}(\mathcal{X})}$ is none other than the $KL$ divergence between $\mu_{x}$ and $p(x)$ [16]. We are thus left with a full upper bound of (now optimizing over ${\nu_{y|x}^{\prime}\in\mathcal{C}_{\nu^{\prime\prime}_{x}}R_{1}^{\nu^{\prime\prime}_{x}}}$ ):

[TABLE]

We would be able to swap the supremum and infimum if our feasible set were convex and compact. This is true for our search space over $\nu^{\prime}$ , but not for $G(i)$ . Our goal is then to transform $G(i)$ into $F(i)$ , which is convex, with corresponding error terms included. This can be done by tightening $G(i)$ to ${G(i)\cap E}$ and then relaxing that set to $F(i)$ . This will incur some error, but if we end up choosing ${\nu_{y|x}^{\prime}}$ to be the disintegration of $a(i)$ , then this error will be bounded by $\frac{k(m^{\prime})}{\sqrt{m^{\prime}}}$ .

With our feasible set now being $F(i)$ , we can swap the supremum and infimum, and then pick $\nu^{\prime}_{y|x}$ to be equal to $\nu_{y|x}$ on the support of $\nu$ , and arbitrary elsewhere. The objective function is then just the minimum $KL$ divergence over $F(i)$ , which we know how to deal with due to the proof of Theorem 3. Minimizing then gives us ${\nu_{y|x}=\nu^{\prime}_{y|x}}$ both given by the disintegration of $a(i)$ , and with the objective function bounded by ${\underset{\nu\in F(i)}{\inf}2\mathbb{E}_{p(x)}\left[\delta_{\nu}\right]^{2}-4\zeta}$ . If we again add the constraint $E$ to the feasible region (with another error of at most $\frac{k(m^{\prime})}{\sqrt{m^{\prime}}}$ added on), then this is bounded above by ${2(\epsilon-2\frac{k(m^{\prime})}{\sqrt{m^{\prime}}})^{2}}$ . Union bounding over $i$ yields the result. ∎

IV-G Some Insights

We have established that, with probability at least $(1-\nu)$ , the following holds:

[TABLE]

where ${\delta^{\prime}=\frac{k(m^{\prime})}{\sqrt{m^{\prime}}}}$ and we can usually take ${\zeta\approx 0}$ (as we can make this arbitrarily small with a large enough network, due to [17] and Lemma 4 if we train on cross-entropy errors). By Theorem 4, $k(m^{\prime})$ is at most $\sqrt{m^{\prime}}$ , but it is generally going to be much smaller, since it depends on a statement that only requires the existence of functions satisfying an empirical deviation bound. This is in contrast to classical statistical learning theory bounds, which instead require statements of the same sort to hold for all functions. Furthermore, $k(m^{\prime})$ is not strictly increasing with model complexity. On the contrary, $k(m^{\prime})$ can decrease as the hypothesis space grows (given that we maintain ${\mathcal{W}}$ continuity), since having more functions will increase the probability of such existences. By Theorem 3, we can also assume that ${\frac{k(m^{\prime})}{m^{\prime}}\to 0}$ as ${m^{\prime}\to\infty}$ . These intuitions tell us that the decomposition in Theorem 1 has successfully extracted a good amount of the problem’s complexity into the term $I(X;Z)$ . The primary complexity term in ${\bar{\delta}(\mathbb{P}_{f})}$ , given a sufficiently complex hypothesis space, arises from the complexity of the class variable itself.

V Experiments

V-A How These Bounds Solve Experimental Discrepancy

We argue that the bounds presented in this paper explain the experimental discrepancy that we’ve alluded to a few times in this paper. These tightened, less sensitive bounds imply that, in many cases, it is simply not optimal in terms of information losses to compress a neural network’s input. This can be seen visually in Figure 3. Here we have set up a toy classification problem with ${H(Y)=log_{2}(10)}$ , ${H(X)=21}$ , and ${I(Y;Z^{*})=H(Y)\left(1-e^{-\frac{I(X;Z^{*})}{2}}\right)}$ . The information quantities in this toy example are thus similar to MNIST [18]. We have plotted $I(Y;Z^{*})$ along with the bounds of this paper (assuming ${\zeta\approx 0,k(m^{\prime})\approx 0}$ ) for ${m=10,000}$ , $5,000$ , and $2,000$ data points. We see that very little to nothing can be gained by compression in the $m=10,000$ and $m=5,000$ cases. Serious gains can only be obtained in the ${m=2,000}$ case. On the right side of this figure, we plot the old bounds, which predict a peak at around $5$ bits even for ${10,000}$ data points. Thus the lack of compression found experimentally on smaller datasets is explained by our new bounds, but not by the old ones.
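The toy problem can be explored with a few lines of code. The sketch below reproduces the stated curve ${I(Y;Z^{*})=H(Y)(1-e^{-I(X;Z^{*})/2})}$ and then subtracts a hypothetical penalty of the product form `delta_bar * I(X;Z)`. That penalty is an assumption made here for illustration: it mimics the product structure of the paper's decomposition but is not the exact bound plotted in Figure 3, and units are treated loosely.

```python
import math

H_Y = math.log2(10)   # entropy of the class variable, in bits
H_X = 21.0            # entropy of the feature space, in bits

def relevant_info(i_xz):
    """Toy curve I(Y;Z*) as a function of I(X;Z*)."""
    return H_Y * (1.0 - math.exp(-i_xz / 2.0))

def penalized(i_xz, delta_bar):
    """Toy curve minus a hypothetical product-form loss delta_bar * I(X;Z)."""
    return relevant_info(i_xz) - delta_bar * i_xz

def best_compression(delta_bar, grid=2101):
    """Grid search for the I(X;Z) in [0, H_X] that maximizes the penalized curve."""
    points = [H_X * j / (grid - 1) for j in range(grid)]
    return max(points, key=lambda i: penalized(i, delta_bar))

no_loss_peak = best_compression(0.0)      # no sampling loss: keep all of X
small_loss_peak = best_compression(0.01)  # small total variation
large_loss_peak = best_compression(0.2)   # large total variation
```

Under this assumed penalty the maximizer sits at ${2\ln\left(H(Y)/(2\bar{\delta})\right)}$ when that value is below $H(X)$ , so the optimal amount of compression grows only as the total variation term grows, which matches the qualitative behavior described above.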

But if the entropy of the feature space becomes large, as we’ve made it for the third plot in this figure, compression becomes important even with our new bounds. This helps to explain why neural networks seem to yield compression on ‘harder’ datasets, but do not on ‘easier’ ones.

V-B Tightness of Bounds

For these experiments, we have used the MINE-f [19] estimator of mutual information for $I(X;Z)$ quantities. We assume that ${\hat{I}(Y;\hat{Z})}$ is equal to $H(Y)$ , and estimate ${I(Y;\hat{Z})}$ via validation error probability and Fano’s inequality. To make the classifier representation stochastic, we used permanent dropout with a rate of $0.7$ . All classifiers are trained for $10,000$ epochs, and all information estimations are performed for $2000$ epochs. All neural networks are trained with the Adam optimizer. All models used a learning rate of ${5\times 10^{-4}}$ .
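As a rough illustration of the Fano step, the sketch below turns a validation error probability into a lower bound on ${I(Y;\hat{Z})}$ using the standard form of Fano's inequality, ${H(Y|\hat{Z})\leq H_{b}(P_{e})+P_{e}\log_{2}(|\mathcal{Y}|-1)}$ . The function names and the default of balanced classes are assumptions, and the exact estimator used in the experiments may differ.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_information_lower_bound(error_rate, n_classes, h_y=None):
    """Lower bound on I(Y; Z_hat) in bits from a classification error rate."""
    if h_y is None:
        h_y = math.log2(n_classes)  # assumes balanced classes
    # Fano: H(Y | Z_hat) <= H_b(P_e) + P_e * log2(|Y| - 1)
    h_cond_upper = binary_entropy(error_rate) + error_rate * math.log2(n_classes - 1)
    return max(0.0, h_y - h_cond_upper)

# Example: a ten-class problem at a few validation error rates.
bounds = {pe: fano_information_lower_bound(pe, 10) for pe in (0.0, 0.02, 0.1, 0.9)}
```

At zero error the bound equals $H(Y)$ , and at the chance error rate of $0.9$ for ten balanced classes it collapses to zero.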

We first tested the non-asymptotic bound of Theorem 4 on four of the datasets provided by OpenML [20] across several training data sizes (dependent on the overall size of the dataset in question). Our classifier consisted of a neural network with a single hidden layer of $1000$ units. The results are plotted in Figure 4. We took a confidence interval ${\nu=0.5}$ for the plot of the bound, and plotted the mean value of ten experiments for the ‘true’ ${50\%}$ confidence interval (assuming a symmetric distribution). We estimated $k(m^{\prime})$ via $k_{c}m^{\prime r}$ with ${r<\frac{1}{2}}$ . In each case, we estimated $k_{c}$ and $r$ in sample for the smallest tested training data size. This, of course, only gives us a ‘functional behavior’ experiment, but we do see that this behavior is consistent with the true values.

We then tested the bound of Theorem 1 for MNIST and CIFAR-10 using the true value of ${\bar{\delta}(\mathbb{P}_{f})}$ in each case. The results are shown in Figure 5. Each dataset is experimented on with a classifier given by a fully connected neural network with a single hidden layer, with varying hidden layer sizes. These variations are meant to show that the bound holds up across differing architectures. The bound is quite close to the true confidence interval in each case.

VI Conclusion

This paper presented new bounds on information losses from finite data. This began in the form of a relationship between these losses, the expected total variation of the neural model, and the information held in the hidden representation of the feature space. Then, by bounding the total variation term without invoking any more dependence on model complexity, we obtained bounds that are much tighter and less sensitive to $I(X;Z)$ than previous theory. The paper provided applications of this theoretical framework, focusing primarily on relevant contradictory experimental work that previously went unexplained. It concluded with experiments showing that the bound presented in this paper corresponds well to experiment.

-A Proof of Lemma 1

Proof.

[TABLE]

∎

-B Proof of Lemma 2

Proof.

We first check that the defined variables $J,U,V$ and $W$ have valid distributions. For $J$ to be valid, we need only check that $\rho<1$ . Indeed by replacing the min operation in $m_{l}(x,y)$ with $p_{Y|X}(y|x)$ , we have

[TABLE]

The variable $U$ is similarly valid as can be seen as follows:

[TABLE]

And the variables $V$ and $W$ follow similarly with ${\int d\mathbb{P}_{V}=\frac{1}{1-\rho}\left(\int d\mathbb{P}_{XY}-\rho\right)=1}$ , and ${\int d\mathbb{P}_{W}=\frac{1}{1-\rho}\left(\int d\hat{\mathbb{P}}_{XY}-\rho\right)=1}$ .

We then need to show that the marginals of the coupling satisfy ${\gamma_{\tilde{X},\tilde{Y},\tilde{Z}}=\mathbb{P}_{XYZ}}$ and ${\gamma_{\hat{X},\hat{Y},\hat{Z}}=\hat{\mathbb{P}}_{XYZ}}$ . To begin, we first show that ${\gamma_{\tilde{X},\tilde{Y}}(x,y)=p_{X,Y}(x,y)}$ and that ${\gamma_{\hat{X},\hat{Y}}(x,y)=\hat{p}_{X,Y}(x,y)}$ as follows:

[TABLE]

Finally, since we defined $\tilde{Z}$ and $\hat{Z}$ through the distributions ${\gamma_{\tilde{Z}|\tilde{X}}(z|x)=\gamma_{\hat{Z}|\hat{X}}(z|x)=p(z|x)}$ , we have

[TABLE]

∎

-C Proof of Lemma 3

Proof.

To prove the first equality, define the following subsets of $\mathcal{Y}$ .

[TABLE]

Then for any coupling of these two models,

[TABLE]

It follows that:

[TABLE]

But we also have for this particular coupling, that ${\mathbb{P}(\tilde{Y}=\hat{Y}|\tilde{X}=\hat{X})\geq P_{J}(1)=\rho}$ . Thus we must have equality.

To prove the second equality, we will use the fact that $\text{min}\{a,b\}=\frac{a+b-|a-b|}{2}$ . Then

[TABLE]

Thus ${\rho=1-\mathbb{E}_{\mathbb{P}_{X}}\left[\frac{1}{2}\sum_{y}|p(y|x)-\hat{p}(y|x)|\right]}$ ∎
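The identity used in this proof is easy to confirm numerically. The sketch below draws random conditional distributions (synthetic, for illustration only) and checks that the overlap ${\sum_{y}\min\{p(y|x),\hat{p}(y|x)\}}$ equals one minus the conditional total variation, so that averaging over $x$ gives the stated expression for $\rho$ . Reading $\rho$ as the expected overlap is an assumption based on the surrounding lemmas.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_y = 6, 10

p = rng.dirichlet(np.ones(n_y), size=n_x)      # rows are p(y|x)
p_hat = rng.dirichlet(np.ones(n_y), size=n_x)  # rows are p_hat(y|x)
p_x = rng.dirichlet(np.ones(n_x))              # marginal over x

# min{a, b} = (a + b - |a - b|) / 2, summed over y for each x
overlap = np.minimum(p, p_hat).sum(axis=1)
tv = 0.5 * np.abs(p - p_hat).sum(axis=1)

rho_from_min = float(p_x @ overlap)
rho_from_tv = 1.0 - float(p_x @ tv)
```

The two values of $\rho$ agree up to floating point error, since each row of `p` and `p_hat` sums to one.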

-D Proof of Lemma 4

Proof.

[TABLE]

∎

-E Proof of Lemma 5

Proof.

This infimum can be found via the following Lagrangian: ${\mathcal{L}=\mathbb{E}\left[g\cdot(h+\log g)\right]+\lambda\left(\mathbb{E}\left[g\right]-1\right)}$ (we will see that we don’t need to worry about the $g(\omega)\geq 0$ constraints because the solution to the Lagrangian we just wrote will yield a function $g$ for which those constraints are not tight). The functional derivative of this Lagrangian is ${h(\omega)+\log g(\omega)+1+\lambda}$ . Setting this to zero yields ${g(\omega)=e^{-\lambda}e^{-(h(\omega)+1)}}$ . Setting $\lambda$ through normalization then yields ${g(\omega)=\frac{1}{W}e^{-(h(\omega)+1)}}$ where $W=\mathbb{E}\left[e^{-(h(\omega)+1)}\right]$ . Plugging this solution into our objective yields ${-1-\log W=-\log\mathbb{E}\left[e^{-(h(\omega)+1)}\right]-1}$ . Since ${W=e^{-1}\mathbb{E}\left[e^{-h(\omega)}\right]}$ , this equals ${-\log\mathbb{E}\left[e^{-h(\omega)}\right]}$ , as claimed. Since our objective function is a strictly convex functional with a positive second variation given by $\frac{1}{g(\omega)}$ , this is a minimizer. ∎
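Lemma 5 can be sanity-checked on a finite probability space. The sketch below is an illustration under assumed synthetic values of $\mu$ and $h$ : it evaluates the objective at the normalized minimizer ${g^{*}=e^{-h}/\mathbb{E}[e^{-h}]}$ , compares it to the closed form ${-\log\mathbb{E}[e^{-h}]}$ , and confirms that randomly drawn feasible $g$ never do better.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = rng.dirichlet(np.ones(8))          # probability weights on 8 outcomes
h = rng.uniform(-1.0, 1.0, size=8)      # a bounded function h

def objective(g):
    """E[g * (h + log g)] under mu, for strictly positive g."""
    return float(np.sum(mu * g * (h + np.log(g))))

# Candidate minimizer and the closed-form value of the infimum.
g_star = np.exp(-h) / np.sum(mu * np.exp(-h))
closed_form = float(-np.log(np.sum(mu * np.exp(-h))))

# Random feasible competitors: positive, rescaled so that E[g] = 1.
others = []
for _ in range(500):
    w = rng.uniform(0.1, 2.0, size=8)
    others.append(objective(w / np.sum(mu * w)))
```

The gap between any competitor and the closed form is the KL divergence between the two induced measures, which explains why it cannot be negative.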

-F Proof of Lemma 6

Proof.

This follows from reference [21] (Theorem 1) with ${\phi=-log(\cdot)}$ while replacing $h(x;\mu)$ with ${\phi^{\prime\prime}(x)/2=\frac{1}{2x^{2}}}$ . Denote ${Y=e^{-2f^{2}}}$ . The range of $Y$ is a subset of ${[e^{-2},1]}$ . On this set, the supremum of ${\phi^{\prime\prime}(x)/2}$ is $\frac{1}{2}$ . Thus ${log\left(\mathbb{E}\left[Y\right]\right)\leq\mathbb{E}\left[log(Y)\right]+\frac{1}{2}Var\left[Y\right]}$ . But ${Var\left[e^{-2f^{2}}\right]\leq 4Var[f^{2}]\leq 4Var[f]}$ (because $f$ has range bounded by $[0,1]$ ). We thus have ${log\left(\mathbb{E}\left[e^{-2f^{2}}\right]\right)\leq-2\mathbb{E}\left[f^{2}\right]+2Var\left[f\right]}$ . This completes the proof since ${Var\left[f\right]=\mathbb{E}[f^{2}]-\mathbb{E}[f]^{2}}$ . ∎

Bibliography

[1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
[2] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015, pp. 1–5.
[3] A. Achille and S. Soatto, “On the emergence of invariance and disentangling in deep representations,” arXiv preprint arXiv:1706.01350, 2017.
[4] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
[5] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=ry_WPG-A-
[6] D. Blackwell, “Conditional expectation and unbiased sequential estimation,” The Annals of Mathematical Statistics, pp. 105–110, 1947.
[7] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
[8] S. Verdu et al., “Generalizing the Fano inequality,” IEEE Transactions on Information Theory, vol. 40, no. 4, pp. 1247–1251, 1994.
