Information Losses in Neural Classifiers from Sampling
Brandon Foggo, Nanpeng Yu, Jie Shi, Yuanqi Gao

TL;DR
This paper investigates how finite training datasets cause information loss in neural classifiers, providing bounds that are less sensitive to input compression and align well with experimental observations.
Contribution
It establishes a relationship between information loss and total variation of neural models, deriving dataset size bounds that improve upon previous bounds without relying on model complexity.
Findings
Bounds on information loss are smaller and less sensitive to input compression.
The bounds align well with experimental results on neural network information compression.
Theoretical insights explain recent experimental observations of information compression.
Abstract
© 20XX IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This paper considers the subject of information losses arising from the finite datasets used in the training of neural classifiers. It proves a relationship between such losses as the product of the expected total variation of the estimated neural model with the information about the feature space contained in the hidden representation of that model. It then bounds this expected total variation as a function of the size of randomly sampled datasets in a fairly general setting, and without bringing in any additional dependence on model complexity. It ultimately obtains bounds on information losses that are less sensitive to input compression and in general much smaller than existing bounds. The paper then uses these bounds to explain some recent experimental findings of information compression in neural networks which cannot be explained by previous work. Finally, the paper shows that not only are these bounds much smaller than existing ones, but that they also correspond well with experiments.
I Introduction
An estimator is limited to the information that it has about the variable it’s estimating. But this information is limited to what the estimator has seen from the samples training it. The full information of a random variable cannot be transferred to an estimator by finite samples - some information is lost. This paper analyzes such losses for neural network classifiers. Analyzing these losses can lead to improved architecture designs and training data selection strategies, and provide explanations for empirical results in machine learning theory.
The study of these losses as a tool for deep learning theory arose from the attempts to understand neural network behavior through the concept of an information bottleneck [1, 2]. This theory was later investigated both analytically [3] and experimentally [4, 5]. Information losses are used primarily as an explanatory tool that supplements classical statistical learning theory (CSLT), which typically fails to explain the success of deep learning models (for example, deep networks tend to perform better when they have higher VC dimension, while CSLT would predict the opposite). We will further discuss the utility of these losses in section III, and we will refer to this newly arising field of deep learning theory as information theoretic deep learning theory (ITDLT).
But this theory is still somewhat incomplete. The reader will find that reference [5] above actually contradicts the others - giving experimental evidence against some of the claims established in the earlier works. In particular, ITDLT, as it previously stood, would claim that neural networks should always act as a lossy compressor of the input data - a claim which arises from bounds on information losses that are exponential in the information content of the final hidden layer of the network (while still being smaller than CSLT bounds for larger networks). But experiments show that this is only sometimes true. While compression does seem to always occur when using saturating activation functions, like sigmoid and tanh, compression in networks using linear and relu activation functions seems to be more nuanced.
But instead of abandoning ITDLT, we believe that the theory can be improved in such a way that it explains all of these experiments. Since most contrary evidence to the theory can be traced to those exponential bounds, we hypothesize that these bounds, while tighter than those of CSLT, are still not quite tight enough to account for every experiment. In this paper, we aim to derive bounds which are much tighter than the existing ones. This will make up the bulk of this paper, and can be found in section IV.
With these new bounds, we will be able to explain the experimental discrepancy found in the above literature, giving detail into why some situations yield neural network compression, even with relu activation functions, and others do not. For example, in the case of low entropy feature spaces, our bounds show that there is simply not enough information to lose such that compression is beneficial. We will illustrate this concept further in section V-A.
This will lead to a better understanding of the information relationships found in neural networks, and to a better understanding of neural networks in general. This better understanding will allow guided development of network architectures and other algorithms which are theoretically sound.
In one critical step to achieving these bounds (Theorem 1), we decompose information losses as a product of a term that mostly depends on network architecture and a term that mostly depends on the training dataset used to train that architecture. This decomposition can thus be applied to network architecture design and training data selection strategies independently. These aspects of applying this theory will be the subject of future work.
Finally, while these new bounds are much tighter than both CSLT bounds and the old ITDLT bounds, and while they are capable of explaining all experiments in the literature, we will see that they are considerably tighter than they needed to be to achieve our goals. This is shown experimentally in section V-B.
Section II will address the notation and assumptions that we will use throughout the paper. Section III will provide more detail on the background literature and motivation of this work. We conclude in section VI.
II Notation and Assumptions
Capital letters denote random variables. Lower case letters describe instances of the corresponding random variable. Figure 1 depicts the classification model used in this paper. A class variable $Y$ generates a feature vector $X$ according to a fixed (unknown) distribution $p_{X|Y}$. This feature vector is then fed through a learned distribution $p_{Z|X}$, which acts as a lossy compressor of $X$. This should be thought of as the hidden layers of a neural network. $Z$ is then used to form an estimator of $Y$, denoted $\hat{Y}$. We will drop the subscripts on probability distributions when the context is clear. The calligraphic symbols $\mathcal{X}$ and $\mathcal{Y}$ refer to the set of values that $X$ and $Y$ can take on. We assume that $\mathcal{X}$ is a Polish space such as $\mathbb{R}^n$ and that $\mathcal{Y}$ is a finite set with the discrete topology.
This model has three variables of interest, and which satisfy the Markov chain . We denote the true model as and consider the case of estimating the conditional probability distribution . We denote this estimate as and denote the estimated full model as . We will use the hat notation for all information theoretic quantities referring to the estimated model. For example:
[TABLE]
Finally, we assume that all distributions can be written as density functions such as . We will occasionally drop the variable-specifying subscript when the context is clear. We will assume that the support of is all of .
III Background
III-A The Information Bottleneck Principle
The use of the compressor $p_{Z|X}$ comes from the Information Bottleneck Problem [1], which attempts to find a variable $Z$ that is minimally sufficient for the input pair of variables $(X, Y)$. The minimal sufficiency of $Z$ refers to the following two properties. First, $X$ and $Y$ must be conditionally independent given $Z$, or, put in a more enlightening way, $I(Z;Y) = I(X;Y)$. And second, for any other sufficient statistic $Z'$, $I(Z;X) \le I(Z';X)$. Intuitively, a minimally sufficient statistic is the most efficient description of $X$ which retains all of the available information about the class variable $Y$. Further reasons that we wish to find a minimally sufficient statistic will become clear in the following sections.
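The two properties above can be checked numerically on a small discrete model. The following is a minimal sketch, assuming the usual notation of a class variable Y, features X, and representation Z forming the chain Y → X → Z; the toy distributions and all names (`mutual_information`, `p_x_given_y`, `p_z_given_x`) are hypothetical and chosen only for illustration.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (pa[a] * pb[b]))
    return mi

# Hypothetical toy model: 2 classes, 4 feature values, 2 representation values.
p_y = [0.5, 0.5]
p_x_given_y = [[0.4, 0.4, 0.1, 0.1],   # rows: y, columns: x
               [0.1, 0.1, 0.4, 0.4]]
p_z_given_x = [[1.0, 0.0],             # rows: x, columns: z (a deterministic compressor)
               [1.0, 0.0],
               [0.0, 1.0],
               [0.0, 1.0]]

joint_yx = [[p_y[y] * p_x_given_y[y][x] for x in range(4)] for y in range(2)]
p_x = [sum(joint_yx[y][x] for y in range(2)) for x in range(4)]
joint_xz = [[p_x[x] * p_z_given_x[x][z] for z in range(2)] for x in range(4)]
# Z depends on Y only through X, so p(y, z) = sum_x p(y, x) p(z | x).
joint_yz = [[sum(joint_yx[y][x] * p_z_given_x[x][z] for x in range(4))
             for z in range(2)] for y in range(2)]

i_xy = mutual_information(joint_yx)   # information the features carry about the class
i_xz = mutual_information(joint_xz)   # information the representation keeps about the features
i_yz = mutual_information(joint_yz)   # information the representation keeps about the class
```

In this toy model the compressor merges feature values that carry identical evidence about the class, so `i_yz` matches `i_xy` while `i_xz` is one bit, half of the two bits of entropy in the uniform feature distribution.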
III-B Information and Generalization
We now focus on the reason for caring about the first aspect of finding a minimally sufficient statistic. That is, on finding a variable such that , or, in a more relaxed form, at least ensuring finding one such that is relatively large. Pursuing this goal is backed by information theory as well as standard estimation theory. On the estimation theory side, this property just amounts to ensuring that be a sufficient statistic for and . It thus has importance in finding optimal estimators, for example, through the Rao-Blackwell theorem [6]. On the information theoretic side, if , then having an instance would completely determine the corresponding instance , and so there exists an estimator of that takes as input and has zero probability of error. This notion can be expanded to by Fano’s inequality and its generalizations [7] [8]. Fano’s inequality provides the following bound on estimation error for any estimator of defined as a function of :
[TABLE]
where is the error rate of the estimator and denotes the binary entropy function . This inequality has a left hand side (LHS) that is strictly increasing in for . Thus the restriction of the LHS to is invertible, and since is fixed, we can say that is lower bounded by a monotonically decreasing function of . In some cases we do achieve near equality in (1) - particularly when 1.) the estimator performs (nearly) equally well on each class and 2.) the estimator incurs relatively low levels of compression when compared to that which was incurred in the map .
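The inversion argument above can be made concrete. The sketch below assumes the standard form of Fano's inequality, H_b(P_e) + P_e log2(|Y| - 1) ≥ H(Y|Z) with H(Y|Z) = H(Y) - I(Y;Z), which may differ in presentation from inequality (1); the function names are hypothetical. It inverts the left hand side by bisection on the interval where it is strictly increasing.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits, with the convention 0 * log(0) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_lhs(p_err, num_classes):
    """Left hand side of Fano's inequality: H_b(P_e) + P_e * log2(|Y| - 1)."""
    return binary_entropy(p_err) + p_err * math.log2(num_classes - 1)

def fano_error_lower_bound(cond_entropy_bits, num_classes, tol=1e-12):
    """Smallest P_e in [0, (|Y|-1)/|Y|] whose Fano left hand side reaches H(Y|Z)."""
    lo, hi = 0.0, (num_classes - 1) / num_classes
    if cond_entropy_bits <= 0.0:
        return 0.0
    if cond_entropy_bits >= fano_lhs(hi, num_classes):
        return hi
    # fano_lhs is strictly increasing on [lo, hi], so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fano_lhs(mid, num_classes) < cond_entropy_bits:
            lo = mid
        else:
            hi = mid
    return hi
```

For example, `fano_error_lower_bound(h_y - i_yz, num_classes)` gives the smallest error rate compatible with a representation carrying `i_yz` bits about a class variable of entropy `h_y`; the bound shrinks as `i_yz` grows.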
III-C Information Losses
We now turn to the reason for caring about the second aspect of finding a minimally sufficient statistic - the minimality. This is where the role of our sampled data comes into play, and with it, the concept of information losses.
When we train on a finite sample of data, achieving the first aspect of a minimally sufficient statistic - the sufficiency - becomes difficult. This is because, no matter what representation we choose, we always have an information loss of the form:
[TABLE]
(The superscript (1) here is to distinguish between this form of information loss and another form which will appear later. We will call the current form type one information losses). In choosing our representation, we will only be able to control the latter term in this expression, as that term corresponds to the model we have estimated from our training data. Thus, if this loss is large, then, no matter what we do, we will have trouble in making as large as possible.
Throughout this paper, we will find that this term, , depends on . In the old bounds (i.e. previous to this paper), its dependence is exponential [9]:
[TABLE]
where is the number of training samples. And so we see that, at least in this form, keeping low is pertinent.
In this paper, we will find that the dependence on is relaxed to a linear one. Thus it may not always be so clear that we should minimize . A perhaps more illuminating perspective can be found if we transfer instead to what we call type two information losses. These relate the best possible representation (in terms of achieving sufficiency) to the one that we would obtain by optimizing jointly with our estimated probability distribution. Before describing this new type of information loss, we will need to rigorously define the representations that we qualitatively described in the previous sentence.
Definition 1.
Let . We denote as and any random variables that are at most for the following information bottleneck problems respectively:
[TABLE]
[TABLE]
We will then define type two information losses as
[TABLE]
which is, in general, a function of . Then, rearranging, we see that the quantity we care about, , is given by , and so picking an that maximizes this expression is critical, though it may not always result in a direct minimization of .
In any case, it is easy to convert bounds on type one information losses into corresponding bounds on type two information losses, as we will see in the next lemma.
Lemma 1.
Suppose that we have a bound of the form , where can be any function of any number of arguments. Then:
[TABLE]
III-D Automatic Implementation via Neural Networks
There is evidence [4][3] that neural networks automatically solve the information bottleneck problem. The first set of evidence is experimental. Authors of [4] found that a wide range of neural networks undergo training in two phases. In the first phase, the neural networks memorized the inputs. This corresponded to an increase of $I(X;Z)$ and $I(Y;Z)$ simultaneously. During this phase, the average magnitude of back-propagated gradients surpassed the variance. In the second phase, this dynamic swapped and the variance surpassed the average. During this phase, $I(Y;Z)$ increased, but $I(X;Z)$ dropped - the neural networks were compressing the input to learn more about $Y$.
The second set of evidence is theoretical. The authors of [3] show that is tightly related to the information between the weights and the data . This relationship holds with only a few assumptions on the corresponding neural network. They then show that is small when the network converges to a wide local minimum of the cross entropy loss function. Finally, they argue that stochastic gradient descent tends to converge to such minima.
Some more recent experimental evidence [5] counters these two arguments. This new evidence shows that some networks can achieve high without compression. Thus some networks can significantly outperform the lower bound of inequality (3). This paper presents new lower bounds which are much tighter and less sensitive to than (3). These bounds, while useful in their own right, help to explain this counter-evidence.
IV New Bounds on Information Losses
We will now move on to deriving the new bounds on information losses.
IV-A Product Form Decomposition - Intuition and Setup
Our first major step is a decomposition of information losses into a product of two terms, one being , and the other being a term related to a statistical distance between and . The proof of this decomposition takes some setting up. The setup is performed by generalizing the well studied maximal coupling [10] from statistics to our purposes. We will call our generalization the conditional maximal coupling, and will begin its construction by quickly reviewing couplings in general [11].
Definition 2 (Coupling).
Given two probability models and on a list of variables , a coupling of these models is a pair of random variables with joint distribution such that the marginal distributions satisfy and .
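As a point of reference for the construction that follows, the classical (unconditional) maximal coupling of two discrete distributions can be written down explicitly. This is a minimal sketch of the textbook construction, not of the conditional version developed in this paper; the function name `maximal_coupling` and the example distributions are hypothetical.

```python
def maximal_coupling(p, q):
    """Joint pmf (nested list) of a maximal coupling of two pmfs on {0, ..., k-1}.

    Returns the joint pmf together with the total variation distance between p and q.
    """
    k = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    tv = 1.0 - sum(overlap)              # total variation distance between p and q
    joint = [[0.0] * k for _ in range(k)]
    for i in range(k):
        joint[i][i] = overlap[i]         # mass on which the two variables agree
    if tv > 0:
        for i in range(k):
            for j in range(k):
                # Residual masses have disjoint supports and are paired independently.
                joint[i][j] += (p[i] - overlap[i]) * (q[j] - overlap[j]) / tv
    return joint, tv

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
joint, tv = maximal_coupling(p, q)
prob_disagree = 1.0 - sum(joint[i][i] for i in range(len(p)))
```

The row sums of `joint` recover `p`, the column sums recover `q`, and the two coupled variables disagree with probability equal to the total variation distance, which is the smallest disagreement probability any coupling can achieve.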
Construction 2 (Conditional Maximal Coupling).
We set our coupling as follows. First, define the function through
[TABLE]
Next, define a real number as
[TABLE]
and define as a Bernoulli random variable with success probability . Then define variables and through
[TABLE]
Next define as functions of the above random variables as follows:
[TABLE]
Finally, we define and through
[TABLE]
Lemma 2.
Construction 2 yields a valid coupling.
Lemma 3.
The definitions of Construction 2 satisfy the following relationship:
[TABLE]
Motivated by Lemma 3, we will denote as . This notation emphasizes its role as an average total variation distance. This finishes our setup for the decomposition, which we will now move on to prove.
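For discrete feature and class spaces, an average total variation distance of this kind is straightforward to compute. The sketch below assumes the quantity is the expectation over the feature distribution of the total variation between the true and estimated class posteriors, with the common 1/2 normalization; both the normalization and the names used here are assumptions for illustration.

```python
def conditional_total_variation(p_x, p_y_given_x, q_y_given_x):
    """Average over x ~ p_x of the total variation between p(.|x) and q(.|x)."""
    total = 0.0
    for px, p_row, q_row in zip(p_x, p_y_given_x, q_y_given_x):
        tv_x = 0.5 * sum(abs(a - b) for a, b in zip(p_row, q_row))
        total += px * tv_x
    return total

# Hypothetical example: two feature values, two classes.
p_x = [0.5, 0.5]
true_posterior = [[1.0, 0.0], [0.0, 1.0]]
estimated_posterior = [[0.9, 0.1], [0.2, 0.8]]
avg_tv = conditional_total_variation(p_x, true_posterior, estimated_posterior)
```

Here the per-feature distances are 0.1 and 0.2, so the average is 0.15; the value is zero exactly when the two posteriors agree on every feature value with positive probability.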
IV-B Product Form Decomposition - Theorem and Proof
Theorem 1.
[TABLE]
Proof.
We will use several Markov chains in this proof. All of them follow from the Bayesian network shown in figure 2, which describes the generative process of all relevant random variables. Each Markov chain that we use comes from the fact that the variables d-separate the variables from the rest of the network.
First, via coupling, we have
[TABLE]
We decompose the above terms as follows:
[TABLE]
But, due to the Markov chains and , we have . Furthermore, , so:
[TABLE]
We can further decompose each of these terms as:
[TABLE]
But we have from the Markov chains and that , so these terms will disappear from the decomposition. Next, we can break down the term to:
[TABLE]
and similarly, we can break down:
[TABLE]
But when , . Thus, in total, is given by:
[TABLE]
which can be bounded by the triangle inequality on each inner term.
Now, from the Markov chains , , , and , we have (via applications of the data processing inequality and its corollaries [7]):
[TABLE]
Further, and . Then as , we have
[TABLE]
And so, in total, we have
[TABLE]
which completes the proof. ∎
A potentially useful special case of this bound occurs when we set :
Corollary 1.
If is discrete,
[TABLE]
But we won’t be using this corollary in the rest of the paper.
IV-C Understanding
The above relationship looks linear in . However, is typically learned jointly with and therefore may itself depend on . Thus we cannot yet say that this relationship is truly linear, and we certainly cannot yet say that it is tight. Before we can make those claims, we will need to study explicitly. We will begin with a ‘sanity-check’ lemma. This lemma shows us that does at least converge with the convergence of a typical neural classifier loss function. It arises from an application of Pinsker’s inequality [12].
Lemma 4.
Suppose that . Then:
[TABLE]
where is the conditional cross entropy between and , i.e. the usual cross entropy loss function.
This lemma is particularly applicable when we are estimating our cross entropy error on a validation set, as we can then take in this lemma to be the empirical measure corresponding to the validation or training sample, in which we are almost certain to have . In this sense Lemma 4 can bound such empirical estimates of .
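The mechanism behind a bound of this kind can be illustrated for a single pair of discrete distributions. The sketch below checks Pinsker's inequality, TV(p, q) ≤ sqrt(KL(p || q) / 2) with KL measured in nats, and the identity that cross entropy equals entropy plus KL divergence. It is an illustration of these two standard facts only, not a reproduction of Lemma 4; the function names and example distributions are hypothetical.

```python
import math

def total_variation(p, q):
    """Total variation distance between two pmfs on the same finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(a * math.log(a) for a in p if a > 0)

def cross_entropy(p, q):
    """Cross entropy in nats; requires q > 0 wherever p > 0."""
    return -sum(a * math.log(b) for a, b in zip(p, q) if a > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats; requires q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def pinsker_bound(p, q):
    """Upper bound on total variation given by Pinsker's inequality."""
    return math.sqrt(0.5 * kl_divergence(p, q))

p = [0.7, 0.2, 0.1]   # stands in for a true class posterior at one input
q = [0.5, 0.3, 0.2]   # stands in for an estimated class posterior at that input
tv = total_variation(p, q)
bound = pinsker_bound(p, q)
excess = cross_entropy(p, q) - entropy(p)   # equals KL(p || q)
```

Because the excess cross entropy over the entropy is exactly the KL divergence, driving the cross entropy loss toward its minimum forces the Pinsker bound, and hence the total variation, toward zero.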
IV-D Bounding - Setting
Finally, we will derive a rate of decrease for in a general continuous learning algorithm. Our setup will involve defining a learning algorithm as a continuous map from a special topology on input probability measures on to conditional probability functions. This is basically to say that, given a training dataset (i.e. an empirical measure on ), we have a well-behaved way of obtaining the corresponding . This is just slightly generalized so that we can consider any input measure (empirical or not) as a ‘training dataset’. We begin by reviewing that special topology, and then we will construct the topology that we will place on our output conditional probability distributions.
Definition 3.
Let denote the set of Borel probability measures on . Then the -topology [13] (page 263) is the topology generated by the sets for all bounded Borel measurable functions , all and all . If we restrict to bounded continuous functions, we get the weak topology , which is strictly coarser than the -topology.
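For reference, the generating sets of the $\tau$-topology can be written out explicitly. The display below follows the standard large-deviations formulation; the symbols $\mathcal{M}_1(\Sigma)$ for the set of Borel probability measures on the underlying space $\Sigma$, $B(\Sigma)$ for the bounded Borel measurable functions, and $W_{f,x,\delta}$ for the generating sets are assumed notation and may differ from the symbols used elsewhere in this paper.

```latex
W_{f,x,\delta} \;=\; \Bigl\{ \nu \in \mathcal{M}_1(\Sigma) \;:\; \Bigl| \int_{\Sigma} f \, d\nu - x \Bigr| < \delta \Bigr\},
\qquad f \in B(\Sigma),\quad x \in \mathbb{R},\quad \delta > 0 .
```

Restricting $f$ to bounded continuous functions in the same display gives generating sets for the weak topology, which is why every weakly open set is also open in the $\tau$-topology.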
Definition 4.
Let be the probability simplex in dimensions. Let denote the space of absolutely integrable functions from to with norm . Let denote the product space on , consisting of functions from to which are absolutely integrable in each output dimension, and with norm . Finally, let denote the subspace of to the set of functions whose co-domain is .
The topology we’ve placed on is metrized by the conditional total variation function that we’ve been working with. With these topologies defined, we will restrict ourselves to the study of algorithms which act as continuous maps between these topologies. This essentially requires that, when our training datasets are very similar (e.g. moving one training point to a point within a distance from the original), our algorithm will return very similar output functions in terms of conditional total variation. Thus this condition is somewhat related to algorithmic stability [14], though not completely equivalent.
We will obtain two bounds on in the remainder of this paper. The first is asymptotic, and applies when we have continuity from the -topology. The second is non-asymptotic, and applies when we further have continuity from the weak topology. We will next show that gradient descent algorithms, under mild conditions, achieve these continuities.
Theorem 2.
Let denote a normed parameter space and let denote a loss function which is integrable in for each , which is differentiable with respect to for all , and whose -gradients yield bounded continuous functions on when evaluated at each point . Suppose further that our parameter space admits Lipschitz-continuous outputs for each . That is, . Then gradient descent applied to the empirical risk minimization of , with a fixed initialization and proceeding for a fixed number of iterations, is continuous from to .
If we relax the condition that the gradients of be bounded continuous functions on when evaluated at each point to just bounded measurable functions, then this algorithm is still continuous from to .
Proof.
The assumptions on allow us to differentiate (with respect to ) under the integral sign. Let denote the step size of the iteration. Let . We proceed by induction on the number of iterations.
Let . Let . Let and let be contained in the open set of the weak topology given by (which clearly contains ). Let denote the parameter chosen after one gradient update when training on , and let denote the parameter chosen after one gradient update when training on . Then:
[TABLE]
so
[TABLE]
and so the hypothesis is true if our algorithm consists of one iteration.
Suppose that the hypothesis holds when we use iterations. Let . Let and let . Choose an open set of the weak topology such that when which is possible by the induction hypothesis, and where and denote the chosen parameters after iteration of the gradient descent when trained on and . Let . Then by the triangle inequality:
[TABLE]
so the conditional total variation between and is less than or equal to which is equal to .
For the final statement, note that all of the above open sets in the -topology used in this proof remain open sets in the -topology when we relax the conditions of . This completes the proof. ∎
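The continuity property in Theorem 2 can be observed empirically on a small problem. The following is a minimal sketch, not a verification of the theorem: it runs full-batch gradient descent on a one-dimensional logistic model with a fixed initialization and a fixed number of steps, perturbs a single training point by a decreasing amount, and measures how far the predicted label distributions move. The model, the data, the step size, and all names are hypothetical choices made for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def train_logistic(data, steps=50, lr=0.5, init=(0.0, 0.0)):
    """Full-batch gradient descent on the empirical cross entropy of sigmoid(w*x + b)."""
    w, b = init
    n = len(data)
    for _ in range(steps):
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in data) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def output_gap(params_a, params_b, grid):
    """Average total variation between the two predicted label distributions over a grid."""
    gap = 0.0
    for x in grid:
        pa = sigmoid(params_a[0] * x + params_a[1])
        pb = sigmoid(params_b[0] * x + params_b[1])
        gap += abs(pa - pb)   # total variation between Bernoulli(pa) and Bernoulli(pb)
    return gap / len(grid)

data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
grid = [i / 10.0 for i in range(-30, 31)]
base = train_logistic(data)

gaps = []
for eps in (0.5, 0.05, 0.005):
    # Move only the first training point by eps; the empirical measure changes slightly.
    perturbed = [(x + eps, y) if i == 0 else (x, y) for i, (x, y) in enumerate(data)]
    gaps.append(output_gap(base, train_logistic(perturbed), grid))
```

As the perturbation shrinks, the entries of `gaps` shrink with it, which is the qualitative behavior that continuity from the weak topology describes for nearby empirical measures.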
IV-E Bounding - The Asymptotic Case
We now wish to bound the conditional total variation of an estimated model against the true model when we use such a general learning algorithm in our setting. We will re-label to to emphasize that our estimated model is coming from such an algorithm. We then have the following asymptotic theorem on the rate of decay for . This will apply whenever we have continuity from the -topology in our algorithm, and will be used in our non-asymptotic specialization that follows. We will use two final lemmas in both of those proofs.
Lemma 5.
Let be a probability space and let be bounded and measurable. Let denote the set of non-negative measurable functions with expectation . Then .
Lemma 6.
Let be a probability space and let be bounded and measurable with . Then .
Theorem 3.
Let , and let . Suppose that is a continuous learning algorithm from to such that, for any , the total variation between and is smaller than the total variation between and at any point in the support of . Suppose further that the ‘training’ total variation, , is bounded above by . Then:
[TABLE]
where is the probability measure on induced by the sampling of data-points on .
Proof.
For notational convenience, we will denote as the conditional total variation between and for a fixed .
We will first need to show that the map , given by is continuous from the -topology to the Euclidean topology. This is trivial since is just the composition of , which was assumed continuous, with the fixed-point distance function defined over .
Now, let . By the above continuity and by the fact that is closed in , we have that is closed. Then, by Sanov’s Theorem [13]:
[TABLE]
We thus wish to lower bound over . We begin by decomposing into , where and are the marginal distributions of and on . We are guaranteed that the functions and exist on the support of since is discrete. The KL-divergence then becomes: where is bounded below (via Pinsker’s inequality) by the function , which itself is bounded below by because the absolute value of the second term in this expression is smaller than that of the first term for each point in the support of . The first term is just the function defined at the start of this proof. We will call the second term . We can lower bound this expression one more time with . We are left with:
[TABLE]
We will bound these two remaining terms separately. The second is taken care of in this theorem’s hypothesis, being bounded below by . For the first, we can combine Lemmas 5 and 6 to obtain a lower bound of (since ).
Since neither of these two bounds depend on , negating their sum yields the result. ∎
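The exponential rate that Sanov's theorem provides can be seen in the simplest possible setting, a Bernoulli source. The sketch below is an illustration under that assumption and is unrelated to the specific rate function of Theorem 3: it compares the exact probability that the empirical mean of n Bernoulli(p) draws is at least a, for a > p, with the rate given by the KL divergence between Bernoulli(a) and Bernoulli(p). All names and parameter values are hypothetical.

```python
import math

def bernoulli_kl(a, p):
    """KL(Bernoulli(a) || Bernoulli(p)) in nats, for 0 < a, p < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

def upper_tail(n, p, a):
    """Exact probability that the empirical mean of n Bernoulli(p) draws is >= a."""
    k_min = math.ceil(n * a - 1e-12)
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

p, a = 0.3, 0.5
rate_limit = bernoulli_kl(a, p)
# Empirical decay rate -(1/n) log P(mean >= a) for increasing sample sizes.
rates = {n: -math.log(upper_tail(n, p, a)) / n for n in (10, 100, 1000)}
```

Each empirical rate sits above `rate_limit`, as the Chernoff bound requires, and the gap narrows as n grows, which is the large-deviations behavior that the proof above relies on.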
IV-F Bounding - The Non-Asymptotic Case
The previous theorem gives us:
[TABLE]
where refers to any terms such that . We will need to study since it’s somewhat of an unknown here, and may be large for small . The next theorem, which is non-asymptotic, will take care of this when is continuous from the weak topology.
Theorem 4.
Take all assumptions from Theorem 3, but remove the assumption that be a continuous map from to and assume it is instead continuous linear from . Suppose further that is compact, and that has full support with density everywhere. Then there exists a function with such that:
[TABLE]
(A more detailed description of , from which we can discover more of its properties, is contained in the proof).
Proof.
Let the notations and be defined as they were in the proof of Theorem 3.
Let constitute a family of conditions, indexed first by samples of points of and second by functions , which state that , where the second expectation is the Monte Carlo estimate over the indexed sample.
Let the sets , indexed first over samples of consisting of points and second over the set , be given by (where the run over the sampled points in and runs over the possible choices of ). Let denote the family of conditions where the run over the sampled points and the choices of and correspond to those of . Let denote the condition on measures such that there exists a measure with . Note that .
Let denote the vector space of finite signed measures on endowed with the weak topology. For any probability measure , let be the subspace of measures with marginal distribution . Let be the subset of consisting of probability measures. Define a linear map on , denoted , which takes to its disintegration .
Let denote the family of real-valued functions (indexed by ) taking to the value , which is to be taken as infinite when the support of is not a superset of the support of , and is further infinite when is not absolutely continuous with respect to . Note that each is convex and continuous in the weak topology for each fixed (as and everywhere by the theorem’s hypothesis), and each is concave and continuous for each fixed .
Now, since is compact, is compact in the weak topology. Then for any , is compact (being a closed subset of a compact space). Then is compact and convex. We also have that the subsets , , and are all closed, and therefore compact. We also have convexity in , but not in the other two.
Arbitrarily pick some with full support and denote as as . Let denote the minimum of the expression over and denote the minimizer as . The image of the map is a compact subset of - i.e. a closed and bounded interval . Let denote the union of these intervals over the finite indices . Cover this interval with a family of subintervals of size .
We will now fix to be the smallest number such that there exists a sample in which both for all in which and for all in which . Such a exists, and is less than or equal to since is all of . Fix to any of the samples that we just established the existence of. We will drop the notations and from the notation for any conditions referring to them from now on.
Now, denote as the set of bounded continuous functions from to and construct a family of maps indexed over and which takes to . Then for any empirical corresponding to a sample of points, we have that for all . Thus the probability that is in is bounded above by the probability that . Then by Chernoff’s inequality, we have that is bounded above by:
[TABLE]
where the first expectation is taken over .
The first term can be reduced to . Optimizing over yields a bound of
[TABLE]
We will denote as the set of conditional probability functions such that there exists with disintegration given by . We will also denote a function defined on which yields when the support of the latter argument is equal to the domain of the former, and is infinite otherwise. Note that is convex and lower-semicontinuous in for fixed since it is linear in the convex subset and infinite outside of this subset. Finally, we will define the function given by . This function is concave in , convex in , and lower semicontinuous in [13]. Then (39) is upper bounded by:
[TABLE]
Note also that the objective function of this expression is decoupled for and . We can thus swap the supremum with the first infimum. But then inside the first infimum, we are left with an objective function to which a minimax theorem applies [15] because is compact and convex in the weak topology when is compact, and so we can swap the supremum with the second infimum as well. Since the first term does not depend on , we can then consider for each fixed the expression . But the supremum of this function over is none other than the divergence between and [16]. We are thus left with a full upper bound of (now optimizing over ):
[TABLE]
We would be able to swap the supremum and infimum if our feasible set were convex and compact. This is true for our search space over , but not for . Our goal is then to transform into , which is convex, with corresponding error terms included. This can be done by tightening to and then relaxing that set to . This will incur some error, but if we end up choosing to be the disintegration of , then this error will be bounded by .
With our feasible set now being , we can swap the supremum and infimum, and then pick to be equal to on the support of , and arbitrary elsewhere. The objective function is then just the minimum divergence over , which we know how to deal with due to the proof of Theorem 3. Minimizing then gives us both given by the disintegration of , and with the objective function bounded by . If we again add the constraint to the feasible region (with another error of at most added on), then this is bounded above by . Union bounding over yields the result. ∎
IV-G Some Insights
We have established that, with probability at least , the following holds:
[TABLE]
where and we can usually take (as we can make this arbitrarily small with a large enough network, due to [17] and Lemma 4 if we train on cross-entropy errors). is trivially less than or equal to , but it is generally going to be quite small since it depends on a statement that only requires the existence of functions satisfying an empirical deviation bound. This is in contrast to classical statistical learning theory bounds, which instead require statements of the same sort to hold for all functions. Furthermore, is not strictly increasing with model complexity. On the contrary, can decrease as the hypothesis space grows (given that we maintain continuity), since having more functions will increase the probability of such existences. By Theorem 3, we can also assume that as . These intuitions tell us that the decomposition in Theorem 1 has successfully extracted a good amount of the problem’s complexity into the term . The primary complexity term in , given a sufficiently complex hypothesis space, arises from the complexity of the class variable itself.
V Experiments
V-A How These Bounds Solve Experimental Discrepancy
We argue that the bounds presented in this paper explain the experimental discrepancy to which we have alluded several times. These tightened, less sensitive bounds imply that, in many cases, it is simply not optimal in terms of information losses to compress a neural network’s input. This can be seen visually in Figure 3. Here we have set up a toy classification problem with , , and . The information quantities in this toy example are thus similar to MNIST [18]. We have plotted along with the bounds of this paper (assuming ) for , , and data points. We see that very little to nothing can be gained by compression in the and cases. Serious gains can only be obtained in the case. On the right side of this figure, we plot the old bounds, which predict a peak at around bits even for data points. Thus the lack of compression found experimentally on smaller datasets is explained by our new bounds, but not by the old ones.
But if the entropy of the feature space becomes large, as we’ve made it for the third plot in this figure, compression becomes important even with our new bounds. This helps to explain why neural networks seem to yield compression on ‘harder’ datasets, but do not on ‘easier’ ones.
V-B Tightness of Bounds
For these experiments, we have used the MINE-f [19] estimator of mutual information for quantities. We assume that is equal to , and estimate via validation error probability and Fano’s inequality. To make the classifier representation stochastic, we used permanent dropout with a rate of . All classifiers are trained for epochs, and all information estimations are performed for epochs. All neural networks are trained with the Adam optimizer. All models used a learning rate of .
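The Fano step can be illustrated with a short sketch. It assumes the standard form of Fano's inequality for a classifier over a finite label set, H(Y | Ŷ) ≤ H_b(P_e) + P_e log2(|Y| − 1), which gives the lower bound I(Y; Ŷ) ≥ H(Y) − H_b(P_e) − P_e log2(|Y| − 1). The function names and the example numbers are illustrative assumptions and are not taken from the experiments reported here.

```python
import math


def binary_entropy_bits(p):
    """Binary entropy H_b(p) in bits, with H_b(0) = H_b(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def fano_information_lower_bound(error_prob, num_classes, label_entropy_bits):
    """Lower bound (in bits) on I(Y; Y_hat) from an error probability.

    Fano's inequality upper-bounds the conditional entropy H(Y | Y_hat);
    subtracting that bound from H(Y) lower-bounds the mutual information.
    """
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    conditional_entropy_bound = (
        binary_entropy_bits(error_prob)
        + error_prob * math.log2(num_classes - 1)
    )
    # Mutual information is non-negative, so clip the bound at zero.
    return max(0.0, label_entropy_bits - conditional_entropy_bound)


# Hypothetical example: 10 balanced classes and a 10% validation error.
bound = fano_information_lower_bound(0.1, 10, math.log2(10))
print(f"information lower bound: {bound:.3f} bits")
```

With a validation error probability in place of the true error probability, the result is an estimate rather than a guaranteed bound.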
We first tested the non-asymptotic bound of Theorem 4 on four of the datasets provided by OpenML [20] across several training data sizes (dependent on the overall size of the dataset in question). Our classifier consisted of a neural network with a single hidden layer of units. The results are plotted in Figure 4. We took a confidence interval for the plot of the bound, and plotted the mean value of ten experiments for the ‘true’ confidence interval (assuming a symmetric distribution). We estimated via with . In each case, we estimated and in-sample for the smallest tested training data size. This, of course, only gives us a ‘functional behavior’ experiment, but we do see that this behavior is consistent with the true values.
We then tested the bound of Theorem 1 for MNIST and CIFAR-10 using the true value of in each case. The results are shown in Figure 5. Each dataset is experimented on with a classifier given by a fully connected neural network with a single hidden layer, with varying hidden layer sizes. The deviations here are to show that the bound is decent across differing architectures. The bound is quite close to the true confidence interval in each case.
VI Conclusion
This paper presented new bounds on information losses from finite data. This began in the form of a relationship between these losses, the expected total variation of the neural model, and the information held in the hidden representation of the feature space. Then, by bounding the total variation term without invoking any more dependence on model complexity, we obtained bounds that are much tighter and less sensitive to than previous theory. The paper provided applications of this theoretical framework, focusing primarily on relevant contradictory experimental work that previously went unexplained. It concluded with experiments showing that the bound presented in this paper corresponds well to experiment.
-A Proof of Lemma 1
Proof.
[TABLE]
∎
-B Proof of Lemma 2
Proof.
We first check that the defined variables and have valid distributions. For to be valid, we need only check that . Indeed by replacing the min operation in with , we have
[TABLE]
The variable is similarly valid as can be seen as follows:
[TABLE]
And the variables and follow similarly with , and .
We then need to show that the marginals of the coupling satisfy and . To begin, we first show that and that as follows:
[TABLE]
[TABLE]
Finally, since we defined and through the distributions , we have
[TABLE]
[TABLE]
∎
-C Proof of Lemma 3
Proof.
To prove the first equality, define the following subsets of .
[TABLE]
Then for any coupling of these two models,
[TABLE]
It follows that:
[TABLE]
But we also have for this particular coupling, that . Thus we must have equality.
To prove the second equality, we will use the fact that . Then
[TABLE]
Thus ∎
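The coupling argument used in Lemmas 2 and 3 can be checked numerically. The sketch below is not the construction of Lemma 2 itself: it assumes two distributions on a finite alphabet and uses the standard maximal coupling, which places the overlap min(p, q) on the diagonal and spreads the remaining mass as a product of the normalized residuals. All names are illustrative.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def maximal_coupling(p, q):
    """Joint distribution with marginals p and q that maximizes P(X = Y)."""
    n = len(p)
    overlap = [min(a, b) for a, b in zip(p, q)]
    residual_mass = 1.0 - sum(overlap)  # equals the total variation distance
    joint = [[0.0] * n for _ in range(n)]
    for i in range(n):
        joint[i][i] = overlap[i]  # mass on which the two variables agree
    if residual_mass > 1e-12:
        for i in range(n):
            for j in range(n):
                # The residuals have disjoint supports, so this term is
                # zero on the diagonal and only adds off-diagonal mass.
                joint[i][j] += (
                    (p[i] - overlap[i]) * (q[j] - overlap[j]) / residual_mass
                )
    return joint


p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
joint = maximal_coupling(p, q)
disagreement = 1.0 - sum(joint[i][i] for i in range(len(p)))
print(total_variation(p, q), disagreement)
```

For this coupling the probability of disagreement equals the total variation distance, and the identity TV(p, q) = 1 − Σ min(p, q) holds by construction.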
-D Proof of Lemma 4
Proof.
[TABLE]
∎
-E Proof of Lemma 5
Proof.
This infimum can be found by the following Lagrangian: (we will see that we do not need to worry about the constraints, because the solution to this Lagrangian yields a function for which those constraints are not tight). The functional derivative of this Lagrangian is . Setting this to zero yields . Setting through normalization then yields where . Plugging this solution into our objective yields . Since our objective function was a strictly convex functional with a positive second variation given by , this is a minimizer. ∎
-F Proof of Lemma 6
Proof.
This follows from reference [21] (Theorem 1) with while replacing with . Denote . The range of is a subset of . On this set, the supremum of is . Thus . But (because has range bounded by ). We thus have . This completes the proof since . ∎
References
- [1] N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
- [2] N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in 2015 IEEE Information Theory Workshop (ITW). IEEE, 2015, pp. 1–5.
- [3] A. Achille and S. Soatto, “On the emergence of invariance and disentangling in deep representations,” arXiv preprint arXiv:1706.01350, 2017.
- [4] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
- [5] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, “On the information bottleneck theory of deep learning,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=ry_WPG-A-
- [6] D. Blackwell, “Conditional expectation and unbiased sequential estimation,” The Annals of Mathematical Statistics, pp. 105–110, 1947.
- [7] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
- [8] S. Verdu et al., “Generalizing the Fano inequality,” IEEE Transactions on Information Theory, vol. 40, no. 4, pp. 1247–1251, 1994.
