GAIT: A Geometric Approach to Information Theory
Jose Gallego-Posada, Ankit Vani, Max Schwarzer, Simon Lacoste-Julien

TL;DR
This paper introduces GAIT, a geometric information theory framework that incorporates symbol similarities into entropy measures, providing efficient divergence computations and versatile applications across generative modeling, image analysis, and empirical data approximation.
Contribution
The paper presents a novel geometry-aware entropy and divergence framework that integrates symbol similarities, offering computational efficiency and broad applicability.
Findings
Performance comparable to Wasserstein-based methods
Closed-form divergence expression for efficiency
Versatile applications demonstrated across domains
Abstract
We advocate the use of a notion of entropy that reflects the relative abundances of the symbols in an alphabet, as well as the similarities between them. This concept was originally introduced in theoretical ecology to study the diversity of ecosystems. Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. Notably, our proposed divergence exhibits performance on par with state-of-the-art methods based on the Wasserstein distance, but enjoys a closed-form expression that can be computed efficiently. We demonstrate the versatility of our method via experiments on a broad range of domains: training generative models, computing image barycenters, approximating empirical measures and counting modes.
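Table 1: Definitions of the GAIT joint entropy, conditional entropy, mutual information and conditional mutual information.
| Quantity | Definition |
|---|---|
| Joint Entropy | $\mathbb{H}^{\mathbf{K},\mathbf{\Lambda}}[X, Y] \triangleq -\mathbb{E}_{x,y}\left[\log \mathbb{E}_{x',y'}\left[\kappa(x,x')\lambda(y,y')\right]\right]$ |
| Conditional Entropy | $\mathbb{H}^{\mathbf{K},\mathbf{\Lambda}}[X \mid Y] \triangleq \mathbb{H}^{\mathbf{K},\mathbf{\Lambda}}[X, Y] - \mathbb{H}^{\mathbf{\Lambda}}[Y]$ |
| Mutual Information | $\mathbb{I}^{\mathbf{K},\mathbf{\Lambda}}[X; Y] \triangleq \mathbb{H}^{\mathbf{K}}[X] + \mathbb{H}^{\mathbf{\Lambda}}[Y] - \mathbb{H}^{\mathbf{K},\mathbf{\Lambda}}[X, Y]$ |
| Conditional M.I. | $\mathbb{I}^{\mathbf{K},\mathbf{\Lambda},\mathbf{\Theta}}[X; Y \mid Z] \triangleq \mathbb{H}^{\mathbf{K},\mathbf{\Theta}}[X \mid Z] + \mathbb{H}^{\mathbf{\Lambda},\mathbf{\Theta}}[Y \mid Z] - \mathbb{H}^{\mathbf{K},\mathbf{\Lambda},\mathbf{\Theta}}[X, Y \mid Z]$ |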
Jose Gallego, Ankit Vani, Max Schwarzer, Simon Lacoste-Julien†
Mila and DIRO, Université de Montréal
1 Introduction
Shannon’s seminal theory of information (1948) has been of paramount importance in the development of modern machine learning techniques. However, standard information measures deal with probability distributions over an alphabet considered as a mere set of symbols and disregard additional geometric structure, which might be available in the form of a metric or similarity function. As a consequence of this, information theory concepts derived from the Shannon entropy (such as cross entropy and the Kullback-Leibler divergence) are usually blind to the geometric structure in the domains over which the distributions are defined.
This blindness limits the applicability of these concepts. For example, the Kullback-Leibler divergence cannot be optimized for empirical measures with non-matching supports. Optimal transport distances, such as Wasserstein, have emerged as practical alternatives with theoretical grounding. These methods have been used to compute barycenters (Cuturi and Doucet, 2014) and train generative models (Genevay et al., 2018). However, optimal transport is computationally expensive as it generally lacks closed-form solutions and requires the solution of linear programs or the execution of matrix scaling algorithms, even when solved only in approximate form (Cuturi, 2013). Approaches based on kernel methods (Gretton et al., 2012; Li et al., 2017; Salimans et al., 2018), which take a functional analytic view on the problem, have also been widely applied. However, further exploration on the interplay between kernel methods and information theory is lacking.
Contributions. We i) introduce to the machine learning community a similarity-sensitive definition of entropy developed by Leinster and Cobbold (2012). Based on this notion of entropy we ii) propose geometry-aware counterparts for several information theory concepts. We iii) present a novel notion of divergence which incorporates the geometry of the space when comparing probability distributions, as in optimal transport. However, while the former methods require the solution of an optimization problem or a relaxation thereof via matrix-scaling algorithms, our proposal enjoys a closed-form expression and can be computed efficiently. We refer to this collection of concepts as Geometry-Aware Information Theory: GAIT.
Paper structure. We introduce the theory behind the GAIT entropy and provide motivating examples justifying its use. We then introduce and characterize a divergence as well as a definition of mutual information derived from the GAIT entropy. Finally, we demonstrate applications of our methods including training generative models, approximating measures and finding barycenters. We also show that the GAIT entropy can be used to estimate the number of modes of a probability distribution.
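Notation. Calligraphic letters denote sets ($\mathcal{X}$), bold letters represent matrices and vectors ($\mathbf{K}$, $\mathbf{p}$), and double-barred letters denote probability distributions and information-theoretic functionals ($\mathbb{P}$, $\mathbb{H}$). To emphasize certain computational aspects, we alternatively denote a distribution $\mathbb{P}$ over a finite space $\mathcal{X}$ as a vector of probabilities $\mathbf{p}$. $\mathbf{I}$, $\mathbf{1}$ and $\mathbf{J}$ denote the identity matrix, a vector of ones and a matrix of ones, with context-dependent dimensions. For vectors $\mathbf{v}$, $\mathbf{u}$ and $\alpha \in \mathbb{R}$, $\mathbf{v}/\mathbf{u}$ and $\mathbf{v}^{\alpha}$ denote element-wise division and exponentiation. $\langle \cdot, \cdot \rangle$ denotes the Frobenius inner-product between two vectors or matrices. $\Delta_n \triangleq \{\mathbf{x} \in \mathbb{R}^n \mid \langle \mathbf{1}, \mathbf{x} \rangle = 1 \text{ and } x_i \geq 0\}$ denotes the probability simplex over $n$ elements. $\delta_x$ denotes a Dirac distribution at point $x$. We adopt the conventions $0 \cdot \log(0) = 0$ and $x \log(0) = -\infty$ for $x > 0$.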
Reproducibility. Our experiments can be reproduced via: https://github.com/jgalle29/gait
2 Geometry-Aware Information Theory
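Suppose that we are given a finite space $\mathcal{X}$ with $n$ elements along with a symmetric function that measures the similarity between elements, $\kappa: \mathcal{X} \times \mathcal{X} \to [0, 1]$. Let $\mathbf{K}$ be the matrix induced by $\kappa$ on $\mathcal{X}$; i.e., $\mathbf{K}_{x,y} \triangleq \kappa_{xy} \triangleq \kappa(x, y) = \kappa(y, x)$. $\mathbf{K}_{x,y} = 1$ indicates that the elements $x$ and $y$ are identical, while $\mathbf{K}_{x,y} = 0$ indicates full dissimilarity. We assume that $\kappa(x, x) = 1$ for all $x \in \mathcal{X}$. We call $(\mathcal{X}, \kappa)$ a (finite) similarity space. For brevity we denote $(\mathcal{X}, \kappa)$ by $\mathcal{X}$ whenever $\kappa$ is clear from the context.
Of particular importance are the similarity spaces arising from metric spaces. Let $(\mathcal{X}, d)$ be a metric space and define $\kappa(x, y) \triangleq e^{-d(x, y)}$. Here, the symmetry and range conditions imposed on $\kappa$ are trivially satisfied. The triangle inequality in $(\mathcal{X}, d)$ induces a multiplicative transitivity on $(\mathcal{X}, \kappa)$: for all $x, y, z \in \mathcal{X}$, $\kappa(x, y) \geq \kappa(x, z)\,\kappa(z, y)$. Moreover, for any metric space of negative type, the matrix $\mathbf{K}$ of its associated similarity space is positive definite (Reams, 1999, Lemma 2.5).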
In this section, we present a theoretical framework which quantifies the “diversity” or “entropy” of a probability distribution defined on a similarity space, as well as a notion of divergence between such distributions.
2.1 Entropy and diversity
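Let $\mathbb{P}$ be a probability distribution on $\mathcal{X}$. $\mathbb{P}$ induces a similarity profile $\mathbf{K}\mathbb{P}: \mathcal{X} \to [0, 1]$, given by $\mathbf{K}\mathbb{P}(x) \triangleq \mathbb{E}_{y \sim \mathbb{P}}[\kappa(x, y)] = (\mathbf{K}\mathbf{p})_x$, where $(\mathbf{K}\mathbf{p})_x$ denotes the $x$-th entry of the result of the matrix-vector multiplication $\mathbf{K}\mathbf{p}$. $\mathbf{K}\mathbb{P}(x)$ represents the expected similarity between element $x$ and a random element of the space sampled according to $\mathbb{P}$. Intuitively, it assesses how “satisfied” we would be by selecting $x$ as a one-point summary of the space. In other words, it measures the ordinariness of $x$, and thus $1/\mathbf{K}\mathbb{P}(x)$ is the rarity or distinctiveness of $x$ (Leinster and Cobbold, 2012). Note that the distinctiveness depends crucially on both the similarity structure of the space and the probability distribution at hand.
Much like the interpretation of Shannon’s entropy as the expected surprise of observing a random element of the space, we can define a notion of diversity as expected distinctiveness: $\sum_{x \in \mathcal{X}} \mathbb{P}(x) \frac{1}{\mathbf{K}\mathbb{P}(x)}$. This arithmetic weighted average is a particular instance of the family of power (or Hölder) means. Given $\mathbf{w} \in \Delta_n$ and $\mathbf{x} \in \mathbb{R}^n_{\geq 0}$, the weighted power mean of order $\beta$ is defined as $M_{\mathbf{w}, \beta}(\mathbf{x}) \triangleq \langle \mathbf{w}, \mathbf{x}^{\beta} \rangle^{1/\beta}$. Motivated by this averaging scheme, Leinster and Cobbold (2012) proposed the following definition: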
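Definition 1 (GAIT Entropy; Leinster and Cobbold, 2012). The GAIT entropy of order $\alpha \geq 0$ of distribution $\mathbb{P}$ on finite similarity space $(\mathcal{X}, \kappa)$ is given by:
$$\mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}] \triangleq \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_i \left(\frac{1}{(\mathbf{K}\mathbf{p})_i}\right)^{1-\alpha}. \tag{1}$$
It is evident that whenever $\mathbf{K} = \mathbf{I}$, this definition reduces to the Rényi entropy (Rényi, 1961). Moreover, a continuous extension of Eq. (1) to $\alpha = 1$ via a L’Hôpital argument reveals a similarity-sensitive version of Shannon’s entropy:
$$\mathbb{H}^{\mathbf{K}}_{1}[\mathbb{P}] = -\langle \mathbf{p}, \log(\mathbf{K}\mathbf{p}) \rangle = -\mathbb{E}_{x \sim \mathbb{P}}\left[\log (\mathbf{K}\mathbf{p})_x\right]. \tag{2}$$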
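As a minimal sketch of how Eqs. (1) and (2) can be evaluated numerically, the NumPy function below computes the similarity-sensitive entropy for a Gram matrix and a probability vector. The function name `gait_entropy` and the handling of the limiting orders are illustrative choices made here, not taken from the authors' repository.

```python
import numpy as np

def gait_entropy(K, p, alpha=1.0):
    """Similarity-sensitive entropy of order alpha (illustrative helper).

    K: (n, n) symmetric Gram matrix with ones on the diagonal.
    p: (n,) probability vector.
    """
    K = np.asarray(K, dtype=float)
    p = np.asarray(p, dtype=float)
    support = p > 0                      # zero-mass points do not contribute
    Kp = (K @ p)[support]                # similarity profile on the support
    ps = p[support]
    if np.isclose(alpha, 1.0):           # Shannon-like case, Eq. (2)
        return float(-np.sum(ps * np.log(Kp)))
    if np.isinf(alpha):                  # limit: only the most common element
        return float(-np.log(Kp.max()))
    # General order, Eq. (1): (1 / (1 - alpha)) * log sum_i p_i (Kp)_i^(alpha - 1)
    return float(np.log(np.sum(ps * Kp ** (alpha - 1.0))) / (1.0 - alpha))

# Two points at distance r with similarity exp(-r), uniform distribution.
r = 2.0
K_r = np.array([[1.0, np.exp(-r)], [np.exp(-r), 1.0]])
effective_points = np.exp(gait_entropy(K_r, np.array([0.5, 0.5])))
```

With the identity Gram matrix the function returns the usual Shannon or Rényi entropy, and with an all-ones Gram matrix it returns zero for every distribution, which matches the two extremes of the two-point example.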
Let us dissect this definition via two simple examples. First, consider a distribution $\mathbf{p}_\theta = [\theta, 1-\theta]^{\top}$ over the points $\{x, y\}$ at distance $r \geq 0$, and define the similarity $\kappa_{xy} \triangleq e^{-r}$. As the points get further apart, the Gram matrix $\mathbf{K}_r$ transitions from $\mathbf{J}$ to $\mathbf{I}$. Fig. 1 displays the behavior of $\mathbb{H}^{\mathbf{K}_r}_{1}[\mathbf{p}_\theta]$. We observe that when $r$ is large we recover the usual shape of Shannon entropy for a Bernoulli variable. In contrast, for low values of $r$, the curve approaches a constant zero function. In this case, we regard both elements of the space as identical: no matter how we distribute the probability among them, we have low uncertainty about the qualities of random samples. Moreover, the exponential of the maximum entropy, $\exp\left[\sup_\theta \mathbb{H}^{\mathbf{K}_r}_{1}[\mathbf{p}_\theta]\right] \in [1, 2]$, measures the effective number of points (Leinster and Meckes, 2016) at scale $r$.
Now, consider the space presented in Fig. 2, where the edge weights denote the similarity between elements. The maximum entropy distribution in this space following Shannon’s view is the uniform distribution $\mathbf{u} = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]$. This is counter-intuitive when we take into account the fact that points A and B are very similar. We argue that a reasonable expectation for a maximum entropy distribution is one which allocates a larger share of the probability to point C and splits the remaining mass in equal proportions between points A and B. Fig. 3 displays the value of $\mathbb{H}^{\mathbf{K}}_{1}$ for all distributions on the 3-simplex. The green dot represents $\mathbf{u}$, while the black star marks the maximum GAIT entropy distribution $\mathbf{p}^*$ in [A, B, C]-coordinates. Its induced similarity profile $\mathbf{K}\mathbf{p}^*$ takes the same value at all three points. Note how Shannon’s probability-uniformity gets translated into a constant similarity profile.
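Properties. We now list several important properties satisfied by the GAIT entropy, whose proofs and formal statements are contained in (Leinster and Cobbold, 2012) and (Leinster and Meckes, 2016):
- Range: $0 \leq \mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}] \leq \log |\mathcal{X}|$.
- $\mathbf{K}$-monotonicity: Increasing the similarity reduces the entropy. Formally, if $\kappa_{xy} \geq \kappa'_{xy}$ for all $x, y \in \mathcal{X}$, then $\mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}] \leq \mathbb{H}^{\mathbf{K}'}_{\alpha}[\mathbb{P}]$.
- Modularity: If the space is partitioned into $C$ fully dissimilar groups $\mathcal{X}_1, \ldots, \mathcal{X}_C$, so that $\mathbf{K}$ is a block-diagonal matrix ($\kappa_{xy} = 0$ whenever $x$ and $y$ belong to different groups), then the entropy of a distribution on $\mathcal{X}$ is a weighted average of the block-wise entropies.
- Symmetry: Entropy is invariant to relabelings of the elements, provided that the rows and columns of $\mathbf{K}$ are permuted accordingly.
- Absence: The entropy of a distribution $\mathbb{P}$ over $(\mathcal{X}, \kappa)$ remains unchanged when we restrict the similarity space to the support of $\mathbb{P}$.
- Identical elements: If two elements are identical (two equal rows in $\mathbf{K}$), then combining them into one and adding their probabilities leaves the entropy unchanged.
- Continuity: $\mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}]$ is continuous in $\alpha \in [0, \infty]$ for fixed $\mathbb{P}$, and continuous in $\mathbb{P}$ (w.r.t. the standard topology on $\Delta_{|\mathcal{X}|}$) for fixed $\alpha \in (0, \infty)$.
- $\alpha$-monotonicity: $\mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}]$ is non-increasing in $\alpha$.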
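Several of the properties above can be checked numerically on a small example. The sketch below uses a hypothetical three-point space whose first two elements are identical; the Gram matrix, the distribution and the helper name `entropy` are illustrative values chosen here, not taken from the paper.

```python
import numpy as np

def entropy(K, p, alpha=1.0):
    """GAIT entropy of order alpha; assumes every entry of p is positive."""
    Kp = K @ p
    if alpha == 1.0:
        return float(-np.sum(p * np.log(Kp)))
    return float(np.log(np.sum(p * Kp ** (alpha - 1.0))) / (1.0 - alpha))

# Hypothetical space in which the first two elements are identical.
K = np.array([[1.0, 1.0, 0.2],
              [1.0, 1.0, 0.2],
              [0.2, 0.2, 1.0]])
p = np.array([0.1, 0.3, 0.6])

# Identical elements: merge the duplicates and add their probabilities.
K_merged = np.array([[1.0, 0.2],
                     [0.2, 1.0]])
p_merged = np.array([0.4, 0.6])

alphas = [0.0, 0.5, 1.0, 2.0, 5.0]
values = [entropy(K, p, a) for a in alphas]                      # alpha-monotonicity
merged_values = [entropy(K_merged, p_merged, a) for a in alphas]  # identical elements

# K-monotonicity: the identity Gram matrix has lower similarity everywhere.
shannon_value = entropy(np.eye(3), p)
```

Here `values` is non-increasing in the order, `merged_values` coincides with `values`, and `shannon_value` upper-bounds the order-1 entry of `values`.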
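The role of $\alpha$. Def. 1 establishes a family of entropies indexed by a non-negative parameter $\alpha$, which determines the relative importance of rare elements versus common ones, where rarity is quantified by $\frac{1}{\mathbf{K}\mathbf{p}}$. In particular, $\mathbb{H}^{\mathbf{K}}_{0}[\mathbb{P}] = \log \left\langle \mathbf{p}, \frac{\mathbf{1}}{\mathbf{K}\mathbf{p}} \right\rangle$. When $\mathbf{K} = \mathbf{I}$, $\mathbb{H}^{\mathbf{I}}_{0}[\mathbb{P}] = \log |\operatorname{supp}(\mathbb{P})|$, which values rare and common species equally, while $\mathbb{H}^{\mathbf{K}}_{\infty}[\mathbb{P}] = -\log \max_{i \in \operatorname{supp}(\mathbf{p})} (\mathbf{K}\mathbf{p})_i$ only considers the most common elements. Thus, in principle, the problem of finding a maximum entropy distribution depends on the choice of $\alpha$.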
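Theorem 1 (Leinster and Meckes, 2016). Let $(\mathcal{X}, \kappa)$ be a similarity space. There exists a probability distribution $\mathbb{P}^*_{\mathcal{X}}$ that maximizes $\mathbb{H}^{\mathbf{K}}_{\alpha}[\cdot]$ for all $\alpha \in \mathbb{R}_{\geq 0}$, simultaneously. Moreover, $\mathbb{H}^*_{\mathcal{X}} \triangleq \sup_{\mathbb{P}} \mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}]$ does not depend on $\alpha$.
Remarkably, Thm. 1 shows that the maximum entropy distribution is independent of $\alpha$ and thus, the maximum value of the GAIT entropy is an intrinsic property of the space: this quantity is a geometric invariant. In fact, if $\kappa(x, y) \triangleq e^{-d(x, y)}$ for a metric $d$ on $\mathcal{X}$, there exist deep connections between $\mathbb{H}^*_{\mathcal{X}}$ and the magnitude of the metric space $(\mathcal{X}, d)$ (Leinster, 2013).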
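Theorem 2 (Leinster and Meckes, 2016). Let $\mathbb{P}$ be a distribution on a similarity space $(\mathcal{X}, \kappa)$. $\mathbb{H}^{\mathbf{K}}_{\alpha}[\mathbb{P}]$ is independent of $\alpha$ if and only if $\mathbf{K}\mathbb{P}(x) = \mathbf{K}\mathbb{P}(y)$ for all $x, y \in \operatorname{supp}(\mathbb{P})$.
Recall the behavior of the similarity profile observed for $\mathbf{p}^*$ in Fig. 2. Thm. 2 indicates that this is not a coincidence: inducing a similarity profile which is constant over the support of a distribution is a necessary condition for being a maximum entropy distribution. In the setting $\alpha = 1$ and $\mathbf{K} = \mathbf{I}$, the condition $\mathbf{K}\mathbf{p} = \mathbf{p} = \lambda \mathbf{1}$ for some $\lambda$ is equivalent to the well-known fact that the uniform distribution maximizes Shannon entropy.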
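The constant-profile condition suggests a simple way to look for a maximum entropy distribution in well-behaved cases. The sketch below is not the general algorithm of Leinster and Meckes (2016): it assumes that the Gram matrix is positive definite and that the solution of $\mathbf{K}\mathbf{w} = \mathbf{1}$ is entrywise non-negative, and it raises an error otherwise. The three-point Gram matrix uses illustrative similarities and is not a reproduction of the edge weights in Fig. 2.

```python
import numpy as np

def max_entropy_distribution(K):
    """Full-support candidate with a constant similarity profile.

    Assumption: K is positive definite and the solution w of K w = 1 is
    entrywise non-negative. Other cases are outside this sketch.
    """
    w = np.linalg.solve(K, np.ones(K.shape[0]))
    if np.any(w < 0):
        raise ValueError("negative weights: this sketch does not apply")
    return w / w.sum()

# Illustrative values: A and B are very similar, C is far from both.
K = np.array([[1.0, 0.7, 0.1],
              [0.7, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
p_star = max_entropy_distribution(K)
profile = K @ p_star                         # constant across A, B and C
uniform = np.full(3, 1.0 / 3.0)

entropy_star = -np.sum(p_star * np.log(profile))
entropy_uniform = -np.sum(uniform * np.log(K @ uniform))
```

For these illustrative similarities, `p_star` places more mass on C than on A or B, and its order-1 entropy exceeds that of the uniform distribution.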
2.2 Concavity of $\mathbb{H}^{\mathbf{K}}_{1}[\cdot]$
A common interpretation of the entropy of a probability distribution is that of the amount of uncertainty in the values/qualities of the associated random variable. From this point of view, the concavity of the entropy function is a rather intuitive and desirable property: “entropy should increase under averaging”.
Consider the case $\mathbf{K} = \mathbf{I}$. $\mathbb{H}^{\mathbf{I}}_{\alpha}[\cdot]$ reduces to the Rényi entropy of order $\alpha$. For general values of $\alpha$, this is not a concave function, but rather only Schur-concave (Ho and Verdú, 2015). However, $\mathbb{H}^{\mathbf{I}}_{1}[\cdot]$ coincides with the Shannon entropy, which is a strictly concave function. Since the subsequent theoretical developments make extensive use of the concavity of the entropy, we restrict our attention to the case $\alpha = 1$ for the rest of the paper.
To the best of our knowledge, whether the entropy $\mathbb{H}^{\mathbf{K}}_{1}[\mathbb{P}]$ is a (strictly) concave function of $\mathbb{P}$ for a general similarity kernel $\mathbf{K}$ is currently an open problem. Although a proof of this result has remained elusive to us, we believe there are strong indicators, both empirical and theoretical, pointing towards a positive answer. We formalize these beliefs in the following conjecture:
Conjecture 1. Let $(\mathcal{X}, \kappa)$ be a finite similarity space with Gram matrix $\mathbf{K}$. If $\mathbf{K}$ is positive definite and $\kappa$ satisfies the multiplicative triangle inequality, then $\mathbb{H}^{\mathbf{K}}_{1}[\cdot]$ is strictly concave in the interior of $\Delta_{|\mathcal{X}|}$.
Fig. 4 shows the relationship between the linear approximation of the entropy and the value of the entropy along the segment of convex combinations between two measures. This behavior is consistent with our hypothesis on the concavity of $\mathbb{H}^{\mathbf{K}}_{1}[\cdot]$.
We emphasize that the presence of the term $\log(\mathbf{K}\mathbf{p})$ complicates the analysis, as it is incompatible with most linear algebra-based proof techniques, and it renders most information theory-based bounds too loose, as we explain in App. C. Nevertheless, we provide extensive numerical experiments in App. C which support our conjecture. In the remainder of this work, claims dependent on this conjecture are labelled ♣.
2.3 Comparing probability distributions
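The previous conjecture implies that $-\mathbb{H}^{\mathbf{K}}_{1}[\cdot]$ is a strictly convex function. This naturally suggests considering the Bregman divergence induced by the negative GAIT entropy. This is analogous to the construction of the Kullback-Leibler divergence as the Bregman divergence induced by the negative Shannon entropy.
Straightforward computation shows that the gap between the negative GAIT entropy at $\mathbf{p}$ and its linear approximation around $\mathbf{q}$ evaluated at $\mathbf{p}$ is:
$$-\mathbb{H}^{\mathbf{K}}_{1}[\mathbf{p}] - \left[-\mathbb{H}^{\mathbf{K}}_{1}[\mathbf{q}] + \left\langle -\nabla_{\mathbf{q}} \mathbb{H}^{\mathbf{K}}_{1}[\mathbf{q}],\, \mathbf{p} - \mathbf{q} \right\rangle\right] = 1 + \left\langle \mathbf{p}, \log \frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle - \left\langle \mathbf{q}, \frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle. \tag{3}$$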
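The computation behind this gap can be spelled out in a few lines. The following is a sketch of one way to carry it out, assuming only that $\mathbf{K}$ is symmetric; the label $F(\mathbf{p}) \triangleq \langle \mathbf{p}, \log(\mathbf{K}\mathbf{p}) \rangle$ for the negative GAIT entropy of order 1 is introduced here for convenience.

```latex
\begin{align*}
\nabla F(\mathbf{q})
  &= \log(\mathbf{K}\mathbf{q}) + \mathbf{K}\left(\frac{\mathbf{q}}{\mathbf{K}\mathbf{q}}\right), \\
F(\mathbf{p}) - F(\mathbf{q}) - \langle \nabla F(\mathbf{q}), \mathbf{p} - \mathbf{q} \rangle
  &= \langle \mathbf{p}, \log(\mathbf{K}\mathbf{p}) \rangle
   - \langle \mathbf{q}, \log(\mathbf{K}\mathbf{q}) \rangle
   - \langle \log(\mathbf{K}\mathbf{q}), \mathbf{p} - \mathbf{q} \rangle
   - \left\langle \frac{\mathbf{q}}{\mathbf{K}\mathbf{q}}, \mathbf{K}(\mathbf{p} - \mathbf{q}) \right\rangle \\
  &= \left\langle \mathbf{p}, \log\frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle
   - \left\langle \mathbf{q}, \frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle
   + \left\langle \frac{\mathbf{q}}{\mathbf{K}\mathbf{q}}, \mathbf{K}\mathbf{q} \right\rangle \\
  &= 1 + \left\langle \mathbf{p}, \log\frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle
   - \left\langle \mathbf{q}, \frac{\mathbf{K}\mathbf{p}}{\mathbf{K}\mathbf{q}} \right\rangle .
\end{align*}
```

The last step uses $\langle \mathbf{q}/\mathbf{K}\mathbf{q}, \mathbf{K}\mathbf{q} \rangle = \sum_i q_i = 1$, and the symmetry of $\mathbf{K}$ is what allows moving it across the inner product in the gradient term.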
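Definition 2 (GAIT Divergence)♣. The GAIT divergence between distributions $\mathbb{P}$ and $\mathbb{Q}$ on a finite similarity space $(\mathcal{X}, \kappa)$ is given by:
$$\mathbb{D}^{\mathbf{K}}[\mathbb{P} \,\|\, \mathbb{Q}] \triangleq 1 + \mathbb{E}_{x \sim \mathbb{P}}\left[\log \frac{\mathbf{K}\mathbb{P}(x)}{\mathbf{K}\mathbb{Q}(x)}\right] - \mathbb{E}_{y \sim \mathbb{Q}}\left[\frac{\mathbf{K}\mathbb{P}(y)}{\mathbf{K}\mathbb{Q}(y)}\right]. \tag{4}$$
When $\mathbf{K} = \mathbf{I}$, the GAIT divergence reduces to the Kullback-Leibler divergence. Compared to the family of $f$-divergences (Csiszár and Shields, 2004), this definition computes point-wise ratios between the similarity profiles $\mathbf{K}\mathbb{P}$ and $\mathbf{K}\mathbb{Q}$ rather than the probability masses (or more generally, Radon-Nikodym derivatives w.r.t. a reference measure). We highlight that $\mathbf{K}\mathbb{P}(x)$ provides a global view of the space via the Gram matrix from the perspective of $x \in \mathcal{X}$. Additionally, the GAIT divergence by definition inherits all the properties of Bregman divergences. In particular, $\mathbb{D}^{\mathbf{K}}[\mathbb{P} \,\|\, \mathbb{Q}]$ is convex in $\mathbb{P}$.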
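A minimal NumPy sketch of the divergence on a finite space follows, written directly from the Bregman construction of the negative order-1 entropy. The function name `gait_divergence` and the example values are illustrative. The second example evaluates the divergence between two point masses with disjoint supports, where the Kullback-Leibler divergence would be infinite.

```python
import numpy as np

def gait_divergence(K, p, q):
    """1 + <p, log(Kp / Kq)> - <q, Kp / Kq> for a symmetric Gram matrix K."""
    ratio = (K @ p) / (K @ q)
    return float(1.0 + np.sum(p * np.log(ratio)) - np.sum(q * ratio))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])

# With K = I the expression collapses to the Kullback-Leibler divergence.
kl = float(np.sum(p * np.log(p / q)))
d_identity = gait_divergence(np.eye(3), p, q)

# Two point masses at distance r with similarity exp(-r): disjoint supports.
r = 1.5
K_r = np.array([[1.0, np.exp(-r)], [np.exp(-r), 1.0]])
d_disjoint = gait_divergence(K_r, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

In the disjoint case the value works out to $1 + r - e^{-r}$, which is finite and shrinks smoothly to zero as the two points approach each other.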
Forward and backward GAIT divergence. Like the Kullback-Leibler divergence, the GAIT divergence is not symmetric and different orderings of the arguments induce different behaviors. Let $\mathcal{Q} = \{\mathbb{Q}_\theta\}$ be a family of distributions in which we would like to find an approximation to $\mathbb{P} \notin \mathcal{Q}$. $\arg\min_{\theta} \mathbb{D}^{\mathbf{K}}[\mathbb{Q}_\theta \,\|\, \mathbb{P}]$ concentrates around one of the modes of $\mathbb{P}$; this behavior is known as mode seeking. On the other hand, $\arg\min_{\theta} \mathbb{D}^{\mathbf{K}}[\mathbb{P} \,\|\, \mathbb{Q}_\theta]$ induces a mass covering behavior. Fig. 4 displays this phenomenon when finding the best (single) Gaussian approximation to a mixture of Gaussians.
Empirical distributions. Although we have developed our divergence in the setting of distributions over a finite similarity space, we can effectively compare two empirical distributions over a continuous space. Note that if an arbitrary $x \in \mathcal{X}$ (or more generally a measurable set $E$ for a given choice of $\sigma$-algebra) has measure zero under both $\mathbb{P}$ and $\mathbb{Q}$, then such $x$ (or $E$) is irrelevant in the computation of $\mathbb{D}^{\mathbf{K}}[\mathbb{P} \,\|\, \mathbb{Q}]$. Therefore, when comparing empirical measures, the possibly continuous expectations involved in the extension of Eq. (2) to general measures reduce to finite sums over the corresponding supports.
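Concretely, let $(\mathcal{X}, \kappa)$ be a (possibly continuous) similarity space and consider the empirical distributions $\hat{\mathbb{P}} = \sum_{i=1}^{n} p_i \delta_{x_i}$ and $\hat{\mathbb{Q}} = \sum_{j=1}^{m} q_j \delta_{y_j}$ with $\mathbf{p} \in \Delta_n$ and $\mathbf{q} \in \Delta_m$. The Gram matrix of the restriction of $\kappa$ to $\mathcal{S} \triangleq \operatorname{supp}(\hat{\mathbb{P}}) \cup \operatorname{supp}(\hat{\mathbb{Q}})$ has the block structure $\begin{pmatrix} \mathbf{K}_{xx} & \mathbf{K}_{xy} \\ \mathbf{K}_{yx} & \mathbf{K}_{yy} \end{pmatrix}$, where $\mathbf{K}_{xx}$ is $n \times n$, $\mathbf{K}_{yy}$ is $m \times m$ and $\mathbf{K}_{xy} = \mathbf{K}_{yx}^{\top}$. It is easy to verify that
$$\mathbb{D}^{\mathbf{K}}[\hat{\mathbb{P}} \,\|\, \hat{\mathbb{Q}}] = 1 + \sum_{i=1}^{n} p_i \log \frac{(\mathbf{K}_{xx}\mathbf{p})_i}{(\mathbf{K}_{xy}\mathbf{q})_i} - \sum_{j=1}^{m} q_j \frac{(\mathbf{K}_{yx}\mathbf{p})_j}{(\mathbf{K}_{yy}\mathbf{q})_j}. \tag{5}$$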
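The block form can be sketched in NumPy as follows. The RBF kernel, its bandwidth parameter `sigma`, and the function names are assumptions made for illustration; any kernel function returning a matrix of pairwise similarities could be passed instead.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Pairwise RBF similarities between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def empirical_gait_divergence(X, p, Y, q, kernel=rbf):
    """Divergence between sum_i p_i delta_{X[i]} and sum_j q_j delta_{Y[j]}."""
    Kxx, Kxy = kernel(X, X), kernel(X, Y)
    Kyx, Kyy = kernel(Y, X), kernel(Y, Y)
    log_term = p @ np.log((Kxx @ p) / (Kxy @ q))   # expectation under P-hat
    ratio_term = q @ ((Kyx @ p) / (Kyy @ q))       # expectation under Q-hat
    return float(1.0 + log_term - ratio_term)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(4, 2))
p, q = np.full(5, 1 / 5), np.full(4, 1 / 4)
d_blocks = empirical_gait_divergence(X, p, Y, q)

# Cross-check against the finite-space formula on the union of the supports.
Z = np.vstack([X, Y])
K = rbf(Z, Z)
p_pad, q_pad = np.concatenate([p, np.zeros(4)]), np.concatenate([np.zeros(5), q])
ratio = (K @ p_pad) / (K @ q_pad)
d_dense = float(1.0 + p_pad @ np.log(ratio) - q_pad @ ratio)
```

Only four kernel blocks of total size $(n + m)^2$ are needed, and no iterative solver is involved. For very small bandwidths the products `Kxy @ q` can underflow, so a log-domain implementation may be preferable in practice.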
Computational complexity. The computation of Eq. (5) requires $\mathcal{O}(c_\kappa (n + m)^2)$ operations, where $c_\kappa$ represents the cost of a kernel evaluation. This exhibits a quadratic behavior in the size of the union of the supports, typical of kernel-based approaches (Li et al., 2017). We highlight that Eqs. (2) and (5) provide a quantitative assessment of the dissimilarity between $\mathbb{P}$ and $\mathbb{Q}$ via a closed-form expression. This is in sharp contrast to the multiple variants of optimal transport which require the solution of an optimization problem or the execution of several iterations of matrix scaling algorithms. Moreover, the cost of the proposals of Cuturi and Doucet (2014); Benamou et al. (2014) additionally grows with $L$, the number of Sinkhorn iterations, which is an increasing function of the desired optimization tolerance. A quantitative comparison is presented in App. G.
Weak topology. The type of topology induced by a divergence on the space of probability measures plays an important role in the context of training neural generative models. Several studies (Arjovsky et al., 2017; Genevay et al., 2018; Salimans et al., 2018) have exhibited how divergences which induce a weak topology constitute learning signals with useful gradients. In App. A, we provide an example in which the GAIT divergence can provide a smooth training signal despite being evaluated on distributions with disjoint supports.
2.4 Mutual Information
We now use the GAIT entropy to define similarity-sensitive generalizations of standard concepts related to mutual information. As before, we restrict our attention to $\alpha = 1$. This is required to get the chain rule of conditional probability for the Rényi entropy and to use Conj. 1. Finally, we note that although one could use the GAIT divergence to define a mutual information, in a fashion analogous to how traditional mutual information is defined via the KL divergence, the resulting object is challenging to study theoretically. Instead, we use a definition based on entropy, which is equivalent in spaces without similarity structure.
Definition 3. Let $X$, $Y$, $Z$ be random variables taking values on the similarity spaces $(\mathcal{X}, \kappa)$, $(\mathcal{Y}, \lambda)$, $(\mathcal{Z}, \theta)$ with corresponding Gram matrices $\mathbf{K}$, $\mathbf{\Lambda}$, $\mathbf{\Theta}$. Let $[\kappa \otimes \lambda]((x, y), (x', y')) \triangleq \kappa(x, x')\,\lambda(y, y')$, so that $\mathbb{E}_{(x', y') \sim \mathbb{P}}\left[\kappa(x, x')\,\lambda(y, y')\right]$ denotes the expected similarity between object $(x, y)$ and a random $\mathbb{P}$-distributed object. Let $\mathbb{P}$ be the joint distribution of $X$ and $Y$. Then the joint entropy, conditional entropy, mutual information and conditional mutual information are defined following the formulas in Table 1.
Note that the GAIT joint entropy is simply the entropy of the joint distribution with respect to the tensor product kernel. This immediately implies monotonicity in the kernels $\mathbf{K}$ and $\mathbf{\Lambda}$. Note also that the chain rule of conditional probability holds by definition.
Subject to these definitions, similarity-sensitive versions of a number of theorems analogous to standard results of information theory follow:
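Theorem 3. Let $X$, $Y$ be independent, then:
$$\mathbb{H}^{\mathbf{K},\mathbf{\Lambda}}[X, Y] = \mathbb{H}^{\mathbf{K}}[X] + \mathbb{H}^{\mathbf{\Lambda}}[Y].$$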
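The additivity under independence can be verified numerically. The sketch below assumes the entropy-based definitions described in the text: the joint entropy is the entropy of the joint distribution under the tensor product kernel, and the mutual information is taken as the sum of the marginal entropies minus the joint entropy. The function names and the example Gram matrices are illustrative.

```python
import numpy as np

def entropy(K, p):
    """Order-1 GAIT entropy; assumes every entry of p is positive."""
    return float(-p @ np.log(K @ p))

def joint_entropy(K, L, P):
    """Entropy of the joint table P (|X| x |Y|) under the product kernel.

    (K @ P @ L.T)[x, y] = sum_{x', y'} K[x, x'] * P[x', y'] * L[y, y'].
    """
    profile = K @ P @ L.T
    return float(-np.sum(P * np.log(profile)))

def mutual_information(K, L, P):
    """H[X] + H[Y] - H[X, Y], with marginals read off the joint table."""
    return entropy(K, P.sum(axis=1)) + entropy(L, P.sum(axis=0)) - joint_entropy(K, L, P)

K = np.array([[1.0, 0.6], [0.6, 1.0]])        # illustrative kernel on X
L = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.3],
              [0.1, 0.3, 1.0]])               # illustrative kernel on Y
px, py = np.array([0.3, 0.7]), np.array([0.2, 0.5, 0.3])

P_independent = np.outer(px, py)
mi_independent = mutual_information(K, L, P_independent)

# With identity kernels the quantity is the usual Shannon mutual information.
P_dependent = np.array([[0.4, 0.1], [0.1, 0.4]])
mi_shannon = mutual_information(np.eye(2), np.eye(2), P_dependent)
```

For an independent joint table the similarity profile factorizes into the product of the marginal profiles, which is why `mi_independent` vanishes up to rounding.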
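When the conditioning variables are perfectly identifiable ($\mathbf{\Lambda} = \mathbf{I}$), we recover a simple expression for the conditional entropy:
Theorem 4. For any kernel $\kappa$,
$$\mathbb{H}^{\mathbf{K},\mathbf{I}}[X \mid Y] = \mathbb{E}_{y \sim \mathbb{P}_Y}\left[\mathbb{H}^{\mathbf{K}}[X \mid Y = y]\right].$$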
Using Conj. 1, we are also able to prove that conditioning on additional information cannot increase entropy, as intuitively expected.
Theorem 5♣. For any similarity kernel $\kappa$,
[TABLE]
Theorem 5 is equivalent to Conj. 1 when considering a categorical mixing over distributions.
Finally, a form of the data processing inequality (DPI), a fundamental result in information theory governing the mutual information of variables in a Markov chain structure, follows from Conj. 1.
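Theorem 6 (Data Processing Inequality)♣. If $X \to Y \to Z$ is a Markov chain, then
$$\mathbb{I}^{\mathbf{K},\mathbf{\Theta}}[X; Z] \leq \mathbb{I}^{\mathbf{K},\mathbf{\Lambda}}[X; Y] + \mathbb{I}^{\mathbf{K},\mathbf{\Theta},\mathbf{\Lambda}}[X; Z \mid Y].$$
Note the presence of the additional term $\mathbb{I}^{\mathbf{K},\mathbf{\Theta},\mathbf{\Lambda}}[X; Z \mid Y]$ relative to the non-similarity-sensitive DPI given by $\mathbb{I}[X; Z] \leq \mathbb{I}[X; Y]$. Intuitively, this can be understood as reflecting that conditioning on $Y$ does not convey all of its usual “benefit”, as some information is lost due to the imperfect identifiability of elements in $\mathcal{Y}$. When $\mathbf{\Lambda} = \mathbf{I}$ this term is 0, and the original DPI is recovered.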
3 Related work
Theories of Information. Information theory is ubiquitous in modern machine learning: from variable selection via information gain in decision trees (Ben-David and Shalev-Shwartz, 2014), to using entropy as a regularizer in reinforcement learning (Fox et al., 2016), to rate-distortion theory for training generative models (Alemi et al., 2018). To the best of our knowledge, the work of Leinster and Cobbold (2012); Leinster and Meckes (2016) is the first formal treatment of information-theoretic concepts in spaces with non-trivial geometry, albeit in the context of ecology.
Comparing distributions. The ability to compare probability distributions is at the core of statistics and machine learning. Although traditionally dominated by maximum likelihood estimation, a significant portion of research on parameter estimation has shifted towards methods based on optimal transport, such as the Wasserstein distance (Villani, 2008). Two main reasons for this transition are (i) the need to deal with degenerate distributions (which might have density only over a low dimensional manifold) as is the case in the training of generative models (Goodfellow et al., 2014; Arjovsky et al., 2017; Salimans et al., 2018); and (ii) the development of alternative formulations and relaxations of the original optimal transport objective which make it feasible to approximately compute in practice (Cuturi and Doucet, 2014; Genevay et al., 2018).
Relation to kernel theory. The theory we have presented in this paper revolves around a notion of similarity $\kappa$ on $\mathcal{X}$. The map $\mathbb{P} \mapsto \mathbf{K}\mathbb{P}$ corresponds to the embedding of the space of distributions on $\mathcal{X}$ into a reproducing kernel Hilbert space used for comparing distributions without the need for density estimation (Smola et al., 2007). In particular, a key concept in this work is that of a characteristic kernel, i.e., a kernel for which the embedding is injective. Note that this condition is equivalent to the positive definiteness of the Gram matrix $\mathbf{K}$ imposed above. Under these circumstances, the metric structure present in the Hilbert space can be imported to define the Maximum Mean Discrepancy distance between distributions (Gretton et al., 2012). Our definition of divergence also makes use of the object $\mathbf{K}\mathbb{P}$, but has motivations rooted in information theory rather than functional analysis. We believe that the framework proposed in this paper has the potential to foster connections between both fields.
4 Experiments
4.1 Comparison to Optimal Transport
Image barycenters. Given a collection of measures $\mathcal{P} = \{\mathbb{P}_i\}_{i=1}^{n}$ on a similarity space, we define the barycenter of $\mathcal{P}$ with respect to the GAIT divergence as the measure that minimizes the average GAIT divergence to the elements of $\mathcal{P}$. This is inspired by the work of Cuturi and Doucet (2014) on Wasserstein barycenters. Let the space $\mathcal{X}$ denote the pixel grid of an image of size $28 \times 28$. We consider each image in the MNIST dataset as an empirical measure over this grid in which the probability of location $(x, y)$ is proportional to the intensity at the corresponding pixel. In other words, image $i$ is considered as a measure $\mathbb{P}_i \in \Delta_{|\mathcal{X}|}$. Note that in this case the kernel is a function of the distance between two pixels in the grid (two elements of $\mathcal{X}$), rather than the distance between two different images. We use a Gaussian kernel, and compute $\mathbf{K}\mathbb{P}_i$ by convolving the image with an adequate filter, as proposed by Solomon et al. (2015).
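To illustrate why the similarity profile of an image is cheap to compute, the sketch below exploits the fact that a Gaussian kernel in squared Euclidean pixel distance factorizes across rows and columns. It is a separable matrix formulation assumed here for illustration, not the authors' convolution code; the function names and the bandwidth `sigma` are hypothetical.

```python
import numpy as np

def gaussian_gram_1d(n, sigma):
    """Gaussian similarities between the n positions of a single image axis."""
    idx = np.arange(n)
    return np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2))

def similarity_profile_image(img, sigma):
    """Similarity profile of an image viewed as a measure on its pixel grid.

    The kernel exp(-((i - i')^2 + (j - j')^2) / (2 sigma^2)) is a product of
    a row factor and a column factor, so the full Gram matrix is never built.
    """
    P = img / img.sum()                        # normalize intensities to a measure
    G_rows = gaussian_gram_1d(img.shape[0], sigma)
    G_cols = gaussian_gram_1d(img.shape[1], sigma)
    return G_rows @ P @ G_cols.T

rng = np.random.default_rng(0)
img = rng.random((6, 5)) + 0.1                 # small synthetic "image"
profile = similarity_profile_image(img, sigma=1.5)

# Cross-check against the dense Gram matrix on the flattened grid.
K_dense = np.kron(gaussian_gram_1d(6, 1.5), gaussian_gram_1d(5, 1.5))
profile_dense = (K_dense @ (img / img.sum()).ravel()).reshape(6, 5)
```

The separable form costs two small matrix products per image instead of one product with a Gram matrix whose side is the number of pixels.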
Fig. 9 shows the result of gradient-based optimization to find barycenters for each of the classes in MNIST (LeCun et al., 1998) along with the corresponding results using the method of Cuturi and Doucet (2014). We note that our method achieves results of comparable quality. Remarkably, the time for computing the barycenter for each class on a single CPU is reduced from 90 seconds using the efficient method proposed by Cuturi and Doucet (2014); Benamou et al. (2014) (implemented using a convolutional kernel (Solomon et al., 2015)) to less than 5 seconds using our divergence. Further experiments can be found in App. D.
Generative models. The GAIT divergence can also be used as an objective for training generative models. We illustrate the results of using our divergence with an RBF kernel to learn generative models in Fig. 5 on a toy Swiss roll dataset, in addition to the MNIST (LeCun et al., 1998) and Fashion-MNIST (Xiao et al., 2017) datasets. For all three datasets, we consider a 2D latent space and replicate the experimental setup used by Genevay et al. (2018) for MNIST. We were able to use the same multilayer perceptron architecture and optimization hyperparameters for all three datasets, requiring only the tuning of the kernel variance to match the scale of the Swiss roll data.
Moreover, we do not need large batch sizes to get good quality generations from our models. The quality of the samples we obtain with small batch sizes is comparable to that of the samples obtained by Genevay et al. (2018) with larger batches. We include additional experimental details and results in App. F, along with comparisons to variational auto-encoders (Kingma and Welling, 2014).
4.2 Approximating measures
Our method allows us to find a finitely-supported approximation $\mathbb{Q}$ to a (discrete or continuous) target distribution $\mathbb{P}$. This is achieved by minimizing the divergence between them with respect to the locations and/or the masses of the atoms in the approximating measure. In this section, we consider situations where $\operatorname{supp}(\mathbb{P})$ is not a subset of the support of $\mathbb{Q}$. As a result, the Kullback-Leibler divergence (the case $\mathbf{K} = \mathbf{I}$) would be infinite and could not be minimized via gradient-based methods. However, the GAIT divergence can be minimized even in the case of non-overlapping supports since it takes into account similarities between items.
In Fig. 8, we show the results of such an approximation on data for the population of France in 2010 consisting of 36,318 datapoints (Charpentier, 2012), similar to the setting of Cuturi and Doucet (2014). The weight of each atom in the blue measure is proportional to the population it represents. We use an RBF kernel and an approximating measure consisting of 50 points with uniform weights, and use gradient-based optimization to minimize the GAIT divergence between the two measures with respect to the locations of the atoms of the approximating measure. We compare with K-means (Pedregosa et al., 2011) using identical initialization. Note that when using K-means, the resulting allocation of mass from points in the target measure to the nearest centroid can result in a highly unbalanced distribution, shown in the bar plot in orange. In contrast, our objective allows a uniformity constraint on the weight of the centroids, inducing a more homogeneous allocation. This is important in applications where an imbalanced allocation is undesirable, such as the placement of hospitals or schools.
Fig. 8 shows the approximation of the density of a mixture of Gaussians $\mathbb{P}$ by a uniform distribution $\mathbb{Q}$ over a finite set of atoms with a polynomial kernel of degree 1.5, similar to the approximate super-samples (Chen et al., 2010) task presented by Claici et al. (2018) using the Wasserstein distance. We minimize the GAIT divergence with respect to the locations of the atoms. We estimate the continuous expectations with respect to $\mathbb{P}$ by repeatedly sampling minibatches to construct an empirical measure $\hat{\mathbb{P}}$. Note how the solution is a “uniformly spaced” allocation of the atoms through the space, with the number of points in a given region being proportional to the mass of the region. See App. D for a comparison to Claici et al. (2018).
Finally, one can approximate a measure when the locations of the atoms are fixed. As an example, we take an article from the News Commentary Parallel Corpus (Tiedemann, 2012), using as a measure $\mathbb{P}$ the normalized TF-IDF weights of each non-stopword in the article. Here, $\mathbf{K}$ is given by an RBF kernel applied to the GloVe (Pennington et al., 2014) embeddings of each word. We optimize the masses of the approximating measure, applying a penalty to encourage sparsity. We show the result of this summarization in word-cloud format in Fig. 8. Note that compared to TF-IDF, which places most mass on a few unusual words, our method produces a summary that is more representative of the original text. This behavior can be modified by varying the bandwidth $\sigma$ of the kernel, producing approximately the same result as TF-IDF when $\sigma$ is very small; details are presented in App. D.3.
4.3 Measuring diversity and counting modes
As mentioned earlier, the exponential of the entropy provides a measure of the effective number of points in the space (Leinster, 2013). In Fig. 10, we use an empirical distribution $\hat{\mathbb{P}}$ to estimate the number of modes of a mixture of Gaussians. As the kernel bandwidth $\sigma$ increases, $\exp(\mathbb{H}^{\mathbf{K}_\sigma}_{1}[\hat{\mathbb{P}}])$ decreases, with a marked plateau around the number of mixture components. We highlight that the lack of direct consideration of the geometry of the space in the Shannon entropy renders it useless here: at any (non-trivial) scale, $\exp(\mathbb{H}[\hat{\mathbb{P}}])$ equals the number of samples, and not the number of classes. Our approach obtains results similar to those of (a form of) the birthday paradox-based method of Arora et al. (2018), while avoiding the need for human evaluation of possible duplicates. Details and tests on MNIST can be found in App. E.
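The plateau behavior can be reproduced on synthetic data. The sketch below draws samples from three tight, well-separated clusters and evaluates the exponential of the order-1 entropy of the empirical distribution under an RBF kernel at three bandwidths. The cluster layout, sample sizes, bandwidths and the function name `effective_number` are illustrative assumptions, not the settings used for Fig. 10.

```python
import numpy as np

def effective_number(X, sigma):
    """exp of the order-1 GAIT entropy of the uniform empirical measure on X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))   # RBF Gram matrix
    p = np.full(len(X), 1.0 / len(X))
    return float(np.exp(-p @ np.log(K @ p)))

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.concatenate([c + 0.05 * rng.standard_normal((50, 2)) for c in centers])

tiny_scale = effective_number(X, sigma=1e-4)   # every sample looks distinct
mode_scale = effective_number(X, sigma=1.0)    # samples merge within clusters
huge_scale = effective_number(X, sigma=1e3)    # everything looks identical
```

At the intermediate bandwidth the estimate sits close to the number of clusters, while the two extreme bandwidths approach the number of samples and one, respectively.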
5 Conclusions
In this paper, we advocate the use of geometry-aware information theory concepts in machine learning. We present the similarity-sensitive entropy of Leinster and Cobbold (2012) along with several important properties that connect it to fundamental notions in geometry. We then propose a divergence induced by this entropy, which compares probability distributions by taking into account the similarities among the objects on which they are defined. Our proposal shares the empirical performance properties of distances based on optimal transport theory, such as the Wasserstein distance (Villani, 2008), but enjoys a closed-form expression. This obviates the need to solve a linear program or use matrix scaling algorithms (Cuturi, 2013), reducing computation significantly. Finally, we also propose a similarity-sensitive version of mutual information based on the GAIT entropy. We hope these methods can prove fruitful in extending frameworks such as the information bottleneck for representation learning (Tishby and Zaslavsky, 2015), similarity-sensitive cross entropy objectives in the spirit of loss-calibrated decision theory (Lacoste-Julien et al., 2011), or the use of entropic regularization of policies in reinforcement learning (Fox et al., 2016).
Acknowledgments
This research was partially supported by the Canada CIFAR AI Chair Program and by a Google Focused Research award. Simon Lacoste-Julien is a CIFAR Associate Fellow in the Learning in Machines & Brains program. We thank Pablo Piantanida for the great tutorial on information theory which inspired this work, and Mark Meckes for remarks on terminology and properties of metric spaces of negative type.
Appendix A Revisiting parallel lines
Let , , and let be the distribution of , i.e., a (degenerate) uniform distribution on the segment , illustrated in Fig. 11.
Our goal is to find the right value of for a model distribution using the dissimilarity with respect to a target distribution as a learning signal. The behavior of common divergences on this type of problem was presented by Arjovsky et al. (2017) as a motivating example for the introduction of OT distances in the context of GANs.
[TABLE]
Note that among all these divergences, illustrated in Fig. 12, only the Wasserstein distance provides a continuous (even a.e. differentiable) objective on . We will now study the behavior of the GAIT divergence in this setting.
Recall that the action of the kernel on a given probability measure corresponds to the mean map , defined by . In particular, for :
[TABLE]
Let us endow with the Euclidean norm , and define the kernel . Note that this choice is made only for its mathematical convenience in the following algebraic manipulations; other choices of kernel are possible. In this case, the mean map reduces to:
[TABLE]
We obtain the following expressions for the terms appearing in the divergence:
[TABLE]
[TABLE]
Finally, we replace the previous expressions in the definition of the GAIT divergence. Remarkably, the result is a smooth function of the parameter with a global optimum at . See Fig. 12.
[TABLE]
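The parallel-lines computation can be checked numerically with a rough sketch. It assumes the closed form D[P||Q] = 1 + Σ_i p_i log((KP)_i / (KQ)_i) - Σ_i q_i (KP)_i / (KQ)_i for the GAIT divergence and, as a further assumption, the kernel exp(-||x - y||²). The discretization into `n` atoms per segment and all function names are hypothetical. Under these assumptions the discretized value works out to 1 + θ² - exp(-θ²), which is smooth in θ and vanishes only at θ = 0.

```python
import math

def gait_divergence(p, q, K):
    """Assumed closed form: 1 + sum_i p_i log((KP)_i/(KQ)_i) - sum_i q_i (KP)_i/(KQ)_i."""
    Kp = [sum(k * w for k, w in zip(row, p)) for row in K]
    Kq = [sum(k * w for k, w in zip(row, q)) for row in K]
    log_term = sum(pi * math.log(a / b) for pi, a, b in zip(p, Kp, Kq) if pi > 0)
    ratio_term = sum(qi * a / b for qi, a, b in zip(q, Kp, Kq))
    return 1.0 + log_term - ratio_term

def parallel_lines_divergence(theta, n=20):
    """Discretize the segments {0} x [0, 1] and {theta} x [0, 1] with n atoms each."""
    ys = [i / (n - 1) for i in range(n)]
    points = [(0.0, y) for y in ys] + [(theta, y) for y in ys]
    # Assumed kernel: exp(-squared Euclidean distance) between atoms.
    K = [[math.exp(-((ax - bx) ** 2 + (ay - by) ** 2)) for (bx, by) in points]
         for (ax, ay) in points]
    p = [1.0 / n] * n + [0.0] * n  # target: uniform mass on the first segment
    q = [0.0] * n + [1.0 / n] * n  # model: uniform mass on the shifted segment
    return gait_divergence(p, q, K)

values = {theta: parallel_lines_divergence(theta)
          for theta in (-1.0, -0.5, 0.0, 0.5, 1.0)}
```

Unlike the divergences that jump as soon as θ leaves zero, the values in `values` grow smoothly and symmetrically with |θ|, so they provide a usable learning signal.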
Appendix B Proofs
Theorem 3.
Let , be independent, then
Proof.
[TABLE]
∎
Theorem 4.
For any kernel , .
Proof.
[TABLE]
∎
Theorem 5.
♣ For any similarity kernel ,
Proof.
∎
Lemma 1.
(Chain Rule of Mutual Information)♣.
Proof.
By definition:
[TABLE]
Thus, ∎
Theorem 6.
(Data Processing Inequality)♣.
If is a Markov chain,
Proof.
[TABLE]
Therefore Finally, we have that , which in turn implies that ∎
Additionally, we are able to prove a series of inequalities illuminating the influence of the similarity matrix on joint entropy in extreme cases:
Theorem 7.
For any similarity kernels and ,
Proof.
The first result, follows by noting that for all :
[TABLE]
follows by monotonicity of the entropy in the similarity matrices.
follows by the chain rule of conditional entropy. ∎
Appendix C Verifying the concavity of
C.1 Proof attempts
We have made several attempts to show that the GAIT entropy is a concave function at . As this is a critical component in our theoretical developments, we provide a list of our previously unsuccessful approaches, in the hopes of facilitating the participation of interested researchers in answering this question.
- Jensen’s inequality for the or terms is too loose.
- The bound applied to the ratio results in a loose bound.
- is known to be a concave function. However, the action of the similarity matrix on the distribution inside the logarithmic factor in complicates the analysis.
- The Donsker-Varadhan representation of the Kullback-Leibler divergence goes in the wrong direction and adds extra terms.
- Bounding a Taylor series expansion of the gap between the linear approximation of an interpolation and the value of the entropy along the interpolation. The analysis is promising but becomes unwieldy due to the presence of terms.
C.2 Positive definiteness of the Hessian of the negative entropy
Straightforward computation based on the definition of the GAIT entropy leads to a remarkably simple form for the Hessian of the negative entropy.
Theorem 8.
[TABLE]
Moreover, is positive definite in the case.
Proving the conjecture is equivalent to proving positive definiteness of the matrix presented above. Furthermore, since we are interested in the behavior of the GAIT entropy operating on probability distributions, it is sufficient to consider the action of this matrix as a quadratic form on the set of mass-preserving vectors, whose entries add up to zero.
C.3 Numerical experiments
Random search on . We perform a search over vectors and drawn randomly from the simplex, and over random positive definite similarity Gram matrices . We have tried restricting our searches to and near the center of the simplex and away from the center, and to closer to or . In every experiment, we find that .
Consider the wide experimental setup for search defined in Tab. C.3. Fig. 13 shows the histogram of over this search, empirically showing the non-negativity of the divergence and thus the concavity of the GAIT entropy.
[TABLE]
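A small-scale version of this random search can be sketched as follows. It is not the configuration of Tab. C.3: the number of trials, the support sizes, the bandwidth range, and the use of Gram matrices built from an RBF kernel on random points in the unit square are all illustrative assumptions. The divergence is the assumed closed form D[P||Q] = 1 + Σ_i p_i log((KP)_i / (KQ)_i) - Σ_i q_i (KP)_i / (KQ)_i.

```python
import math
import random

def gait_divergence(p, q, K):
    """Assumed closed form: 1 + sum_i p_i log((KP)_i/(KQ)_i) - sum_i q_i (KP)_i/(KQ)_i."""
    Kp = [sum(k * w for k, w in zip(row, p)) for row in K]
    Kq = [sum(k * w for k, w in zip(row, q)) for row in K]
    log_term = sum(pi * math.log(a / b) for pi, a, b in zip(p, Kp, Kq) if pi > 0)
    ratio_term = sum(qi * a / b for qi, a, b in zip(q, Kp, Kq))
    return 1.0 + log_term - ratio_term

def random_simplex_point(n, rng):
    """Uniform draw from the simplex via normalized exponential variables."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]

def random_rbf_gram(n, dim, sigma, rng):
    """Gram matrix of an RBF kernel on n random points in the unit cube."""
    pts = [[rng.random() for _ in range(dim)] for _ in range(n)]
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * sigma ** 2))
             for y in pts] for x in pts]

rng = random.Random(0)  # fixed seed so the search is reproducible
smallest = float("inf")
for _ in range(200):
    n = rng.randint(2, 8)
    K = random_rbf_gram(n, dim=2, sigma=rng.uniform(0.1, 2.0), rng=rng)
    p = random_simplex_point(n, rng)
    q = random_simplex_point(n, rng)
    smallest = min(smallest, gait_divergence(p, q, K))
```

A negative value of `smallest` beyond floating-point tolerance would be a counterexample to the conjecture. A search of this kind can only fail to find one; it does not prove concavity.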
Random search on .
We empirically study the positive definiteness of this matrix via its spectrum. For this, we sample a set of points in as well as a (discrete) distribution over those points. Then we construct the Gram matrix induced by the kernel . The location of the points, , , and are sampled randomly.
We performed extensive experiments under this setting and never encountered an instance such that would have any negative eigenvalues. We believe this experimental setting is more holistic than the above experiments since it considers the whole spectrum of the (negative) Hessian rather than a “directional derivative” towards another sampled distribution .
Optimization. As an alternative to random search, we also use gradient-based optimization on , and to minimize . Starting from random initializations, our objective function always converges to values very close to (yet above) zero.
Furthermore, freezing and optimizing over either or while holding the other fixed, results in at convergence. On the other hand, if and are fixed such that , optimization over converges to . We note from the definition of the GAIT divergence that when or , , which matches the value we obtain at convergence when trying to minimize this quantity.
Recall that the experiments presented in Sec. 4 involve the minimization of some GAIT divergence. We never encountered a negative value for the GAIT divergence during any of these experiments.
C.4 Finding maximum entropy distributions with gradient ascent
An algorithm with an exponential run-time to find exact maximizers of the entropy is presented in Leinster and Meckes (2016). We exploit the fact that the objective is amenable to gradient-based optimization techniques and conduct experiments in spaces with thousands of elements. This also serves as an empirical test for the conjecture about the concavity of the function: there must be a unique maximizer for if it is concave.
We test our ability to find distributions with maximum GAIT entropy via gradient ascent. We sample 1000 points in dimensions 5 and 10, and construct a similarity space using an RBF kernel with . Then we perform 100 trials by setting the logits of the initialization using a Gaussian distribution with variance 4 for each of the 1000 logits that describe our distribution. We use Adam with learning rate 0.1 and . The optimization results are shown in Fig. 14. We reliably obtain negligible variance in the objective value at convergence across random initializations, thus providing an efficient alternative for finding approximate maximum-entropy distributions.
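The procedure can be sketched on a toy space. This is a simplified stand-in for the experiment above, not a reproduction of it: it uses plain gradient ascent instead of Adam, three points on a line instead of 1000 points, the kernel exp(-|x - y|), and hand-picked initial logits. It also assumes the order-1 entropy H = -Σ_i p_i log (Kp)_i. The distribution is parameterized by a softmax over logits, as in the paper's experiments.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def entropy_and_grad(p, K):
    """H(p) = -sum_i p_i log (Kp)_i and its gradient in p (K assumed symmetric)."""
    n = len(p)
    Kp = [sum(K[i][j] * p[j] for j in range(n)) for i in range(n)]
    H = -sum(p[i] * math.log(Kp[i]) for i in range(n))
    ratio = [p[i] / Kp[i] for i in range(n)]
    grad = [-math.log(Kp[j]) - sum(K[j][i] * ratio[i] for i in range(n))
            for j in range(n)]
    return H, grad

def maximize_entropy(K, logits, lr=0.2, steps=10000):
    """Plain gradient ascent on the logits of a softmax-parameterized distribution."""
    z = list(logits)
    for _ in range(steps):
        p = softmax(z)
        _, g = entropy_and_grad(p, K)
        g_bar = sum(pi * gi for pi, gi in zip(p, g))
        # Chain rule through the softmax: dH/dz_j = p_j * (g_j - sum_k p_k g_k).
        z = [zj + lr * pj * (gj - g_bar) for zj, pj, gj in zip(z, p, g)]
    return entropy_and_grad(softmax(z), K)[0]

points = [0.0, 1.0, 5.0]  # hypothetical toy space
K = [[math.exp(-abs(x - y)) for y in points] for x in points]
initial_logits = ([0.0, 0.0, 0.0], [1.0, -1.0, 0.0], [-1.0, 0.5, 1.0])
finals = [maximize_entropy(K, init) for init in initial_logits]
```

If the entropy is concave, every entry of `finals` should agree up to optimization error, which is the small-scale analogue of the negligible variance across initializations reported above.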
Appendix D Interpolation and Approximation
In all experiments for Figs. 8-9, we minimize the GAIT divergence using AMSGrad (Reddi et al., 2018) in PyTorch (Paszke et al., 2019). We parameterize the weights of empirical distributions using a softmax function on a vector of temperature-scaled logits. All experiments in this section are run on a single CPU.
D.1 Approximating measures with finite samples
In Fig. 8 we optimize our approximating measure using Adam for 3000 steps with a learning rate of and minibatches constructed by sampling 50 examples at each step. We use a Gaussian kernel with .
In Fig. 8, we approximate a continuous measure with an empirical measure supported on 200 atoms. We execute Adam for 500 steps using a learning rate of and minibatches of 100 samples from the continuous measure to estimate the discrepancy. The similarity function is given by a polynomial kernel with exponent 1.5: . Fig. 15 shows that we achieve results of comparable quality to those of Claici et al. (2018).
D.2 Image barycenters
We compute barycenters for each class of MNIST and Fashion-MNIST. We perform gradient descent with Adam using a learning rate of with minibatches of size 32 for 500 optimization steps. We use a Gaussian kernel with . The geometry of the grid on which images are defined is given by the Euclidean distance between the coordinates of the pixels. In Fig. 16, we provide barycenters for each of the classes of Fashion-MNIST computed via a combination of the methods of Benamou et al. (2014) and Cuturi and Doucet (2014).
D.3 Text summarization
For our text example, we use the article from the STAT-MT parallel news corpus titled “Why Wait for the Euro?”, by Leszek Balcerowicz. The full text of the article can be found at https://pastebin.com/CnBgbpsJ. We use the 300-dimensional GloVe vectors found at http://nlp.stanford.edu/data/glove.6B.zip as word embeddings. TF-IDF is calculated over the entire English portion of the parallel news corpus using the implementation in Scikit-Learn (Pedregosa et al., 2011). We filter stopwords based on the list provided by the Natural Language Toolkit (Bird et al., 2009). To encourage sparsity in the approximating measure , we add the -norm of to the divergence loss, weighted by a factor of . We optimize the loss with gradient descent using the Adam optimizer, with hyperparameters , for 25,000 iterations. Since a truly sparse is not reachable using the softmax function and gradient descent, we set all entries to be 0 and renormalize after the end of training. is represented by the softmax function, and is initialized uniformly.
We examine the influence of varying in Fig. 17. Decreasing leads to approaching , and the resulting similarity more closely approximates the original measure. As approaches 0.01, the two measures become almost identical. See Fig. 17, bottom-left and bottom-right.
Appendix E GAN evaluation and mode counting
When the data available takes the form of many i.i.d. samples from a continuous distribution, a natural choice is to generate a Gram matrix using a similarity measure such as an RBF kernel .
For comparison, we adapt the birthday paradox-based approach of Arora et al. (2018). Strictly speaking, their method requires human evaluation of possible duplicates, and is thus not comparable to our approach. As such, we propose an automated version using the same assumptions. We define and as colliding when , and note that the expected number of collisions for a distribution with support in a sample of size is . We can thus estimate . When varying , we observe behavior very similar to that of our entropy measure, with a plateau at in our example of a mixture of Gaussians. The results of this comparison are presented in Fig. 10.
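The automated birthday estimate can be sketched as below. The sketch assumes the standard birthday-paradox count, in which M samples drawn uniformly from N items produce M(M-1)/(2N) colliding pairs in expectation, so that N is estimated by M(M-1)/(2C) for C observed collisions. The 1-D toy data, the distance threshold, and the function name are hypothetical.

```python
def birthday_mode_estimate(samples, threshold):
    """Estimate the support size from pairwise collisions: N ~ M(M-1) / (2 C)."""
    m = len(samples)
    collisions = sum(
        1
        for i in range(m)
        for j in range(i + 1, m)
        if abs(samples[i] - samples[j]) < threshold  # a pair "collides" when close
    )
    if collisions == 0:
        return float("inf")  # no collisions observed: the estimate is unbounded
    return m * (m - 1) / (2 * collisions)

# Three tight 1-D clusters with 50 samples each (hypothetical toy data).
samples = [center + 1e-3 * k for center in (0.0, 10.0, 20.0) for k in range(50)]
estimate = birthday_mode_estimate(samples, threshold=1.0)
```

With 50 samples in each of 3 clusters the estimate is 149/49, slightly above 3. The bias shrinks as the number of samples per cluster grows, and the estimate changes with the threshold in the same way the entropy-based count changes with the kernel bandwidth.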
To test this on a more challenging dataset, we use a 2-dimensional representation for MNIST obtained using UMAP (McInnes et al., 2018), shown in Fig. 18. Although our method no longer shows a clear plateau at , it does transition from exponential to linear decay at approximately this point, which coincides with the point of minimum curvature with respect to , . Similar behavior is observed for the birthday-inspired estimate; here the point of minimum curvature has .
Finally, we also apply this method to evaluating the diversity of GAN samples. We train a simple WGAN (Arjovsky et al., 2017) on MNIST, and find that the assessed entropy increases steadily as training progresses and the generator masters more modes (see Fig. 19). Note that the entropy estimate stabilizes once the generator begins to produce all 10 digits, but long before sample quality ceases improving.
In all of the experiments corresponding to mode counting, we use and the standard RBF kernel . Note that this differs from the kernel given in Section 2 by using squared Euclidean distance rather than Euclidean distance. To estimate the point with minimum curvature, we find the value of or at 100 values of or evenly spaced between and , and empirically estimate the second derivative with respect to or . In the case of the birthday estimate, which is not continuous on finite sample sizes, we use a Savitzky-Golay filter (Savitzky, 1964) of degree 3 and window size 11 to smooth the derivatives. We estimate the point of minimum curvature to be the first point when the absolute second derivative passes below .
To evaluate GANs, we train a simple WGAN-GP (Gulrajani et al., 2017) with a 3-hidden-layer fully-connected generator, using the ReLU nonlinearity and 256 units in each hidden layer, on a TITAN Xp GPU. Our latent space has 32 dimensions sampled i.i.d. from and the discriminator is trained for four iterations for each generator update. We use Adam with learning rate and , . The weight of the gradient penalty in the WGAN-GP objective is set to .
To count the number of modes in the output of the generator, we use an instance of UMAP fitted to the entire training set of MNIST to embed all input in . We use 1,000 samples of true MNIST data to estimate values of (for our entropy method) and for the birthday paradox-based method that minimize curvature and yield estimates of and . We then apply these methods to the output of the generator after each of the first 30 epochs, and report the resulting or .
Appendix F Generative models
For all the generative models in Section 4.1, we employ an experimental setup similar to the setup used by Genevay et al. (2018) for learning generative models on MNIST. Thus, our generative model is a -layer multilayer perceptron with one hidden layer of 500 dimensions with ReLU non-linearities, using a D latent space, trained using mini-batches of size . Note that their method requires a batch size of to get reasonable generations, but we also obtain comparable results with a significantly smaller batch size of . Since Genevay et al. (2018) sample latent codes from a unit square, we do the same for MNIST here for easy comparison but sample from a standard Gaussian for Swiss roll and Fashion-MNIST datasets. We train our models by minimizing , where is the target empirical measure and is the model. is the Gram matrix corresponding to an RBF kernel with for Swiss roll data, and for MNIST and Fashion-MNIST. We use Adam with a learning rate of to train our models. Fig. 20 compares the manifolds learned by minimizing our divergence with batch sizes and with that learned by minimizing the Sinkhorn loss (Genevay et al., 2018) for MNIST.
We further compare our generations with those done by variational auto-encoders (Kingma and Welling, 2014). Following their setup, we use as the non-linearity in the -layer multilayer perceptron and a lower batch size of , along with the latent codes sampled from a standard Gaussian distribution. We compare our results with theirs in Fig. 21. Both figures are generated using latent codes obtained by taking the inverse c.d.f. of the Gaussian distribution at the corresponding grid locations, similar to the work of Kingma and Welling (2014).
Finally, in Fig. 22, we illustrate Fashion-MNIST and MNIST samples generated by our generative model with a D latent space. The quality of our generations with a D latent space is comparable to the samples generated by the variational auto-encoder with the same latent dimensions in Kingma and Welling (2014).
Appendix G Computational complexity
Solomon et al. (2015) show how the computation of can be efficiently performed using convolutions in the case of image-like data. For images, this takes time , instead of using a naive approach. Sinkhorn regularized optimal transport requires performing this computation , which highlights the value of the work of Solomon et al. (2015) for applications with large . The complexity for computing the closed-form GAIT divergence is thus , and the cost for approximately solving the optimal transport problem via Sinkhorn iterations is . We draw the attention of the reader to the distinction between the width of the image, and the size of the support of the measures, .
Fig. 23 compares the time required by the convolutional approaches of the GAIT divergence computation and the Sinkhorn algorithm approximating the Sinkhorn divergence, between two images of size . Genevay et al. (2018) found necessary to perform well on generative modeling. Even for the comparatively low values of presented in Fig. 23, we observe that the computation of the GAIT divergence is significantly faster than that of the approximate Sinkhorn divergence. It is possible to compute the GAIT divergence between two images of one megapixel in a quarter of a second (horizontal line).
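The saving from exploiting the grid structure can be illustrated with a short sketch. It assumes a Gaussian kernel on pixel coordinates, which factorizes into a product of two 1-D kernels, so the kernel can be applied as one pass over rows and one over columns. This is a simplified stand-in for the convolutional scheme of Solomon et al. (2015). The image, the bandwidth, and the function names are hypothetical. For an n × n image, the naive product below costs O(n⁴) operations and the separable version O(n³).

```python
import math

def gaussian_1d(n, sigma):
    """1-D factor of the kernel: g[i][k] = exp(-(i - k)^2 / (2 sigma^2))."""
    return [[math.exp(-((i - k) ** 2) / (2 * sigma ** 2)) for k in range(n)]
            for i in range(n)]

def apply_kernel_naive(img, sigma):
    """(KP)[i][j] = sum_{k,l} exp(-((i-k)^2 + (j-l)^2) / (2 sigma^2)) P[k][l]."""
    n = len(img)
    return [[sum(math.exp(-((i - k) ** 2 + (j - l) ** 2) / (2 * sigma ** 2)) * img[k][l]
                 for k in range(n) for l in range(n))
             for j in range(n)] for i in range(n)]

def apply_kernel_separable(img, sigma):
    """Same product, computed as a pass along rows followed by a pass along columns."""
    n = len(img)
    g = gaussian_1d(n, sigma)
    rows = [[sum(g[j][l] * img[k][l] for l in range(n)) for j in range(n)]
            for k in range(n)]
    return [[sum(g[i][k] * rows[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical 6 x 6 "image", normalized so that it is a probability distribution.
n = 6
raw = [[(i + 1) * (j + 2) for j in range(n)] for i in range(n)]
total = sum(sum(row) for row in raw)
img = [[v / total for v in row] for row in raw]

naive = apply_kernel_naive(img, sigma=1.5)
fast = apply_kernel_separable(img, sigma=1.5)
```

Both routines return the same values up to rounding. Since the GAIT divergence needs only a fixed number of such products while Sinkhorn iterations repeat them, this is where the timing gap comes from.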
References
- Alemi et al. (2018) A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy. Fixing a Broken ELBO. In ICML, 2018.
- Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
- Arora et al. (2018) S. Arora, A. Risteski, and Y. Zhang. Do GANs Learn the Distribution? Some Theory and Empirics. In ICLR, 2018.
- Ben-David and Shalev-Shwartz (2014) S. Ben-David and S. Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
- Benamou et al. (2014) J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman Projections for Regularized Transportation Problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2014.
- Bird et al. (2009) S. Bird, E. Loper, and E. Klein. Natural Language Processing with Python. O’Reilly Media Inc., 2009.
- Charpentier (2012) A. Charpentier. French dataset: population and GPS coordinates, 2012.
- Chen et al. (2010) Y. Chen, M. Welling, and A. Smola. Super-samples from Kernel Herding. In UAI, 2010.
