Information gains from Monte Carlo Markov Chains

Ahmad Mehrabi; A. Ahmadi

arXiv:1904.11920·astro-ph.CO·April 29, 2019

TL;DR

This paper introduces a new numerical method and Python package for efficiently estimating relative entropy and expected relative entropy from MCMC samples, aiding model comparison and experiment design in cosmology.

Contribution

The paper presents a novel approach and software tool for computing information-theoretic quantities from MCMC chains, addressing computational challenges in non-Gaussian models.

Findings

1. Relative error below 0.2% for sample sizes larger than 10^5 in the linear Gaussian model

2. The method is robust for estimating the expected relative entropy

3. Provides a practical tool for cosmological data analysis

Equations

$$P(\Theta|\mathcal{D}) = \frac{\mathcal{L}(\Theta;\mathcal{D})\,\pi(\Theta)}{E(\mathcal{D})}, \tag{1}$$

$$E(\mathcal{D}) = \int d\Theta\, \mathcal{L}(\Theta;\mathcal{D})\,\pi(\Theta). \tag{2}$$

$$D(P_{2}||P_{1}) \equiv \int d\Theta\, P_{2}(\Theta) \log\frac{P_{2}(\Theta)}{P_{1}(\Theta)}. \tag{3}$$

$$
\begin{aligned}
D(P_{2}||P_{1}) = {} & \frac{1}{2}(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) \\
& + \frac{1}{2}\left(\mathrm{tr}(\Sigma_{2}\Sigma_{1}^{-1}) - d - \log\frac{\det(\Sigma_{2})}{\det(\Sigma_{1})}\right).
\end{aligned} \tag{4}
$$

$$D(P_{2}||P_{1}) = -\log E(\mathcal{D}) + \int d\Theta\, P_{2}(\Theta) \log\mathcal{L}(\Theta). \tag{5}$$

$$\langle D\rangle = \int d\mathcal{D}^{\prime}\, P(\mathcal{D}^{\prime}|\mathcal{D})\, D(P_{2}||P_{1}), \tag{6}$$

$$P(\mathcal{D}^{\prime}|\mathcal{D}) = \int d\Theta\, P(\Theta|\mathcal{D})\, P(\mathcal{D}^{\prime}|\Theta). \tag{7}$$

$$S = D - \langle D\rangle, \tag{8}$$

$$S = \frac{1}{2}\left\{(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) - \mathrm{tr}(\mathbb{1} \pm \Sigma_{2}\Sigma_{1}^{-1})\right\}, \tag{9}$$

$$\langle D\rangle \approx \frac{1}{n}\sum_{i=1}^{n}\left\{\log P(\mathcal{D}_{2}^{i}|\Theta_{i}) - \log P(\mathcal{D}_{2}^{i}|\mathcal{D}_{1})\right\}, \tag{10}$$

$$P(\mathcal{D}_{2}^{i}|\mathcal{D}_{1}) \approx \frac{1}{n}\sum_{j=1}^{n} P(\mathcal{D}_{2}^{i}|\Theta_{j}). \tag{11}$$

$$\mathcal{L}(\Theta;\mathcal{D}) \sim e^{-\frac{1}{2}(\mathcal{D}-F(\Theta))^{T}\Sigma^{-1}(\mathcal{D}-F(\Theta))}, \tag{12}$$

$$F(\Theta) = F_{0} + M\Theta, \tag{13}$$

$$
\begin{aligned}
\Sigma_{2} &= (\Sigma_{1}^{-1} + M^{T}\mathcal{C}^{-1}M)^{-1}, \\
\Theta_{2} &= \Sigma_{2}\left(\Sigma_{1}^{-1}\Theta_{1} + M^{T}\mathcal{C}^{-1}(\mathcal{D}-F_{0})\right).
\end{aligned} \tag{14}
$$

Code & Models

Repositories

ahmadiphy/MCKLdivergence (official)

Taxonomy

Topics: Statistical Mechanics and Entropy · Gaussian Processes and Bayesian Inference · Markov Chains and Monte Carlo Methods

Full text

Ahmad Mehrabi

A. Ahmadi

Department of Physics, Bu-Ali Sina University, Hamedan, Iran

Abstract

In this paper, we present a novel method for computing the relative entropy and the expected relative entropy from an MCMC chain. The relative entropy, a quantity from information theory, quantifies the difference between the posterior distributions of a pair of experiments. In cosmology, it has been proposed as a tool for model selection, experiment design, forecasting, and measuring the information gain from successive experiments. Except for Gaussian distributions, these quantities are generally not available analytically, and one must estimate them numerically, which is computationally expensive. We propose a method, and provide a Python package, to estimate the relative entropy and the expected relative entropy from a posterior sample. We use the linear Gaussian model to check the accuracy of our code. Our results indicate that the relative error is below $0.2\%$ for sample sizes larger than $10^{5}$ in the linear Gaussian model. In addition, we study the robustness of our code in estimating the expected relative entropy in this model.

I Introduction

In contrast to a few decades ago, there are now a large number of probes in cosmology, which provide remarkable information about the content and evolution of the Universe. These data sets have been used extensively to study and constrain free model parameters in the literature (see Mehrabi et al. (2015, 2017); Rezaei et al. (2017); Mehrabi (2018) and references therein). Bayesian inference is a common and widely used method for constraining free model parameters. In this approach, we update a prior probability density in parameter space to obtain the posterior distribution using observational data. Since analytic solutions in Bayesian inference are very limited, one has to use a numerical method to find the posterior. Among these, Monte Carlo Markov Chain (MCMC) techniques are widely accepted and used in many different problems. The purpose of an MCMC algorithm is to construct a sample of points in parameter space, called a chain, and then obtain the posterior probability density from it. The simplest and most widely used MCMC algorithm is Metropolis-Hastings Hastings (1970), but depending on the situation other algorithms, such as Gibbs sampling T. and van Dyk D.A. (2009); Y. and X.L. (2011) and Hamiltonian Monte Carlo Neal (2011), have been used to obtain posterior distributions. To quantify the difference between probability distributions from different surveys, a robust framework is needed.

Originally motivated by information theory, the relative entropy, or Kullback-Leibler divergence, has been proposed to measure the difference between two probability densities S. and A. (1951). In cosmology, it has been used for experiment design and forecasting Farhang et al. (2013); Paykari and Jaffe (2013); Amara and Refregier (2014) as well as model selection Kunz et al. (2006); Verde et al. (2013). Moreover, the relative entropy has been introduced as a tool to measure the information gain from successive experiments Seehars et al. (2014); Grandis et al. (2016) and to measure tensions among datasets within a given model Seehars et al. (2016); Nicola et al. (2019). The relative entropy is sensitive to both statistical precision and shifts of confidence regions, and by disentangling these contributions one can separately measure the change in the size of confidence regions and the shift of parameters between two datasets. In the limit of Gaussian distributions the relative entropy has an analytic solution, but in the general case one must use a numerical method to obtain it. Since in most cases the probability distributions come from an MCMC chain, estimating the relative entropy directly from the chain is a valuable task. In this work, we introduce a method and provide a Python package to estimate the relative entropy from an MCMC chain.

Given two datasets $D_{1}$ and $D_{2}$ , it is straightforward to find the posterior probability distributions and then the relative entropy between these two distributions. As mentioned above, the relative entropy consists of two contributions: information gain in precision and shifts in parameter space. To distinguish these two contributions, one can use the constraints from the $D_{1}$ dataset to anticipate the expected relative entropy for the $D_{2}$ dataset, assuming that both datasets are described by the same model. The difference between the relative entropy and the expected one is called the surprise and was introduced in Seehars et al. (2014) as a tool to measure consistency between datasets in a given model. The expected relative entropy has an analytic solution in the case of two Gaussian distributions, but in the general case we need a numerical method to estimate it. Many algorithms for this have been proposed in the literature C. et al. (2013); Q. et al. (2013); X. and M. (2013). In this work, we introduce a Python package to estimate the expected relative entropy based on the algorithm proposed in X. and M. (2013).

This work is organized as follows. In section II we review the formalism of the relative entropy and present it for two Gaussian distributions. In section III, we discuss the surprise and its closed form in the Gaussian limit. In section IV, we present the linear Gaussian model and compare the exact results with those of our code to check its accuracy. Finally, in section V, we conclude and highlight the importance of our method.

II Information gain based on the relative entropy

The likelihood is the probability of the data $\mathcal{D}$ given the values of the parameters and is a crucial quantity in parameter inference. Given a likelihood, it is straightforward to update the prior information on the parameters, $\pi(\Theta)$ , to obtain the posterior $P(\Theta|\mathcal{D})$ through Bayes’ theorem:

$$P(\Theta|\mathcal{D}) = \frac{\mathcal{L}(\Theta;\mathcal{D})\,\pi(\Theta)}{E(\mathcal{D})}, \tag{1}$$

where $\mathcal{L}(\Theta;\mathcal{D})$ is the likelihood function for the data and the denominator is the Bayesian evidence which is given by

$$E(\mathcal{D}) = \int d\Theta\, \mathcal{L}(\Theta;\mathcal{D})\,\pi(\Theta). \tag{2}$$

In this process, one can measure the information gained in updating the prior to the posterior via the relative entropy, or Kullback-Leibler divergence. The relative entropy between two probability distributions $P_{1}(\Theta)$ and $P_{2}(\Theta)$ is given by:

$$D(P_{2}||P_{1}) \equiv \int d\Theta\, P_{2}(\Theta) \log\frac{P_{2}(\Theta)}{P_{1}(\Theta)}. \tag{3}$$

The relative entropy is always non-negative and equals zero only for $P_{1}(\Theta)=P_{2}(\Theta)$ . It is not symmetric in $P_{1}$ and $P_{2}$ , but it is invariant under invertible transformations of $\Theta$ .

For two Gaussian distributions $P_{1}(\Theta)=\mathcal{N}(\Theta;\Theta_{1},\Sigma_{1})$ and $P_{2}(\Theta)=\mathcal{N}(\Theta;\Theta_{2},\Sigma_{2})$ the relative entropy is given by:

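$$
\begin{aligned}
D(P_{2}||P_{1}) = {} & \frac{1}{2}(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) \\
& + \frac{1}{2}\left(\mathrm{tr}(\Sigma_{2}\Sigma_{1}^{-1}) - d - \log\frac{\det(\Sigma_{2})}{\det(\Sigma_{1})}\right).
\end{aligned} \tag{4}
$$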
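As a minimal sketch (not the authors' package), the Gaussian closed form $D(P_{2}||P_{1}) = \frac{1}{2}(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) + \frac{1}{2}\left(\mathrm{tr}(\Sigma_{2}\Sigma_{1}^{-1}) - d - \log\frac{\det\Sigma_{2}}{\det\Sigma_{1}}\right)$ can be evaluated with NumPy as follows. The function name `gaussian_relative_entropy` and the example numbers are illustrative assumptions.

```python
import numpy as np


def gaussian_relative_entropy(mean1, cov1, mean2, cov2):
    """Closed-form D(P2 || P1) in nats for P_i = N(mean_i, cov_i).

    Hypothetical helper for illustration; not part of MCKLdivergence.
    """
    mean1 = np.atleast_1d(np.asarray(mean1, dtype=float))
    mean2 = np.atleast_1d(np.asarray(mean2, dtype=float))
    cov1 = np.atleast_2d(np.asarray(cov1, dtype=float))
    cov2 = np.atleast_2d(np.asarray(cov2, dtype=float))
    d = mean1.size
    cov1_inv = np.linalg.inv(cov1)
    shift = mean1 - mean2
    # Contribution from the shift of the mean, weighted by the P1 precision.
    mean_term = 0.5 * shift @ cov1_inv @ shift
    # Contribution from the change in covariance (gain in precision).
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    precision_term = 0.5 * (np.trace(cov2 @ cov1_inv) - d - (logdet2 - logdet1))
    return mean_term + precision_term


# Illustrative 2D example: a broad P1 updated to a narrower, shifted P2.
example = gaussian_relative_entropy(
    [0.0, 0.0], [[1.0, 0.2], [0.2, 2.0]],
    [0.3, -0.1], [[0.5, 0.1], [0.1, 0.8]],
)
print(f"D(P2||P1) = {example:.4f} nats")
```

Using `slogdet` rather than `det` avoids overflow or underflow of the determinants in higher dimensions.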

The first term measures the significance of the shift in the mean and the second term quantifies the change in precision. In the general case the probability distributions are not Gaussian, so a method to estimate the relative entropy for arbitrary distributions is needed. Taking $P_{2}(\Theta)$ to be a posterior and using Eq.(1), the relative entropy can be rewritten as:

$$D(P_{2}||P_{1}) = -\log E(\mathcal{D}) + \int d\Theta\, P_{2}(\Theta) \log\mathcal{L}(\Theta). \tag{5}$$

Given a sample from the posterior, the integral in the second term can be easily estimated as the sample average $\langle\log\mathcal{L}(\Theta)\rangle_{P_{2}}$ , so once the evidence is known, one can estimate the relative entropy from a sample for arbitrary distributions. We provide a Python package (available at https://github.com/ahmadiphy/MCKLdivergence) to estimate the relative entropy using Eq.(5). The inputs are a sample chain and $\log(\mathrm{prior})$ at each sample in the chain. The code estimates the first term, the evidence, using the method introduced in Heavens et al. (2017), which is based on kth nearest-neighbour distances in parameter space.
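The chain-based estimator, $D \approx -\log E(\mathcal{D}) + \langle\log\mathcal{L}\rangle_{P_{2}}$, can be illustrated on a one-dimensional conjugate Gaussian toy problem where the evidence is known analytically. This is only a sketch under stated assumptions: the evidence is taken from its closed form rather than from the nearest-neighbour estimator used by the package, the "chain" is drawn directly from the exact posterior instead of an MCMC run, and all names and numbers are hypothetical.

```python
import math
import random


def log_gauss(x, mean, var):
    """Log density of a 1D Gaussian N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def relative_entropy_from_chain(chain, log_like, log_evidence):
    """Estimate D(posterior || prior) = -log E + <log L> over posterior samples."""
    mean_log_like = sum(log_like(theta) for theta in chain) / len(chain)
    return -log_evidence + mean_log_like


# Toy setup (assumed): prior N(0, 4), one datum d = 1.5 with noise variance 1.
prior_var, noise_var, datum = 4.0, 1.0, 1.5
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * datum / noise_var
# Analytic evidence of the conjugate model: d ~ N(0, prior_var + noise_var).
log_evidence = log_gauss(datum, 0.0, prior_var + noise_var)

# Stand-in for an MCMC chain: independent draws from the exact posterior.
rng = random.Random(1)
chain = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(100_000)]

estimate = relative_entropy_from_chain(
    chain, lambda theta: log_gauss(datum, theta, noise_var), log_evidence
)
# Closed-form Gaussian relative entropy in 1D for comparison.
exact = 0.5 * (
    post_mean ** 2 / prior_var
    + post_var / prior_var
    - 1.0
    - math.log(post_var / prior_var)
)
print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

In a realistic application the evidence is not known and must itself be estimated from the chain, which is typically the harder part of the calculation.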

III Expected relative entropy and surprise

Given a prior and a likelihood function, it is possible to generate several realizations of the data. Denoting by $P(\mathcal{D}^{\prime}|\mathcal{D})$ the probability of obtaining $\mathcal{D}^{\prime}$ given $\mathcal{D}$ , the expected relative entropy is given by

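$$\langle D\rangle = \int d\mathcal{D}^{\prime}\, P(\mathcal{D}^{\prime}|\mathcal{D})\, D(P_{2}||P_{1}), \tag{6}$$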

where $P(\mathcal{D}^{\prime}|\mathcal{D})$ is given by:

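$$P(\mathcal{D}^{\prime}|\mathcal{D}) = \int d\Theta\, P(\Theta|\mathcal{D})\, P(\mathcal{D}^{\prime}|\Theta). \tag{7}$$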

Note that one can take the prior from one dataset, for example $\mathcal{D}_{1}$ , and the likelihood from $\mathcal{D}_{2}$ , so that the expected relative entropy is estimated between two datasets. The surprise is defined in Seehars et al. (2014) as

$$S = D - \langle D\rangle, \tag{8}$$

which scatters around zero. A positive value of $S$ indicates that the posterior differs from the prior more than expected, and a negative value means the constraints are more consistent than expected a priori.

For Gaussian distributions, it has been shown that $S$ follows a generalized $\chi^{2}$ distribution Seehars et al. (2014). Given an observed value of $S$ , one can therefore compute the probability of obtaining a surprise that deviates from zero by more than the observed value. This quantity is the so-called p-value for the hypothesis that both datasets are consistent within the considered model, and a small p-value indicates evidence against the hypothesis.

In the limit of two Gaussian posteriors, the surprise is given by

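$$S = \frac{1}{2}\left\{(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) - \mathrm{tr}(\mathbb{1} \pm \Sigma_{2}\Sigma_{1}^{-1})\right\}, \tag{9}$$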

where the $-$ sign holds when the posterior of $D_{1}$ is used as a prior to obtain the posterior of $D_{2}$ and the $+$ sign holds when a wide prior is used for both posteriors. Since in the general case the posteriors derived from $D_{1}$ and $D_{2}$ are not Gaussian, we need a general approach for obtaining the surprise.
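For the Gaussian limit, the surprise $S = \frac{1}{2}\{(\Theta_{1}-\Theta_{2})^{T}\Sigma_{1}^{-1}(\Theta_{1}-\Theta_{2}) - \mathrm{tr}(\mathbb{1} \pm \Sigma_{2}\Sigma_{1}^{-1})\}$ can be sketched in a few lines of NumPy. The snippet below also checks numerically that, in the update case ($-$ sign), combining it with the Gaussian relative entropy through $\langle D\rangle = D - S$ gives $\langle D\rangle = -\frac{1}{2}\log\det(\Sigma_{2}\Sigma_{1}^{-1})$, which follows algebraically from the two closed forms. Function names and the example numbers are illustrative assumptions, not the package API.

```python
import numpy as np


def gaussian_relative_entropy(mean1, cov1, mean2, cov2):
    """Closed-form D(P2 || P1) for two Gaussians (hypothetical helper)."""
    d = mean1.size
    cov1_inv = np.linalg.inv(cov1)
    shift = mean1 - mean2
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return 0.5 * (
        shift @ cov1_inv @ shift
        + np.trace(cov2 @ cov1_inv) - d - (logdet2 - logdet1)
    )


def gaussian_surprise(mean1, cov1, mean2, cov2, update=True):
    """Gaussian-limit surprise S.

    update=True  : posterior 1 was used as the prior for posterior 2 (minus sign).
    update=False : a wide prior was used for both posteriors (plus sign).
    """
    d = mean1.size
    cov1_inv = np.linalg.inv(cov1)
    shift = mean1 - mean2
    sign = -1.0 if update else 1.0
    trace_term = np.trace(np.eye(d) + sign * cov2 @ cov1_inv)
    return 0.5 * (shift @ cov1_inv @ shift - trace_term)


# Illustrative numbers only.
mean1 = np.array([0.0, 0.0])
cov1 = np.array([[1.0, 0.2], [0.2, 2.0]])
mean2 = np.array([0.3, -0.1])
cov2 = np.array([[0.5, 0.1], [0.1, 0.8]])

rel_entropy = gaussian_relative_entropy(mean1, cov1, mean2, cov2)
surprise = gaussian_surprise(mean1, cov1, mean2, cov2, update=True)
expected = rel_entropy - surprise  # <D> = D - S

# In the update case this should equal -1/2 log det(Sigma_2 Sigma_1^{-1}).
_, logdet_ratio = np.linalg.slogdet(cov2 @ np.linalg.inv(cov1))
print(f"D = {rel_entropy:.4f}, S = {surprise:.4f}, <D> = {expected:.4f}")
print(f"-0.5 * logdet = {-0.5 * logdet_ratio:.4f}")
```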

In Bayesian experimental design, the expected relative entropy is a well-known quantity, and many algorithms have been proposed to obtain it in general non-linear cases C. et al. (2013); Q. et al. (2013); X. and M. (2013). Among these, a simple method has been proposed in X. and M. (2013), where the expected relative entropy is estimated from

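$$\langle D\rangle \approx \frac{1}{n}\sum_{i=1}^{n}\left\{\log P(\mathcal{D}_{2}^{i}|\Theta_{i}) - \log P(\mathcal{D}_{2}^{i}|\mathcal{D}_{1})\right\}, \tag{10}$$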

where the second term can be estimated from

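$$P(\mathcal{D}_{2}^{i}|\mathcal{D}_{1}) \approx \frac{1}{n}\sum_{j=1}^{n} P(\mathcal{D}_{2}^{i}|\Theta_{j}). \tag{11}$$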

In the above formula, $\Theta_{i}$ is a sample from the $\mathcal{D}_{1}$ posterior and $\mathcal{D}_{2}^{i}$ is a simulated data vector drawn from the $\mathcal{D}_{2}$ likelihood evaluated at $\Theta_{i}$ . Given a sample of $\Theta$ from the $\mathcal{D}_{1}$ posterior and the likelihood function of $\mathcal{D}_{2}$ , it is possible to estimate the expected relative entropy, and then the surprise, from the above formula. We provide a class in our Python package to estimate the expected relative entropy using this algorithm. In this case, the inputs are a sample of $\Theta$ from the $\mathcal{D}_{1}$ posterior, the covariance of the likelihood $(\Sigma)$ , the model function $F(\Theta)$ , the number of samples $(l)$ to be used to estimate the expected relative entropy, and the dimension of the data $(n)$ (here $n$ denotes the data dimension, whereas in Eq.(10) it denotes the number of samples in the sum). The code uses the $l$ given sample values to simulate $l$ data vectors from a Gaussian likelihood

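$$\mathcal{L}(\Theta;\mathcal{D}) \sim e^{-\frac{1}{2}(\mathcal{D}-F(\Theta))^{T}\Sigma^{-1}(\mathcal{D}-F(\Theta))}, \tag{12}$$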

and then uses them to estimate the expected relative entropy from Eq.(10). Note that the user-defined model function should return a vector of dimension $n$ . The current version of the code supports only a Gaussian likelihood; we plan to extend it to the general case in subsequent updates.
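The nested Monte Carlo estimator described above, $\langle D\rangle \approx \frac{1}{l}\sum_{i}\{\log P(\mathcal{D}_{2}^{i}|\Theta_{i}) - \log\frac{1}{l}\sum_{j}P(\mathcal{D}_{2}^{i}|\Theta_{j})\}$ with a Gaussian likelihood, can be sketched as follows. This is a simplified, serial illustration and not the class shipped in the package: the function name, its signature, and the toy model are assumptions, and the $l \times l \times n$ residual array limits it to modest sample sizes.

```python
import numpy as np


def expected_relative_entropy(theta, model, cov, rng):
    """Nested Monte Carlo estimate of <D> for a Gaussian likelihood.

    theta : (l, p) array of samples from the D1 posterior
    model : callable mapping a (p,) parameter vector to an (n,) prediction
    cov   : (n, n) covariance of the D2 likelihood
    rng   : numpy.random.Generator used to simulate the data
    """
    n_samples = theta.shape[0]
    preds = np.array([np.atleast_1d(model(t)) for t in theta])  # (l, n)
    n_data = preds.shape[1]

    # Simulate one data vector per sample: D_i ~ N(F(theta_i), cov).
    chol = np.linalg.cholesky(cov)
    data = preds + rng.standard_normal((n_samples, n_data)) @ chol.T

    cov_inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    log_norm = -0.5 * (n_data * np.log(2.0 * np.pi) + logdet)

    # log_like[i, j] = log P(D_i | theta_j) for every data/sample pair.
    resid = data[:, None, :] - preds[None, :, :]  # (l, l, n)
    chi2 = np.einsum("ijk,km,ijm->ij", resid, cov_inv, resid)
    log_like = log_norm - 0.5 * chi2

    # log P(D_i | D1) ~ log-mean-exp over j, computed stably.
    row_max = log_like.max(axis=1, keepdims=True)
    log_pred = row_max[:, 0] + np.log(np.mean(np.exp(log_like - row_max), axis=1))

    return float(np.mean(np.diag(log_like) - log_pred))


# Toy check (assumed setup): theta ~ N(0, 1), F(theta) = theta, noise variance 0.25.
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=(1500, 1))
estimate = expected_relative_entropy(samples, lambda t: t, np.array([[0.25]]), rng)
# For this linear Gaussian case <D> = 1/2 log(1 + 1/0.25).
exact = 0.5 * np.log(1.0 + 1.0 / 0.25)
print(f"estimate = {estimate:.3f}, exact = {exact:.3f}")
```

The cost grows as $l^{2}$ likelihood evaluations, which is why parallel evaluation becomes attractive for large samples.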

IV Linear Gaussian model and accuracy of the method

In order to check the accuracy of our code, we consider the linear Gaussian model, which has an analytic solution for both the relative entropy and the expected relative entropy. We consider a Gaussian prior $\mathcal{N}(\Theta;\Theta_{1},\Sigma_{1})$ on $\Theta$ and a Gaussian likelihood $\mathcal{N}(\mathcal{D};F(\Theta),\mathcal{C})$ on the data. Since the model function $F(\Theta)$ must be linear in $\Theta$ , we assume

$$F(\Theta) = F_{0} + M\Theta, \tag{13}$$

where $M_{ij}=X^{j}(x_{i})$ is a matrix evaluated at some arbitrary points $x_{i}$ and $F_{0}$ is a constant vector. In this formalism, $X^{j}(x)$ are the basis functions, which may well be non-linear functions of $x$ . Note that the model function must be linear in $\Theta$ , but not necessarily in $x$ . Using Bayes’ theorem, the posterior is a Gaussian with the following covariance and mean

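$$
\begin{aligned}
\Sigma_{2} &= (\Sigma_{1}^{-1} + M^{T}\mathcal{C}^{-1}M)^{-1}, \\
\Theta_{2} &= \Sigma_{2}\left(\Sigma_{1}^{-1}\Theta_{1} + M^{T}\mathcal{C}^{-1}(\mathcal{D}-F_{0})\right).
\end{aligned} \tag{14}
$$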
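The posterior of the linear Gaussian model, $\Sigma_{2} = (\Sigma_{1}^{-1} + M^{T}\mathcal{C}^{-1}M)^{-1}$ and $\Theta_{2} = \Sigma_{2}(\Sigma_{1}^{-1}\Theta_{1} + M^{T}\mathcal{C}^{-1}(\mathcal{D}-F_{0}))$, can be computed directly. The sketch below assumes, purely for illustration, polynomial basis functions $X^{j}(x) = x^{j}$, ten evaluation points, and a diagonal noise covariance; none of these choices is taken from the paper, and the function name is hypothetical.

```python
import numpy as np


def linear_gaussian_posterior(prior_mean, prior_cov, design, noise_cov, data, offset):
    """Posterior mean and covariance for F(theta) = offset + design @ theta."""
    prior_prec = np.linalg.inv(prior_cov)
    noise_prec = np.linalg.inv(noise_cov)
    post_cov = np.linalg.inv(prior_prec + design.T @ noise_prec @ design)
    post_mean = post_cov @ (
        prior_prec @ prior_mean + design.T @ noise_prec @ (data - offset)
    )
    return post_mean, post_cov


# Assumed toy setup: quadratic polynomial basis evaluated at 10 points.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
design = np.vander(x, 3, increasing=True)  # columns: 1, x, x^2
offset = np.zeros(x.size)
prior_mean = np.zeros(3)
prior_cov = np.eye(3)
noise_cov = 0.01 * np.eye(x.size)

# Simulate data from a parameter vector drawn from the prior.
true_theta = rng.multivariate_normal(prior_mean, prior_cov)
noise = rng.multivariate_normal(np.zeros(x.size), noise_cov)
data = offset + design @ true_theta + noise

post_mean, post_cov = linear_gaussian_posterior(
    prior_mean, prior_cov, design, noise_cov, data, offset
)
print("posterior mean:", post_mean)
print("posterior standard deviations:", np.sqrt(np.diag(post_cov)))
```

With the prior and posterior moments in hand, the Gaussian closed forms for the relative entropy and the surprise provide the exact values against which a chain-based estimate can be compared.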

Having both a Gaussian prior and a Gaussian posterior, we can use Eq.(3) and Eq.(6) to compute the relative and expected relative entropy. We then generate a sample chain using the Hamiltonian Monte Carlo algorithm and compute the same quantities with our code. The results in 3 dimensions are shown in Figs. (1 and 2). For the relative entropy, with $k=1$ and $10^{4}$ samples the relative error is around $0.6\%$ ; it decreases for larger $k$ and is below $0.2\%$ for $k=4$ with $10^{5}$ samples. As shown in Fig. (1), the relative error of the relative entropy decreases like a power law as the sample size increases. The relative error of the expected relative entropy also decreases like a power law with the sample size and falls below $0.2\%$ for $n=3\times 10^{4}$ samples. Since the expected relative entropy is computationally expensive, the code provides a class that parallelizes all computations using MPI.

Moreover, we repeat the above computation for a 5-dimensional case to study how increasing the dimension affects the results. The results in this case are presented in Figs. (3 and 4).

In contrast to the 3D case, different values of $k$ give almost the same results. Here the relative error for $10^{5}$ samples is around $0.3\%$ , which indicates the robustness of our code. Similar to the 3D case, the relative error of the expected relative entropy decreases like a power law and is around $0.5\%$ for $3\times 10^{4}$ samples. Note that, for a small number of samples, the error in the expected relative entropy is larger in 5D than in 3D.

V Conclusion

In this work, we introduce a novel method and provide a Python package to compute the relative and expected relative entropy. The relative entropy quantifies the amount of information gained in updating a prior to a posterior probability density. Since in most cases of Bayesian inference an MCMC algorithm is used to generate a sample of the posterior, our code uses the chain together with $\log(\mathrm{prior})$ to estimate the relative entropy from the chain. For the relative entropy, the relative error between the exact and estimated values in the linear Gaussian model is around $0.2\%$ and $0.3\%$ for $10^{5}$ samples in 3D and 5D respectively. Since there is no closed-form solution for the relative entropy of arbitrary probability distributions, the code should be useful for estimating the information gain in updating a prior to a posterior in the general case.

In addition to the relative entropy, our code estimates the expected relative entropy, for which several algorithms exist in the literature. The expected relative entropy is used to define the so-called surprise, which measures the consistency between posterior distributions and can be used to quantify possible tension between data sets within a model. Given a sample from the first posterior and the likelihood function of the second data set, our code provides an estimate of the expected relative entropy based on the algorithm presented in X. and M. (2013). Since the linear Gaussian model has an analytic solution, we compare the estimated value with the exact one in 3D and 5D to check the robustness of our code. The relative errors in this case are around $0.2\%$ and $0.5\%$ for $3\times 10^{4}$ samples in 3D and 5D respectively. The code is available on GitHub. Alongside the code, there are two examples for computing the relative entropy and the expected relative entropy in the linear Gaussian model.

Bibliography

1. Heavens et al. (2017): A. Heavens, Y. Fantaye, A. Mootoovaloo, H. Eggers, Z. Hosenie, S. Kroon, and E. Sellentin, (2017), arXiv:1704.03472 [stat.CO].
2. Mehrabi et al. (2015): A. Mehrabi, S. Basilakos, and F. Pace, Mon. Not. Roy. Astron. Soc. 452, 2930 (2015), arXiv:1504.01262 [astro-ph.CO].
3. Mehrabi et al. (2017): A. Mehrabi, F. Pace, M. Malekjani, and A. Del Popolo, Mon. Not. Roy. Astron. Soc. 465, 2687 (2017), arXiv:1608.07961 [astro-ph.CO].
4. Rezaei et al. (2017): M. Rezaei, M. Malekjani, S. Basilakos, A. Mehrabi, and D. F. Mota, Astrophys. J. 843, 65 (2017), arXiv:1706.02537 [astro-ph.CO].
5. Mehrabi (2018): A. Mehrabi, Phys. Rev. D 97, 083522 (2018), arXiv:1804.09886 [astro-ph.CO].
6. Hastings (1970): W. Hastings, Biometrika 57, 109 (1970).
7. T. and van Dyk D.A. (2009): P. T. and van Dyk D.A., Journal of Computational and Graphical Statistics 18, 283 (2009).
8. Y. and X.L. (2011): Y. Y. and M. X.L., Journal of Computational and Graphical Statistics 20, 531 (2011).