Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy

Sanjib Sharma

arXiv:1706.01629·astro-ph.IM·August 30, 2017


TL;DR

This paper reviews the use of Markov Chain Monte Carlo methods for Bayesian data analysis in astronomy, covering foundational theory, various algorithms, and advanced techniques, with software tools provided.

Contribution

It offers a comprehensive overview of MCMC methods in astronomical Bayesian analysis, including new algorithms and practical software implementations.

Findings

01

Enhanced MCMC algorithms for complex astronomical data

02

Software tools available for practitioners

03

Discussion of future directions in Bayesian analysis


Equations

$\bar{F}=\frac{\int F(\omega)\exp(-E(\omega)/kT)\,d\omega}{Z}$

$p(H|I)+p(\bar{H}|I)=1$

$p(H,D|I)=p(H|D,I)\,p(D|I)=p(D|H,I)\,p(H|I)$

$p(H|D,I)=\frac{p(D|H,I)\,p(H|I)}{p(D|I)},\quad {\rm Posterior}=\frac{{\rm Likelihood}\times{\rm Prior}}{{\rm Evidence}}$

$p(X|I)=\int p(X,Y|I)\,{\rm d}Y=\sum_{i}p(X,Y_{i}|I)$

$\sum_{i}p(Y_{i}|I)=1$

$\sum_{i}p(X,Y_{i}|I)=p(X|I)\sum_{i}p(Y_{i}|X,I)=p(X|I)$

$p(x|\theta,\sigma_{x})=\int f(x^{t}|\theta)\,p(x|x^{t},\sigma_{x})\,{\rm d}x^{t}$

$p(x|\theta,\theta_{b},P_{b},\sigma_{x})$

$p(X|\theta,\theta_{b},P_{b},\sigma_{x})$

$p(\theta,\theta_{b},P_{b}|X,\sigma_{x})$

$p(y_{i}|m,c,x_{i},\sigma_{y,i})=\frac{1}{\sqrt{2\pi}\,\sigma_{y,i}}\exp\left(-\frac{(y_{i}-mx_{i}-c)^{2}}{2\sigma_{y,i}^{2}}\right)$

$p(y_{i}|\mu_{b},\sigma_{b},x_{i},\sigma_{y,i})=\frac{1}{\sqrt{2\pi(\sigma_{y,i}^{2}+\sigma_{b}^{2})}}\exp\left(-\frac{(y_{i}-\mu_{b})^{2}}{2(\sigma_{y,i}^{2}+\sigma_{b}^{2})}\right)$

$p(Y|m,c,P_{b},\mu_{b},\sigma_{b},X,\sigma_{y})=\prod_{i=1}^{N}\left[p(y_{i}|\mu_{b},\sigma_{b},x_{i},\sigma_{y,i})P_{b}+p(y_{i}|m,c,x_{i},\sigma_{y,i})(1-P_{b})\right]$

$p(\theta)\propto\det(\mathcal{I}(\theta))^{1/2}$, where $[\mathcal{I}(\theta)]_{ij}=-\int p(x|\theta)\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\ln p(x|\theta)\,{\rm d}x$

$p(\mu_{1},\ldots,\mu_{k},\theta)\propto\det(\mathcal{I}(\theta))^{1/2}$

$\mathbf{\Sigma}_{i}=\left[\begin{array}{cc}\sigma_{x,i}^{2}&\sigma_{xy,i}^{2}\\ \sigma_{xy,i}^{2}&\sigma_{y,i}^{2}\end{array}\right]$

$p(x,y|x_{i},y_{i},\sigma_{x,i},\sigma_{y,i})$

$p(x,y|a,b,\sigma_{p})$

$p(x_{i},y_{i}|a,b,\sigma_{x,i},\sigma_{y,i},\sigma_{p})$

$p(X,Y|\mathbf{\Sigma},a,b,\sigma_{p})$

$p(a,b|X,Y,\mathbf{\Sigma})$

$p(a,b)\,{\rm d}a\,{\rm d}b=\frac{{\rm d}\theta}{\pi}\frac{{\rm d}b_{\perp}}{2B_{\perp}}=\frac{1}{(1+a^{2})^{3/2}}\frac{{\rm d}a\,{\rm d}b}{2B_{\perp}\pi}$

$\ln L$

$p(a,b,\sigma_{\perp}|\{x_{i}\},\{y_{i}\})$

$p(a,b|\{x_{i}\},\{y_{i}\})$

$p(M|D)=\frac{p(D|M)\,p(M)}{p(D)}$

$\frac{p(M_{2}|D)}{p(M_{1}|D)}$

$p(\theta|D,M)=\frac{p(D|\theta,M)\,p(\theta|M)}{p(D|M)}$


Volume 55, 2017

Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy

Sanjib Sharma$^{1}$

Draft version. To appear in Annual Review of Astronomy and Astrophysics.

$^{1}$Sydney Institute for Astronomy, School of Physics, University of Sydney, NSW 2006, Australia, email: [email protected]

Abstract

Markov Chain Monte Carlo based Bayesian data analysis has now become the method of choice for analyzing and interpreting data in almost all disciplines of science. In astronomy, over the last decade, we have also seen a steady increase in the number of papers that employ Monte Carlo based Bayesian analysis. New, efficient Monte Carlo based methods are continuously being developed and explored. In this review, we first explain the basics of Bayesian theory and discuss how to set up data analysis problems within this framework. Next, we provide an overview of various Monte Carlo based methods for performing Bayesian data analysis. Finally, we discuss advanced ideas that enable us to tackle complex problems and thus hold great promise for the future. We also distribute downloadable computer software (https://github.com/sanjibs/bmcmc/) that implements some of the algorithms and examples discussed here.


keywords:

Methods: data analysis, numerical statistical

journal: Annu. Rev. Astron. Astrophys.

1 Introduction
1.1 Rise of MCMC based Bayesian methods in astronomy and science
2 Bayesian Data Analysis
2.1 Bayes’ Theorem
2.2 Fitting a model to data
2.3 Priors
2.4 Fitting a straight line
2.5 Model comparison
2.5.1 Bayesian model comparison
2.5.2 Predictive methods for Model comparison
3 Monte Carlo methods for Bayesian computations
3.1 Markov Chain
3.2 Metropolis Hastings algorithm
3.3 Gibbs sampling
3.4 Metropolis within Gibbs
3.5 Adaptive Metropolis
3.6 Affine invariant sampling
3.7 Convergence Diagnostics
3.7.1 Effective sample size
3.7.2 Variance between chains
3.7.3 Thinning
3.8 Parallel Tempering
3.9 Monte Carlo Metropolis Hastings
3.9.1 Unknown normalization constant
3.9.2 Marginal inference
3.10 Hamiltonian Monte Carlo
3.11 Population Monte Carlo
3.12 Nested Sampling
4 Bayesian hierarchical modelling (BHM)
4.1 Expectation maximization, data augmentation and Gibbs sampling
4.2 Handling uncertainties in observed data
5 Case studies in astronomy
5.1 Exoplanets and binary systems using radial velocity measurements
5.2 Data driven approach to estimation of stellar parameters from a spectrum
5.3 Solar-like oscillations in stars
5.4 Extinction mapping and estimation of intrinsic stellar properties
5.5 Kinematic and dynamical modelling of the Milky Way
6 Concluding remarks

1 Introduction

Markov Chain Monte Carlo (MCMC) and Bayesian Statistics are two independent disciplines, the former being a method to sample from a distribution while the latter is a theory to interpret observed data. When these two disciplines are combined, the effect is so dramatic and powerful that it has revolutionized data analysis in almost all disciplines of science, and astronomy is no exception. This review explores the power of this combination.

What is so special about MCMC based Bayesian data analysis? The usefulness of Bayesian methods in science and astronomy is easy to understand. In many situations, it is easy to predict the outcome given a cause. But in science, most often, we are faced with the opposite question. Given the outcome of an experiment what are the causes, or what is the probability of a cause as compared to some other cause? If we have some prior information, how does that help us? This opposite problem is more difficult to solve. The power of Bayesian theory lies in the fact that it provides a unified framework to quantitatively answer such questions. Hence it has become indispensable for science. As opposed to deductive logic, Bayesian theory provides a framework for plausible reasoning, a concept which is more powerful and general, an idea championed by Jaynes (2003) in his book.

The question now is how does one solve a problem that has been set up using Bayesian theory. This mostly involves computing the probability distribution function (pdf) of some parameters given the data and is written as $p(\theta|D)$ . Here, $\theta$ need not be a single parameter; in general, it represents a set of parameters. Usually here and elsewhere, such functions do not have analytical solutions and so we need methods to numerically evaluate the distribution. This is where MCMC methods come to the rescue. They provide an efficient and easy way to sample points from any given distribution which is analogous to evaluating the distribution.

Bayesian data analysis (Jeffreys, 1939) and Markov Chain Monte Carlo (Metropolis et al., 1953) techniques have existed for more than 50 years. Their tremendous increase in popularity over the last decade is due to an increase in computational power which has made it affordable to do such computations.

The simplest and the most widely used MCMC algorithm is the “random walk” Metropolis algorithm (Section 3.2). However, the efficiency of this algorithm depends upon the “proposal distribution” which the user has to supply. This means that there is some problem-specific fine tuning to be done by the user. The problem of finding a suitable proposal distribution becomes worse as the dimensionality of the space over which the sampling is done increases. Correlations and degeneracies between the coordinates further exacerbate the problem. Many algorithms have been proposed to solve this problem and it remains an active area of research. Some algorithms work better for specific problems and under special conditions, but algorithms that work well in general are in high demand. Multimodal distributions pose additional problems for MCMC algorithms. In such situations, an MCMC chain can easily get stuck at a local density maximum. To overcome this, algorithms like simulated tempering and parallel tempering have been proposed (Section 3.8). Hence discussion of efficient MCMC algorithms is one focus of this review.

Given its general applicability, the Bayesian framework can be used in almost any field of astronomy. Hence, it is not possible to discuss all its applications. However, there are many examples where alternatives either do not exist or are inferior. The aim of this review is to specifically discuss such cases where Bayesian-MCMC methods have enjoyed great success. The Bayesian framework by itself is very simple. The difficult part when attempting to solve a problem is to express the problem within this framework and then to choose the appropriate MCMC method to solve it. The best way to master this is by studying a diverse set of applications, and we aim to provide this in our review (Section 5). Finally, we also discuss a few advanced topics like non-parametric models and hierarchical Bayesian models (Section 4) which are not yet mainstream in astronomy but are very powerful and allow one to solve complex problems.

To summarize, the review has three main aims. The first is to explain the basics of Bayesian theory using simple familiar problems, e.g., fitting a straight line to a set of data points with errors in both coordinates and in the presence of outliers. This is targeted at readers who are new to the topic. The second is to provide a concise overview of recent developments. This will benefit people who are familiar with Bayesian data analysis but are interested in learning more. The third is to discuss emerging ideas that hold great promise for the future. We also develop and distribute downloadable software (available at https://github.com/sanjibs/bmcmc/ or by running the command pip install bmcmc) implementing some of the algorithms and examples that we discuss.

1.1 Rise of MCMC based Bayesian methods in astronomy and science

The emergence of Bayesian statistics has a long and interesting history dating back to 1763 when Thomas Bayes laid down the basic ideas of his new probability theory (Bayes & Price, 1763, published posthumously by Richard Price). It was rediscovered independently by Laplace (de Laplace, 1774) and used in a wide variety of contexts, e.g., celestial mechanics, population statistics, reliability, and jurisprudence. However, after that it was largely ignored. A few scientists, like Bruno de Finetti and Harold Jeffreys, kept the Bayesian theory alive in the first half of the 20th century. Harold Jeffreys published the book Theory of Probability (Jeffreys, 1939), which for a long time remained the main reference for using the Bayes theorem. The Bayes theorem was used in the Second World War at Bletchley Park, United Kingdom, for cracking the German Enigma code, but its use remained classified for many years afterwards. From 1950 onwards, the tide turned towards Bayesian methods. However, the lack of proper tools to do Bayesian inference remained a challenge. The frequentist methods in comparison were simpler to implement, which made them more popular. The recent statement by the American Statistical Association (Wasserstein & Lazar, 2016) warning against the misuse of P values is another example of the superiority of the Bayesian methods of hypothesis testing.

Interestingly, efficient methods like MCMC to sample distributions had been invented by 1954 in the context of solving problems in statistical mechanics (Metropolis et al., 1953). (The brand name Monte Carlo was coined by Metropolis & Ulam (1949) where they discussed a stochastic method making use of random numbers to solve a class of problems in mathematical physics which are difficult to solve due to the large number of dimensions.) Such problems typically involve $N$ interacting particles. A single configuration $\omega$ of such a system is fully specified by giving the position and velocity of all the particles; i.e., $\omega$ can be defined by a point in $\mathcal{R}^{2N}$ space, also known as the configuration space $\Omega$ . The total energy is a function of the configuration $E(\omega)$ . For a system in equilibrium, the probability of a configuration is proportional to $\exp(-E(\omega)/kT)$ , where $k$ is the Boltzmann constant and $T$ is the temperature of the system. Computing any thermodynamic property of the system, e.g., pressure or energy, typically involves computing integrals of the form

$$\bar{F}=\frac{\int F(\omega)\exp(-E(\omega)/kT)\,d\omega}{Z}$$

for which $Z=\int\exp(-E(\omega)/kT)d\omega$ is known as the partition function. The integrals over $\omega$ are in most cases analytically and computationally intractable. The idea of Metropolis and colleagues was to start with an arbitrary configuration of $N$ particles and then move each particle by a random walk. If $\Delta E<0$ , the move is always accepted, otherwise, it is accepted stochastically with probability $\exp(-\Delta E/kT)$ , which is the ratio of the probability of the new configuration with respect to the old. The method ends up choosing a configuration $\omega$ sampled from $\exp(-E(\omega)/kT)$ . The method immediately became popular in the statistical physics community.
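The accept/reject rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original 1953 procedure or the code distributed with this review: all coordinates are moved at once with a Gaussian step, and the function name `metropolis`, the step size, and the harmonic toy energy are illustrative choices.

```python
import numpy as np

def metropolis(energy, x0, n_steps, step=1.0, kT=1.0, seed=0):
    """Random-walk Metropolis sampler for p(x) proportional to exp(-energy(x)/kT)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    e = energy(x)
    chain = np.empty((n_steps,) + x.shape)
    n_accept = 0
    for i in range(n_steps):
        # Symmetric random-walk proposal around the current configuration.
        x_new = x + step * rng.normal(size=x.shape)
        delta = energy(x_new) - e
        # Downhill moves are always accepted; uphill moves are accepted
        # with probability exp(-delta / kT).
        if delta < 0 or rng.random() < np.exp(-delta / kT):
            x, e = x_new, e + delta
            n_accept += 1
        chain[i] = x  # a rejected move repeats the current configuration
    return chain, n_accept / n_steps

# Toy check (an assumption for illustration): E = x^2 / 2 with kT = 1
# corresponds to a unit Gaussian target.
chain, acc = metropolis(lambda x: 0.5 * np.sum(x**2), x0=[0.0],
                        n_steps=20000, step=2.0)
print(chain[2000:].mean(), chain[2000:].std(), acc)
```

Note that a rejected proposal still contributes a sample (the current configuration is repeated), which is what makes the chain sample the target in the long run.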

However, the fact that the same method can be used for sampling an arbitrary pdf $p(\omega)$ by simply replacing $E(\omega)/kT$ with $-\ln(p(\omega))$ had to wait till the important paper by Hastings (1970). He generalized the work of Metropolis and colleagues and derived the essential condition for the acceptance ratio that a Markov chain ought to satisfy in order to sample the target distribution. The generalized algorithm is now known as the Metropolis-Hastings (MH) algorithm. Later Hastings’s student Peskun showed that, among the available choices, the one by Metropolis and colleagues was the most efficient (Peskun, 1973). Despite its introduction to the statistical community, the ideas remained dormant till 1980.

Around 1980 things suddenly changed and a few influential algorithms appeared. Simulated annealing was presented by Kirkpatrick et al. (1983) to solve combinatorial optimization problems using the MH algorithm in conjunction with ideas of annealing from solid state physics. It is especially useful for situations where we have multiple maxima and applies to any setting where we have to minimize an objective function $C(\omega)$ . This is done by sampling $\exp(-C(\omega)/T)$ , with progressively decreasing $T$ to allow annealing and selection of a globally optimum solution. A year later Geman & Geman (1984) introduced what we currently know as “Gibbs sampling” in the context of image restoration. This was the first proper use of MCMC techniques to solve a problem set up in a Bayesian framework, in the sense that simulating from conditional distributions is the same as simulating from the joint distribution. However, there exists earlier work related to Gibbs sampling; the Hammersley-Clifford theorem which was developed in the early 1970s and the work by Besag (1974).

At about this time, one of the most influential methods of the 20th century emerged: the expectation maximization (EM) algorithm by Dempster, Laird & Rubin (1977). This provided a way to deal with missing data and hidden variables and vastly increased the range of problems that can be addressed by Bayesian methods. The EM algorithm is deterministic and has some sensitivity to the starting configuration. To address this, stochastic versions were developed (Celeux & Diebolt, 1985), quickly followed by the data augmentation (DA) algorithm (Tanner & Wong, 1987).

The watershed moment in the field of statistics is largely credited to the paper by Gelfand & Smith (1990) that unified the ideas of Gibbs sampling, DA and the EM algorithm (Tanner & Wong, 2010; Robert & Casella, 2011). It firmly established that Gibbs sampling and Metropolis-Hastings based MCMC algorithms can be used to solve a wide class of problems that fall into the category of hierarchical Bayesian models. The citation history of the famous Metropolis et al. (1953) paper shown in Figure 1 corroborates the historical narrations on this topic. In physics, the MH algorithm was well known in the period 1970-1990, but this was not so in statistics or astronomy. In astronomy, a watershed moment can be seen in 2002; this is visible more clearly in Figure 2 where we track the usage of the words MCMC and Bayesian.

But prior to 2002, the Bayesian-MCMC technique was not unknown to the astronomy community. We can see its use in Saha & Williams (1994) who applied it to extract galaxy kinematics from absorption line spectra. Further seeds were planted down the line by Christensen & Meyer (1998) while studying gravitational wave radiation, and then by Christensen et al. (2001) and Knox, Christensen & Skordis (2001) in the context of cosmological parameter estimation using cosmic microwave background data. Inspired by these papers, Lewis & Bridle (2002), more than any other paper, seems to have galvanized the astronomy community in the use of Bayesian and MCMC techniques. They laid out in detail the Bayesian-MCMC framework, applied it to one of the most important data sets of the time (cosmic background radiation) and used it to address a significant scientific question: the fundamental parameters of our universe. Additionally, they made their MCMC code publicly available, which was instrumental in lowering the barrier for new entrants to the field.

2 Bayesian Data Analysis

In this section we briefly review the basics of the Bayesian theory. We start with the Bayes theorem and then use it to set up the problem of fitting a model to data. This is followed by a discussion of the role of priors in Bayesian analysis. Next, the Bayesian solution of fitting a straight line is discussed in detail to illustrate the ideas discussed. Finally, we show how to perform model selection. To further explore the topics discussed here, many excellent resources are available. A stimulating discussion on Bayesian theory can be found in Jaynes (2003). Sivia & Skilling (2006) and Gregory (2005) are excellent textbooks with a good emphasis on applications in science. Hogg, Bovy & Lang (2010) provide a lucid tutorial on fitting models to data. A fascinating discussion on Bayesian versus frequentist approaches to solving problems can be found in Loredo (1990). A review with emphasis on cosmology is given by Trotta (2008).

2.1 Bayes’ Theorem

Cox (1946) showed that the rules of Bayesian probability theory can be derived from just two basic rules:

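$$p(H|I)+p(\bar{H}|I)=1\quad\text{(sum rule)},$$

$$p(H,D|I)=p(H|D,I)\,p(D|I)=p(D|H,I)\,p(H|I)\quad\text{(product rule)}.$$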

Here $H$ stands for some proposition being true and $D$ stands for some other proposition being true, and $\bar{H}$ means the proposition $H$ is false. So the sum rule just states that the probability of a proposition being true plus the probability of it being false is unity. The product rule expresses the joint probability of two propositions being true in terms of conditional probabilities, one being true given the other is true. The vertical bar $|$ is a conditioning symbol and means ‘given’. $I$ denotes relevant background information that is used to construct the probabilities. The product rule leads to the Bayes Theorem

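$$p(H|D,I)=\frac{p(D|H,I)\,p(H|I)}{p(D|I)},\qquad{\rm Posterior}=\frac{{\rm Likelihood}\times{\rm Prior}}{{\rm Evidence}},$$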

where we identify $H$ with the hypothesis and $D$ with the data. The quantity $p(D|H,I)$ is the probability of observing the data $D$ if the hypothesis is true and is known as the likelihood. The quantity $p(H|I)$ is the prior and specifies our prior knowledge of $H$ being true. The quantity $p(H|D,I)$ , known as the posterior, expresses our updated belief about the truth of the hypothesis in light of the data $D$ . The quantity $p(D|I)$ is a constant and serves the purpose of normalizing $\int p(H|D,I)\>{\rm d}H$ to 1. It is known as the evidence.
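As a small numerical illustration of the theorem, consider a hypothesis with a low prior probability and data that favour it. The sketch below assumes only two alternatives, $H$ and $\bar{H}$, so that the evidence is the sum of likelihood times prior over both; the function name `posterior` and the numbers are invented for illustration.

```python
def posterior(prior_h, like_h, like_not_h):
    """p(H | D, I) for a binary hypothesis, using Bayes' theorem."""
    # Evidence p(D | I), summed over the two alternatives H and not-H.
    evidence = like_h * prior_h + like_not_h * (1.0 - prior_h)
    return like_h * prior_h / evidence

# Hypothetical numbers: p(H|I) = 0.01, p(D|H,I) = 0.95, p(D|not H,I) = 0.05.
p = posterior(prior_h=0.01, like_h=0.95, like_not_h=0.05)
print(p)
```

Even though the data are 19 times more probable under $H$ than under $\bar{H}$, the posterior stays well below one half because the prior is small.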

Another important result that can be derived from the sum rule and the product rule is the marginalization equation,

$$p(X|I)=\int p(X,Y|I)\,{\rm d}Y=\sum_{i}p(X,Y_{i}|I).$$

First let us write the sum rule in an alternate form. Instead of considering just $Y$ and $\bar{Y}$ , we consider a set of possibilities $\{Y_{i}\}$ that are mutually exclusive.

$$\sum_{i}p(Y_{i}|I)=1.$$

Now, making use of the product rule and the sum rule we get

$$\sum_{i}p(X,Y_{i}|I)=p(X|I)\sum_{i}p(Y_{i}|X,I)=p(X|I).$$

2.2 Fitting a model to data

Typically, we have some data and we want to use it for scientific inference. One of the most effective approaches to dealing with such problems is to develop a model that describes how the data were created. Let $\theta$ be the set of parameters of the model and $x^{t}$ a data point generated by the model according to $f(x^{t}|\theta)$ . The observed data points $x$ can have some measurement errors, described by a parameter $\sigma_{x}$ . The probability of the observed value is then given by $p(x|x^{t},\sigma_{x})$ , which could be $\mathcal{N}(x|x^{t},\sigma_{x}^{2})$ for Gaussian errors; hereafter, $\mathcal{N}(.|\mu,\sigma^{2})$ refers to a normal distribution with mean $\mu$ and variance $\sigma^{2}$ . The probability of observed data point $x$ given a model and an error is then

$$p(x|\theta,\sigma_{x})=\int f(x^{t}|\theta)\,p(x|x^{t},\sigma_{x})\,{\rm d}x^{t}.$$

We have integrated over true values $x^{t}$ which are unknown.
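For intuition, the integral over true values can be checked numerically in the special case where both the model and the errors are Gaussian, for which the result is again a Gaussian whose variance is the sum of the two variances. The sketch below is a toy check under that assumption; the grid-based integration and the function names are illustrative choices.

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def observed_pdf(x, mu, sigma, sigma_x, n_grid=4001):
    """p(x | theta, sigma_x): integrate f(x_t | theta) p(x | x_t, sigma_x) over x_t."""
    half_width = 10.0 * max(sigma, sigma_x)
    x_t = np.linspace(mu - half_width, mu + half_width, n_grid)  # true values
    integrand = normal_pdf(x_t, mu, sigma**2) * normal_pdf(x, x_t, sigma_x**2)
    return np.sum(integrand) * (x_t[1] - x_t[0])  # simple Riemann sum

# Assumed toy model: f(x_t | theta) = N(x_t | 0.5, 1.0^2), errors sigma_x = 0.7.
numeric = observed_pdf(1.3, mu=0.5, sigma=1.0, sigma_x=0.7)
analytic = normal_pdf(1.3, 0.5, 1.0**2 + 0.7**2)
print(numeric, analytic)
```

For non-Gaussian models the same grid integration still applies, although in practice the true values are often sampled along with the parameters rather than integrated out on a grid.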

If we have reason to believe that there are outliers in the data, e.g., a fraction of points are not described by the model, we can supplement a background model $f_{b}(x^{t}|\theta_{b})$ with probability $P_{b}$ and parameters $\theta_{b}$ (Press, 1997; Hogg, Bovy & Lang, 2010). The probability of the observed data points can then be written as,

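$$p(x|\theta,\theta_{b},P_{b},\sigma_{x})=p(x|\theta,\sigma_{x})(1-P_{b})+p(x|\theta_{b},\sigma_{x})P_{b}.$$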

The total probability for a set of $N$ data points $X=\{x_{1},...,x_{N}\}$ is then

$$p(X|\theta,\theta_{b},P_{b},\sigma_{x})=\prod_{i=1}^{N}p(x_{i}|\theta,\theta_{b},P_{b},\sigma_{x,i}).$$

To infer the model parameters, one uses the Bayes theorem and computes

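$$p(\theta,\theta_{b},P_{b}|X,\sigma_{x})\propto p(X|\theta,\theta_{b},P_{b},\sigma_{x})\,p(\theta,\theta_{b},P_{b}).$$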

Here, $p(\theta,\theta_{b},P_{b})$ represents our prior knowledge about the parameters. We discuss this in detail in the next section.

We consider the problem of fitting a straight line with equation $y=mx+c$ to some data points $X=\{x_{1},...,x_{N}\}$ and $Y=\{y_{1},...,y_{N}\}$ , with uncertainty $\sigma_{y,i}$ on the $y$ ordinate. We generated 50 data points with $m=2.0$ and $c=10.0$ ; 20% of the data points were set as outliers and were sampled from $\mathcal{N}(30,5^{2})$ . To simulate random uncertainty the $y$ ordinate was scattered with a Gaussian function having dispersion in the range $0.25<\sigma_{y}<1.25$ . The data along with the results of our fitting exercise are shown in Figure 3. The image shows the outliers and data sampled from a straight line. We first fitted a simple model without taking the outliers into account (dashed line). Here, $\theta=\{m,c\}$ and the generative model of the data is

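$$p(y_{i}|m,c,x_{i},\sigma_{y,i})=\frac{1}{\sqrt{2\pi}\,\sigma_{y,i}}\exp\left(-\frac{(y_{i}-mx_{i}-c)^{2}}{2\sigma_{y,i}^{2}}\right).$$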

It can be seen that the “best fit” line is not a good description for the data points that were sampled from a straight line. Next, we extended the model by adding a model for the outliers as

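$$p(y_{i}|\mu_{b},\sigma_{b},x_{i},\sigma_{y,i})=\frac{1}{\sqrt{2\pi(\sigma_{y,i}^{2}+\sigma_{b}^{2})}}\exp\left(-\frac{(y_{i}-\mu_{b})^{2}}{2(\sigma_{y,i}^{2}+\sigma_{b}^{2})}\right).$$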

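The full model is

$$p(Y|m,c,P_{b},\mu_{b},\sigma_{b},X,\sigma_{y})=\prod_{i=1}^{N}\left[p(y_{i}|\mu_{b},\sigma_{b},x_{i},\sigma_{y,i})P_{b}+p(y_{i}|m,c,x_{i},\sigma_{y,i})(1-P_{b})\right].$$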

The best-fit line resulting from this model, obtained by sampling the posterior distribution using a Markov Chain Monte Carlo scheme, is shown in Figure 3. The best-fit parameters of the model closely match the true parameters that were used to create the synthetic data set (the example is implemented in the software that we provide).
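The line-plus-outlier model of this section can be written as a log-likelihood function that any sampler can call. The following is a minimal sketch and not the implementation shipped with the software: the function names, the random seed, and the range of $x$ (taken here as 0 to 10) are assumptions, while the remaining numbers follow the description of the synthetic data above.

```python
import numpy as np

def log_normal(y, mu, var):
    """ln N(y | mu, var)."""
    return -0.5 * (y - mu) ** 2 / var - 0.5 * np.log(2.0 * np.pi * var)

def log_likelihood(params, x, y, sigma_y):
    """ln p(Y | m, c, P_b, mu_b, sigma_b, X, sigma_y) for the mixture model."""
    m, c, p_b, mu_b, sigma_b = params
    if not (0.0 < p_b < 1.0) or sigma_b <= 0.0:
        return -np.inf  # outside the allowed parameter range
    ln_line = np.log1p(-p_b) + log_normal(y, m * x + c, sigma_y**2)
    ln_bad = np.log(p_b) + log_normal(y, mu_b, sigma_y**2 + sigma_b**2)
    # logaddexp sums the two mixture components without underflow.
    return np.sum(np.logaddexp(ln_line, ln_bad))

# Synthetic data: 50 points, m = 2, c = 10, about 20% outliers from N(30, 5^2).
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 10.0, n)            # assumed x range
sigma_y = rng.uniform(0.25, 1.25, n)
y = 2.0 * x + 10.0 + sigma_y * rng.normal(size=n)
is_outlier = rng.random(n) < 0.2
k = int(is_outlier.sum())
y[is_outlier] = rng.normal(30.0, 5.0, k) + sigma_y[is_outlier] * rng.normal(size=k)

ln_true = log_likelihood([2.0, 10.0, 0.2, 30.0, 5.0], x, y, sigma_y)
ln_wrong = log_likelihood([1.0, 15.0, 0.2, 30.0, 5.0], x, y, sigma_y)
print(ln_true, ln_wrong)
```

Adding the log of a prior to this function gives the log posterior up to a constant, which is the quantity an MCMC sampler needs.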

2.3 Priors

Priors are one of the most important ingredients of the Bayesian framework. Priors express our present state of knowledge about the parameters of interest, which we wish to constrain by analyzing new data. In a multi-dimensional parameter space, it is quite common to have degeneracies among the parameters. Here priors can play a crucial role in restricting the posterior to a small region of the parameter space as compared to the much larger region allowed by the likelihood function. Priors can be broadly classified into two types, uninformative and informative. Uninformative priors express our state of ignorance and have very little restricting power. They are also known as ignorance priors. Typically their distributions are diffuse. By contrast, informative priors are typically very restricting. They might come from the analysis of some previous data. They are important when the data alone are not very informative and without external information the data cannot adequately constrain the parameters being investigated.

Ignorance priors are used in cases where we have very little knowledge about the parameters we want to constrain, and we wish to express our ignorance by using uninformative priors. Certainly a prior with sudden jumps or oscillating features is too detailed for expressing ignorance! So smoothness is certainly an important criterion for an ideal uninformative prior. In fact, if the data are informative, almost any prior that is sufficiently smooth in the region of high likelihood will lead to very similar conclusions. Is there a formal and unique way to express our ignorance?

A number of techniques exist for constructing ignorance priors. We here discuss a few simple and commonly used ones; for a detailed review see Kass & Wasserman (1996). The simplest is Laplace’s principle of insufficient reason which assigns equal probability to all possible values of the parameter. If the parameter space consists of a finite set of points, then it is easy to apply this principle. However, for a continuous parameter space, the prior depends upon the chosen partitioning scheme.

Ignorance priors can also be specified using the invariance of the likelihood function, $p(x^{\prime}|\theta^{\prime}){\rm d}x^{\prime}=p(x|\theta){\rm d}x$ , under the action of a transformation group $(x^{\prime},\theta^{\prime})=h(x,\theta)$ , e.g., translation, scaling or rotation of coordinates. If the priors are really uninformative, consistency demands that we should make the same Bayesian inference, which implies that the priors should also be invariant to the transformation and satisfy $p(\theta^{\prime}){\rm d}\theta^{\prime}=p(\theta){\rm d}\theta$ (Jaynes, 2003). For two special types of parameters, this leads to unique choices for expressing ignorance. These are the location parameters and the scale parameters. An example is the mean $\mu$ and dispersion $\sigma$ of a normal distribution $\mathcal{N}(x|\mu,\sigma^{2})$ which are the location and the scale parameters respectively. The likelihood $\mathcal{N}(x|\mu,\sigma^{2})$ is invariant under transformation $(x^{\prime},\mu^{\prime})=(x+b,\mu+b)$ , demanding invariance for the prior leads to $p(\mu)={\rm constant}$ . Similarly, $\mathcal{N}(x|\mu,\sigma^{2})$ is also invariant under $(x^{\prime}-\mu^{\prime},\sigma^{\prime})=(a(x-\mu),a\sigma)$ , which leads to $p(\sigma)\propto 1/\sigma$ . In general, $\mu$ and $\sigma$ are location and scale parameters if likelihood is of the form $f((x-\mu)/\sigma)/\sigma$ .
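The two invariance arguments can be spelled out in one line each. The steps below are a sketch of the standard reasoning, assuming that the same functional form $p$ must hold before and after the transformation.

```latex
\begin{align}
p(\mu')\,{\rm d}\mu' = p(\mu)\,{\rm d}\mu,\quad \mu'=\mu+b
  &\;\Rightarrow\; p(\mu+b)=p(\mu)\ \text{for all } b
  \;\Rightarrow\; p(\mu)={\rm constant},\\
p(\sigma')\,{\rm d}\sigma' = p(\sigma)\,{\rm d}\sigma,\quad \sigma'=a\sigma
  &\;\Rightarrow\; a\,p(a\sigma)=p(\sigma)\ \text{for all } a>0
  \;\Rightarrow\; p(\sigma)\propto 1/\sigma .
\end{align}
```

The last step of the second line follows by setting $\sigma=1$, which gives $p(a)=p(1)/a$. Equivalently, the scale prior is uniform in $\ln\sigma$.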

Another commonly used technique to specify ignorance priors is the Jeffreys rule,

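$$p(\theta)\propto\det(\mathcal{I}(\theta))^{1/2},\quad{\rm where}\quad[\mathcal{I}(\theta)]_{ij}=-\int p(x|\theta)\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\ln p(x|\theta)\,{\rm d}x$$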

is the Fisher information matrix and $\theta$ a vector of parameters. It is based on the idea that the prior should be invariant to reparameterization of $\theta$ . Applying it to the case where the likelihood is a normal distribution $\mathcal{N}(x|\mu,\sigma^{2})$ gives $p(\mu)={\rm constant}$ (for a fixed $\sigma$ ) and $p(\sigma)\propto 1/\sigma$ (for a fixed $\mu$ ). However, when applied to both $\mu$ and $\sigma$ together, it gives $p(\mu,\sigma)\propto 1/\sigma^{2}$ . To avoid this contradiction the rule was modified to

$$p(\mu_{1},\ldots,\mu_{k},\theta)\propto\det(\mathcal{I}(\theta))^{1/2},$$

where $\mu_{i}$ are location parameters and $\mathcal{I}(\theta)$ is calculated keeping them fixed.
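The results quoted for the normal distribution can be verified with a short calculation. The derivation below is a worked sketch that takes the Fisher information with the conventional minus sign, $\mathcal{I}_{ij}=-\langle\partial^{2}\ln p/\partial\theta_{i}\partial\theta_{j}\rangle$, and uses $\langle x-\mu\rangle=0$ and $\langle(x-\mu)^{2}\rangle=\sigma^{2}$.

```latex
\begin{align}
\ln\mathcal{N}(x|\mu,\sigma^{2}) &= -\ln\sigma-\frac{(x-\mu)^{2}}{2\sigma^{2}}-\tfrac{1}{2}\ln 2\pi,\\
-\left\langle\frac{\partial^{2}\ln\mathcal{N}}{\partial\mu^{2}}\right\rangle &= \frac{1}{\sigma^{2}},\qquad
-\left\langle\frac{\partial^{2}\ln\mathcal{N}}{\partial\mu\,\partial\sigma}\right\rangle
  = \frac{2\langle x-\mu\rangle}{\sigma^{3}} = 0,\\
-\left\langle\frac{\partial^{2}\ln\mathcal{N}}{\partial\sigma^{2}}\right\rangle
  &= -\frac{1}{\sigma^{2}}+\frac{3\langle(x-\mu)^{2}\rangle}{\sigma^{4}} = \frac{2}{\sigma^{2}},\\
\mathcal{I}(\mu,\sigma) &= \begin{pmatrix}1/\sigma^{2} & 0\\ 0 & 2/\sigma^{2}\end{pmatrix},\qquad
\det(\mathcal{I})^{1/2}=\frac{\sqrt{2}}{\sigma^{2}}\propto\frac{1}{\sigma^{2}}.
\end{align}
```

With $\sigma$ held fixed only the $\mu\mu$ element enters, and it does not depend on $\mu$, so $p(\mu)$ is constant. With $\mu$ held fixed only the $\sigma\sigma$ element enters, so $p(\sigma)\propto 1/\sigma$. Treating $\mu$ as a location parameter in the modified rule therefore gives $p(\mu,\sigma)\propto 1/\sigma$.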

The principle of maximum entropy (Jaynes, 1957) is also helpful for selecting priors. Suppose we are interested in knowing the pdf of a variable, e.g., the probability of a given face of a six-faced die landing up. Suppose we also have some macroscopic constraint available to us, e.g., the mean value obtained when the die is rolled a large number of times. Such a constraint cannot uniquely identify a pdf but can be used to rule out a number of pdfs. The principle says that out of all possible pdfs satisfying the constraint, the most likely one is the one having maximum entropy, where the entropy is defined as $S=-\sum p_{i}\log p_{i}$ . We now use this principle to derive the most likely distribution of a variable for two common cases.

•

If for a variable $x$ we know the expectation value $\bar{x}$ and the fact that it lies in the range $[0,\infty)$ then the maximum entropy distribution of $x$ is $p(x|\bar{x})=\exp(-x/\bar{x})/\bar{x}.$

•

If $\bar{x}$ and variance $\sigma^{2}=\langle(x-\bar{x})^{2}\rangle$ are known, then $p(x|\bar{x},\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\bar{x})^{2}}{2\sigma^{2}}\right).$
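The die example mentioned above can be worked out numerically. For a discrete variable with a known mean, maximizing the entropy with Lagrange multipliers gives $p_{i}\propto\exp(-\lambda i)$, the discrete analogue of the exponential distribution in the first case. The sketch below solves for $\lambda$ by bisection; the function name `maxent_die` and the chosen mean values are illustrative assumptions.

```python
import numpy as np

def maxent_die(mean, faces=np.arange(1, 7), tol=1e-12):
    """Maximum entropy face probabilities of a die, given only its mean."""
    centred = faces - faces.mean()  # centring keeps the exponentials moderate

    def mean_for(lam):
        w = np.exp(-lam * centred)
        return np.sum(faces * w) / np.sum(w)

    lo, hi = -50.0, 50.0  # mean_for decreases monotonically with lam
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_for(mid) > mean:
            lo = mid  # mean too high: a larger lam is needed
        else:
            hi = mid
    w = np.exp(-0.5 * (lo + hi) * centred)
    return w / w.sum()

p_fair = maxent_die(3.5)    # mean of a fair die: the uniform distribution
p_loaded = maxent_die(4.5)  # a higher mean shifts weight to the high faces
print(p_fair)
print(p_loaded)
```

With a mean of 3.5 the constraint carries no extra information and the solution is uniform, which recovers Laplace's principle of insufficient reason.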

2.4 Fitting a straight line

We now consider the Bayesian solution for fitting a straight line in detail (see also Jaynes, 1991; Hogg, Bovy & Lang, 2010). We first discuss the solution for the general case where we have uncertainties on both $x$ and $y$ coordinates and then discuss the case where the uncertainties are unknown. Suppose we have a collection of points $(X=\{x_{1},...,x_{N}\}$, $Y=\{y_{1},...,y_{N}\})$, with uncertainties $\mathbf{\Sigma}=\{\mathbf{\Sigma}_{1},...,\mathbf{\Sigma}_{N}\}$. Here $\mathbf{\Sigma}_{i}$ is the covariance matrix defined as

[TABLE]

We want to fit a line $y=ax+b$ to these data. For the time being, we assume $\mathbf{\Sigma}_{i}$ to be a diagonal matrix with $\sigma_{xy,i}=0$ . Let $(x,y)$ be the true values corresponding to the point $(x_{i},y_{i})$ . Then the probability of measuring the point $(x,y)$ at $(x_{i},y_{i})$ is

[TABLE]

Let us consider a generative model for the line. We consider the pdf of a line to be described by a Gaussian with width $\sigma_{p}$ along a direction perpendicular to the line and width $\sigma_{h}$ along the line. Here, $\sigma_{p}$ can be thought of as an intrinsic scatter about the linear relation that we wish to investigate. In the limit $\sigma_{h}\to\infty$ , the probability of a point $(x,y)$ to be sampled from this generative model is

[TABLE]

Hence, the probability of $(x_{i},y_{i})$ being sampled from the generative model of the line is

[TABLE]

where $\sigma_{\perp,i}=\sqrt{(\sigma_{y,i}^{2}+a^{2}\sigma_{x,i}^{2})/(1+a^{2})}$ is the component of the error vector perpendicular to the line, and $d_{i}=(y_{i}-ax_{i}-b)/\sqrt{1+a^{2}}$ is the perpendicular distance of the point from the line. For a general matrix $\mathbf{\Sigma}_{i}$, $\sigma_{\perp,i}^{2}=\hat{\mathbf{u}}^{T}\mathbf{\Sigma}_{i}\hat{\mathbf{u}}$, where $\hat{\mathbf{u}}=(-a/\sqrt{1+a^{2}},1/\sqrt{1+a^{2}})$ is a unit vector perpendicular to the line. For the full sample,

[TABLE]

If we desire to compute $a$ and $b$ , then

[TABLE]

Henceforth, $A$ is a normalization constant which may be different in different equations. Here $p(a,b)$ is the prior distribution of the parameters of the line. The two common choices for the prior are the uniform (flat) and the Jeffreys prior. Neither is appropriate. Given the rotational symmetry in the problem, a sensible choice is to have priors that are symmetric with respect to rotation. Let $\theta=\tan^{-1}a$ be the angle made by the line with the $x$ axis, and $b_{\perp}=b\cos(\theta)$ be the distance of the line from the origin. A uniform prior on $\theta$ and $b_{\perp}$ is symmetric with respect to rotation. This leads to

[TABLE]

In Figure 4, we graphically show how a prior uniform in $a$ differs from a prior uniform in $\theta=\tan^{-1}(a)$. In the left panel, we show straight lines uniformly spaced in $a$. The lines tend to crowd at high values of $a$, and this can bias the estimate of the slope $a$. In the right panel, the lines are uniformly spaced in $\theta$, and there is no crowding effect.

The log-likelihood of the full solution after taking the prior into account is

[TABLE]
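As an illustration of how these pieces fit together, the sketch below evaluates a log posterior for the line under stated assumptions: diagonal covariance matrices ($\sigma_{xy,i}=0$), a Gaussian of variance $\sigma_{p}^{2}+\sigma_{\perp,i}^{2}$ in the perpendicular distance $d_{i}$ for each point, and a prior uniform in $\theta$ and $b_{\perp}$. Changing variables from $(\theta,b_{\perp})$ to $(a,b)$ gives a Jacobian of $(1+a^{2})^{-3/2}$, which is the prior term used here. The function name and argument layout are hypothetical.

```python
import math

def line_log_posterior(a, b, x, y, sx, sy, sigma_p=0.0):
    """Log posterior (up to an additive constant) for the line y = a*x + b.

    x, y   : measured coordinates
    sx, sy : 1-sigma uncertainties on x and y (covariances assumed zero)
    sigma_p: intrinsic scatter perpendicular to the line
    """
    norm = 1.0 + a * a
    # Prior uniform in theta = atan(a) and b_perp = b*cos(theta)
    logp = -1.5 * math.log(norm)
    for xi, yi, sxi, syi in zip(x, y, sx, sy):
        d = (yi - a * xi - b) / math.sqrt(norm)               # perpendicular distance
        var_perp = (syi ** 2 + a * a * sxi ** 2) / norm       # sigma_perp squared
        var = sigma_p ** 2 + var_perp
        logp += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * d * d / var
    return logp

if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # points lying exactly on y = 2x + 1
    err = [0.1] * 5
    print(line_log_posterior(2.0, 1.0, xs, ys, err, err))
    print(line_log_posterior(2.5, 1.0, xs, ys, err, err))
```

A function of this shape can be handed directly to any of the samplers discussed later, since they only need the log of the target density up to a constant.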

We now study the case where $\mathbf{\Sigma_{i}}$ is unknown and $\sigma_{p}=0$ . For simplicity, we assume the uncertainty is the same for all data points, i.e., $\sigma_{\perp,i}=\sigma_{\perp}$ .

[TABLE]

and integrate over $\sigma_{\perp}$ using Jeffreys prior $p(\sigma_{\perp}|a,b)=1/\sigma_{\perp}$ to arrive at

[TABLE]

So, if we ignore the prior factor, the best fit line is simply the line that minimizes the sum of the squared perpendicular distances of points from the line.

2.5 Model comparison

When we have multiple models to explain data, we are faced with the question of which model is better. There is no unique definition of better and depending upon what we mean by better we can come up with different criteria to compare models. We have two main schools of thought, a) to compare the probability of the model given the data and b) to compare the expected predictive accuracy of the model for the future data. The former is inherently a Bayesian approach and is known as Bayesian model comparison. The latter is inspired by frequentist ideas but can also be argued from a Bayesian perspective (Vehtari & Ojanen, 2012; Gelman, Hwang & Vehtari, 2014).

2.5.1 Bayesian model comparison

In the Bayesian formulation, the usefulness of a model is indicated by the probability of a model $M$ given the data $D$ ,

$$p(M|D)=\frac{p(D|M)\,p(M)}{p(D)}.$$

The prior model probability $p(M)$ is generally assumed to be the same for all models under consideration. Note in some cases it may not be so, and we might have more reason to believe one model over the other. The $p(D)$ is the same for all models, so it is irrelevant when comparing models. Thus the main thing we need to compute is the evidence $p(D|M)$ (also known as the marginal likelihood). Hence, for two models $M_{1}$ and $M_{2}$, the odds ratio in favor of $M_{2}$ compared to $M_{1}$ is mainly determined by the ratio of their evidences, $B_{21}$, also known as the “Bayes factor” (for a review and a guide to interpreting the Bayes factor, see Kass & Raftery, 1995).

[TABLE]

For some given data $D$ and a model $M$ parameterized by $\theta$ , we have

$$p(\theta|D,M)=\frac{p(D|\theta,M)\,p(\theta|M)}{p(D|M)}.\tag{36}$$

The evidence $p(D|M)$ appears as the denominator on the right hand side and can be obtained by integrating both sides of Equation (36) over all $\theta$ . For properly normalized quantities, the left hand side integrates to unity, leading to $p(D|M)=\int p(D|\theta,M)p(\theta|M){\rm d}\theta$ .

Note, the Bayes factor depends upon the adopted range of the prior which leads to some conceptual difficulties (see the paradox in Lindley, 1957). The range of prior is not an issue for parameter estimation but it is for model selection; we cannot use improper priors. In most cases, we do have a reasonable sense of the range of priors and they are unlikely to extend to infinity. To better understand the role of priors, consider two models $M_{1}$ and $M_{2}$ , where $M_{2}$ has a free parameter $\theta$ , while $M_{1}$ has no free parameter (with $\theta$ being fixed to $\theta_{0}$ ). Let $\Delta\theta_{\rm likelihood}$ be the characteristic width of the likelihood distribution and $\Delta\theta_{\rm prior}$ the range of a uniform prior which encloses the likelihood peak. The Bayes factor in favor of model $M_{2}$ as compared to $M_{1}$ is then

[TABLE]

The first term on the right hand side will in general be greater than one and will favor $M_{2}$ , as the simpler model $M_{1}$ is a special case of $M_{2}$ . However, the second term penalizes $M_{2}$ if it has a large range in priors.

The conceptual difficulty associated with the dependence of the Bayes factor on the adopted prior range is alleviated if one thinks of a hypothesis as a specification of a model as well as the prior on its parameters. A model $M_{2}$ with a larger prior range allows for a larger number of possible data sets consistent with the hypothesis as compared to a simpler model $M_{1}$ with a narrow prior range. Hence $p(D|M_{2})$, being a normalized probability over possible data sets, will be lower as compared to $p(D|M_{1})$ (MacKay, 2003). Also, $M_{1}$ is more precise as a hypothesis as compared with $M_{2}$.

If we have more free parameters in a model, the penalty term in the Bayes factor will be higher, being of the form $\prod_{i=1}^{d}\Delta\theta_{\rm likelihood,i}/\Delta\theta_{\rm prior,i}$ . In this sense, the Bayes factor has a built-in safeguard to prevent overfitting (a model with a large number of free parameters will fit a given set of data better but will perform poorly when presented with new data).
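The dependence of the Bayes factor on the prior range can be checked on a toy problem. The sketch below assumes Gaussian data with known $\sigma$, a model $M_{1}$ with $\theta$ fixed at $\theta_{0}$, and a model $M_{2}$ with a uniform prior of width `prior_width` centred on $\theta_{0}$; the evidence of $M_{2}$ is obtained by trapezoidal integration. The data values and function names are made up for illustration.

```python
import math

def log_like(theta, data, sigma=1.0):
    """Gaussian log-likelihood of the data for mean theta and known sigma."""
    return sum(-0.5 * math.log(2.0 * math.pi * sigma ** 2)
               - 0.5 * ((y - theta) / sigma) ** 2 for y in data)

def bayes_factor_21(data, prior_width, theta0=0.0, sigma=1.0, n_grid=20001):
    """B_21 for M2 (theta uniform on theta0 +/- prior_width/2) against M1 (theta = theta0)."""
    lo = theta0 - 0.5 * prior_width
    h = prior_width / (n_grid - 1)
    ref = log_like(theta0, data, sigma)
    total = 0.0
    for k in range(n_grid):
        weight = 0.5 if k in (0, n_grid - 1) else 1.0       # trapezoidal rule
        total += weight * math.exp(log_like(lo + k * h, data, sigma) - ref)
    # evidence of M2 = (1/prior_width) * integral of the likelihood over the prior range
    return total * h / prior_width

if __name__ == "__main__":
    data = [0.8, 1.3, 0.4, 1.1, 0.9]
    for width in (10.0, 100.0):
        print(width, bayes_factor_21(data, width))
```

Once the prior encloses the likelihood peak, widening it by a factor of ten lowers $B_{21}$ by the same factor, which is the penalty term described above.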

BIC: Bayesian information criterion. WBIC: Widely applicable Bayesian information criterion.

Computing the Bayes factor or the Bayesian evidence is computationally challenging. Generally, the likelihood is peaked and confined to a narrow region in the prior range, but has long tails whose contributions cannot be neglected. Some commonly employed numerical techniques are (1) simulated annealing, (2) nested sampling, (3) Laplace’s approximation, (4) Lebesgue integration theory (Weinberg, 2012), and (5) the Savage-Dickey density ratio (Verdinelli & Wasserman, 1995). Two useful approximations of the Bayes free energy $\mathcal{F}=-\ln p(D|M)$ are

[TABLE]

Here, ${\rm E}_{\theta}^{\beta}$ denotes expectation taken over the posterior distribution $p(\theta|Y)\propto p(Y|\theta)^{\beta}p(\theta)$ of $\theta$. The case of $\beta=1$ corresponds to the Bayesian estimation of the posterior. The posterior can be sampled using an MCMC algorithm. Assuming weak priors and that the posterior is asymptotically normal, we have $\mathcal{F}=\mathrm{BIC}+O(1)$. WBIC is an improved version of BIC, which is also applicable for singular statistical models where BIC fails. A model is singular if the Fisher information matrix is not positive definite, which typically occurs when the model contains hierarchical layers or has hidden variables.

2.5.2 Predictive methods for Model comparison

A statistical model $p(x|\theta)$ can be thought of as an approximation of the true distribution $q(x)$ from which the observed data $Y=\{y_{1},y_{2},...,y_{n}\}$ were generated. $Y$ represents a set of independently observed data points such that $p(Y|\theta)=\prod_{i=1}^{n}p(y_{i}|\theta)$. The Bayesian predictive distribution can then be defined as $p(x|Y)={\rm E}_{\theta}[p(x|\theta)]$, while the maximum likelihood estimate is given by $p(x|\hat{\theta}(Y))$. Predictive methods judge models by their ability to fit future data $X=\{x_{1},x_{2},...,x_{n}\}$, e.g., via the negative log-likelihood $-\ln p(X|Y)$. Given that we do not have future data, the idea is to measure the out-of-sample prediction error from the sample at hand. Cross validation is a natural way to do this, where we divide the current data set into training and testing samples. But this is computationally costly. Hence, alternative criteria have been developed. We start by computing the training error $T_{e}=-\frac{1}{n}\sum_{i=1}^{n}\ln p(y_{i}|Y)$. However, this is a biased estimator of $\mathbb{E}_{x}[-\ln p(x|Y)]$ as the data are used twice, once to estimate the model and once more to compute the log likelihood of the data. If we have more parameters in the model, it will certainly fit the given data better but will also give rise to larger variance in the estimator, and we need to penalize the model for this. This variance, which represents the effective degrees of freedom in the model, can be calculated from the data and the model. A list of some useful information criteria based on the above idea is given below. They can be easily computed using samples of $\theta$ obtained by an MCMC simulation of the posterior $p(\theta|Y)$.

AIC: Akaike information criterion. DIC: Deviance information criterion. WAIC: Widely applicable information criterion.

[TABLE]

Here, ${\rm Var}_{\theta}^{1}$ denotes variance taken over the posterior distribution $p(Y|\theta)p(\theta)$ of $\theta$ . The first term is a measure of how well the model fits the observed data while the second term is a penalty for the degrees of freedom $d$ in the model.
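A minimal sketch of how such a criterion can be evaluated from posterior samples is given below. It uses the deviance-scale convention of Gelman, Hwang & Vehtari (2014), ${\rm WAIC}=-2({\rm lppd}-p_{\rm WAIC})$, which may differ from the normalization used in the text by a constant factor; that convention, the toy model (normal data with unknown mean, known unit variance, flat prior) and all function names are assumptions made for illustration.

```python
import math
import random

def log_norm_pdf(y, mu, sigma=1.0):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((y - mu) / sigma) ** 2

def waic(data, theta_samples, log_pdf):
    """Return (lppd, p_waic, waic) with waic = -2 * (lppd - p_waic)."""
    s = len(theta_samples)
    lppd, p_waic = 0.0, 0.0
    for y in data:
        logs = [log_pdf(y, th) for th in theta_samples]
        m = max(logs)
        # log of the posterior-averaged likelihood of y, via log-sum-exp
        lppd += m + math.log(sum(math.exp(l - m) for l in logs) / s)
        # posterior variance of the log-likelihood of y (penalty term)
        mean = sum(logs) / s
        p_waic += sum((l - mean) ** 2 for l in logs) / (s - 1)
    return lppd, p_waic, -2.0 * (lppd - p_waic)

def demo(n=50, n_samples=2000, seed=1):
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ybar = sum(data) / n
    # Flat prior and known sigma = 1: the posterior of mu is N(ybar, 1/n),
    # so it can be sampled directly in place of an MCMC run.
    mu_samples = [rng.gauss(ybar, 1.0 / math.sqrt(n)) for _ in range(n_samples)]
    return waic(data, mu_samples, log_norm_pdf)

if __name__ == "__main__":
    print(demo())
```

For this one-parameter model the penalty `p_waic` should come out close to one, illustrating its role as an effective number of degrees of freedom.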

In general, the predictive criteria have a well-defined information-theoretic interpretation (Burnham & Anderson, 2002; Watanabe, 2010). Specifically, the expected values of AIC and WAIC are equivalent to the expected Kullback-Leibler divergence $\int q(x)\ln[q(x)/p(x|Y)]\,{\rm d}x$ of the predictive distribution from the true distribution, where the expectation is taken over the random realizations of the observed data set $Y$, which samples the true distribution $q(x)$. Also, in the asymptotic limit of large sample size, both AIC and WAIC are equivalent to leave-one-out cross-validation (LOOCV).

An extra parameter in a model need not necessarily contribute to extra variance in the predictive density, e.g., if we have informative priors on the parameter, the likelihood has a very weak dependence on the parameter or if the model is hierarchical then multiple parameters might be restricted. The use of AIC can be problematic in such cases. DIC and WAIC overcome this problem by estimating the effective degrees of freedom directly from the likelihood function of the data and samples of $\theta$ obtained from the posterior $p(\theta|Y)$ .

WAIC offers some additional advantages as compared to AIC and DIC. AIC and DIC use a point estimate for $\theta$ when computing predictive density, whereas WAIC uses the Bayesian predictive density. If a model is singular, criteria such as AIC, DIC and BIC do not work well. In contrast, WAIC works for such cases, and in the asymptotic limit of large sample size, WAIC is always equivalent to Bayesian LOOCV.

It is instructive to study the differences between BIC and AIC, as they represent two very different approaches to the problem of model selection (for a detailed discussion, see Burnham & Anderson, 2002). Due to the presence of the $\ln n$ term, for $n>7$ the BIC penalizes free parameters more heavily than AIC. So BIC is more parsimonious or cautious when it comes to admitting new parameters in a model. In situations where two models can give rise to the same predictive distribution, BIC will favor the model with fewer degrees of freedom while AIC will treat them equally. An example is nested models, where a simpler model can be considered as a special case of a complex model with a few of its parameters held fixed. Interestingly, AIC can also be argued to be using the approach of Bayes factors, but with a prior whose variance decreases with sample size $n$, whereas BIC would correspond to the choice of a weak prior with fixed variance (Smith & Spiegelhalter, 1980).

To conclude, the Bayesian and the predictive methods both have their strengths and weaknesses. If the choice of priors is well justified, then the methods based on Bayes factor are best suited for model selection. However, if our aim is best predictive accuracy for future data, predictive methods like WAIC are a better choice.

3 Monte Carlo methods for Bayesian computations

Having discussed how to set up problems in the Bayesian framework, we now discuss methods to perform the inference, i.e., how to estimate the pdf of parameters given the data. Except for some simple cases, closed form analytical solutions are in general not available. So one makes use of Monte Carlo based methods to sample from the desired distribution. The most popular method to do this today is the Markov Chain Monte Carlo (MCMC) method. MCMC is a class of methods for sampling a pdf using a Markov chain whose equilibrium distribution is the desired distribution. Once we have a sample distributed according to some desired distribution, we can compute expectation values and integrals of various quantities in a process analogous to Monte Carlo integration. The word Monte Carlo in MCMC comes from the use of random numbers to drive the Markov process and the close analogy to Monte Carlo integration schemes. Note in conventional Monte Carlo integration, the random samples are statistically independent whereas in MCMC they are correlated. We first broach the theory behind Markov chains and then discuss specific MCMC methods based on it.

3.1 Markov Chain

A Markov chain is a sequence of random variables $X_{1},...,X_{n}$ such that, given the present state, the future and past are independent. It is formally written as

$$P(X_{n+1}|X_{1},...,X_{n})=P(X_{n+1}|X_{n}).$$

In other words, the conditional distribution of the future state $X_{n+1}$ depends only upon the present state $X_{n}$. If the probability of transition is independent of $n$, it is a time-homogeneous chain. Such a chain is defined by specifying the probabilities of transitioning from one state to another. To simplify mathematical notation, we sometimes consider the state space to be continuous and sometimes discrete. But the presented results are equally valid for either type of space. For a continuous state space where a probability density can be defined we can write the transition probability as

[TABLE]

For a discrete state space the transition probability is a matrix and is written as $K_{xy}$ . On a given state space, a time-homogeneous Markov chain has a stationary distribution (invariant measure) $\pi$ if

$$\pi(y)=\int\pi(x)K(x,y)\,{\rm d}x\quad{\rm (continuous)},\qquad\pi=\pi K\quad{\rm (discrete)}.$$

A Markov chain is irreducible if it can go from any state $x$ of a discrete state space to any other state $y$ in a finite number of steps, i.e., there exists an integer $n$ such that $K^{n}_{xy}>0$ . If a chain having a stationary distribution is irreducible, the stationary distribution is unique, and the chain is positive recurrent. For an aperiodic, positive recurrent chain with stationary distribution $\pi$ , the distribution is limiting (equilibrium distribution). This means if we start with any initial distribution $\lambda$ (a row vector specifying probability over states of a discrete state space) and apply the transition operator $K$ (a matrix) many times, the final distribution will approach the stationary distribution $\pi$ (a row vector),

$$\lim_{n\to\infty}\lambda K^{n}=\pi.$$

For an irreducible Markov chain with a unique stationary distribution $\pi$ , there is a law of large numbers which says that the expectation value of a function $g(x)$ over $\pi$ approaches the average taken over the output of a Markov chain,

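$$\frac{1}{n}\sum_{i=1}^{n}g(X_{i})\to{\rm E}_{\pi}[g(x)]=\int g(x)\pi(x)\,{\rm d}x\quad{\rm as}\quad n\to\infty.$$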

This property allows one to compute Monte Carlo estimates of specific quantities of interest from a Markov chain. Techniques that do this are known as Markov chain Monte Carlo or MCMC.
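Both properties, convergence to the stationary distribution and the law of large numbers, can be seen on a small discrete chain. The sketch below uses a hypothetical three-state transition matrix chosen for illustration; it is irreducible and aperiodic, and its stationary distribution is $(0.25, 0.5, 0.25)$.

```python
import random

# Hypothetical 3-state transition matrix; row x holds the probabilities K_xy.
K = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

def step_distribution(lam, K):
    """One application of the transition operator: (lam K)_y = sum_x lam_x K_xy."""
    n = len(K)
    return [sum(lam[x] * K[x][y] for x in range(n)) for y in range(n)]

def limiting_distribution(lam, K, n_steps=200):
    """Apply K repeatedly to an initial distribution lam (a row vector)."""
    for _ in range(n_steps):
        lam = step_distribution(lam, K)
    return lam

def ergodic_average(g, K, n_steps=50000, seed=0):
    """Average of g over the states visited by a single simulated chain."""
    rng = random.Random(seed)
    states = range(len(K))
    x, total = 0, 0.0
    for _ in range(n_steps):
        x = rng.choices(states, weights=K[x])[0]
        total += g(x)
    return total / n_steps

if __name__ == "__main__":
    print(limiting_distribution([1.0, 0.0, 0.0], K))   # start in state 0
    print(limiting_distribution([0.0, 0.0, 1.0], K))   # start in state 2
    print(ergodic_average(lambda x: x, K))             # expectation of x under pi
```

Both starting distributions end at the same $\pi$, and the chain average of $g(x)=x$ approaches its expectation under $\pi$, even though successive states are correlated.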

A chain having a stationary distribution is said to be reversible if the chain starting from a stationary distribution looks the same when run forward or backward in time. In other words, if $X_{n}$ has distribution $\pi$ then the pair $(X_{n},X_{n+1})$ has the same joint distribution as $(X_{n+1},X_{n})$ .

[TABLE]

For the transition kernel $K$ this means

$$\pi(x)K(x,y)=\pi(y)K(y,x),$$

and is known as the condition of detailed balance. For a Markov chain, it is not necessary to satisfy reversibility in order to have a stationary distribution. However, reversibility guarantees the existence of a stationary distribution, and is thus a stronger condition. This is the reason that most MCMC algorithms are designed to satisfy detailed balance.

3.2 Metropolis Hastings algorithm

The most general MCMC algorithm is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Suppose we are interested in sampling a distribution $f(x)$ on a state space $E$, with $x\in E$. To construct a transition kernel $K(x,y)$ to go from $x$ to $y$, the MH algorithm uses a two step process:

•

Specify a proposal distribution $q(y|x)$ .

•

Accept draws from $q(y|x)$ with acceptance ratio $\alpha(x,y)={\rm min}\left[1,\frac{f(y)q(x|y)}{f(x)q(y|x)}\right]$ .

So the transition kernel is given by $K(x,y)=q(y|x)\alpha(x,y)$ . The full algorithm is as follows

The transition kernel of the MH algorithm is reversible and satisfies detailed balance, $f(x)K(x,y)=f(y)K(y,x)$ . Note the reversibility condition by itself does not lead to a unique form for the acceptance ratio $\alpha(x,y)$ and alternatives exist (Barker, 1965). However, it has been shown that the acceptance ratio of the MH algorithm results in a chain with the fastest mixing rate (Peskun, 1973).

There are multiple ways to construct the proposal distribution $q$ each leading to a new version of the MH algorithm.

•

Symmetric Metropolis: $q(y|x)=q(x|y)$ which simplifies the acceptance probability to ${\rm min}\left\{1,f(y)/f(x)\right\}$ ; this is the version that was proposed by Metropolis and colleagues.

•

Random walk Metropolis-Hastings (RWMH): $q(y|x)=q(y-x)$ ; the direction and distance of the new point from the current point is independent of the current point. Common choices are $N(x,\sigma^{2})$ and ${\rm Uniform}(x-\sigma,x+\sigma)$ .

•

Independence sampler: $q(y|x)=q(y)$ ; i.e., the new state is drawn independent of the current state. The acceptance probability is given by ${\rm min}\left\{1,\frac{f(y)q(x)}{f(x)q(y)}\right\}$ , a generalization of the accept-reject algorithm. The quantity $q(x)$ should resemble $f(x)$ but with longer tails.

•

Langevin algorithm: $q(y|x)\sim N(x+\frac{\sigma^{2}}{2}\nabla\log f(x),\sigma^{2})$ ; this is useful when the gradient is available.

Except when $f(y)=f(x)$ (uniform target density), the mean of the acceptance ratio $\alpha$ is always less than unity. Decreasing $\sigma$ in the RWMH algorithm increases $\alpha$ but lowers the independence of the sampler. Increasing $\sigma$ improves the independence but lowers $\alpha$ . In the Langevin algorithm, one makes use of the information in the gradient to allow faster mixing of the chain.
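The trade-off between acceptance and independence can be seen with a minimal random walk Metropolis-Hastings sketch. The target (a standard normal), the Gaussian proposal and the function names are assumptions for illustration; because the proposal is symmetric, the acceptance probability reduces to ${\rm min}\{1,f(y)/f(x)\}$.

```python
import math
import random

def rwmh(log_f, x0, sigma, n_steps, seed=0):
    """Random walk Metropolis-Hastings with proposal N(x, sigma^2).

    Returns the chain and the fraction of accepted proposals.
    """
    rng = random.Random(seed)
    x, lx = x0, log_f(x0)
    chain, n_acc = [], 0
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, sigma)          # propose
        ly = log_f(y)
        # accept with probability min(1, f(y)/f(x))
        if rng.random() < math.exp(min(0.0, ly - lx)):
            x, lx = y, ly
            n_acc += 1
        chain.append(x)                        # a rejected move repeats the current state
    return chain, n_acc / n_steps

def log_std_normal(x):
    return -0.5 * x * x                        # log density up to a constant

if __name__ == "__main__":
    for sigma in (0.1, 2.4, 20.0):
        chain, acc = rwmh(log_std_normal, 0.0, sigma, 20000)
        print(sigma, acc, sum(chain) / len(chain))
```

A very small $\sigma$ accepts almost every move but explores slowly, while a very large $\sigma$ rejects most proposals; an intermediate width gives an acceptance fraction between the two.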

3.3 Gibbs sampling

The Gibbs sampler introduced by Geman & Geman (1984) is one of the most popular computational methods for doing Bayesian computations. Suppose we want to sample $f(x)$ where $x\in\chi\subseteq\mathcal{R}^{d}$ . In Gibbs sampling, the transition kernel $K(x,y)$ is split into multiple steps. In each step, one coordinate is advanced based on its conditional density with respect to other coordinates. The algorithm is as follows:

The full transition kernel is written as,

[TABLE]

Similarly one can define a reverse move,

[TABLE]

It can be easily shown that

[TABLE]

Integrating both sides leads to

[TABLE]

Thus, $f$ is the stationary distribution of the Markov chain formed by the transition kernel $\kappa_{1\to d}(x_{t+1}|x_{t})$ . Note the Gibbs sampler as given above (systematic scan) is not reversible. However, the reversible ones can easily be produced, e.g., at each iteration picking a random component to update (random-scan). The random-scan Gibbs sampler can be viewed as a special case of MH sampler with acceptance ratio ${\rm min}(1,\frac{f(y)q(x|y)}{f(x)q(y|x)})=1$ . It follows that

[TABLE]

Here, $x^{-i}=\{x^{1},...,x^{i-1},x^{i+1},...,x^{d}\}$ and $y^{-i}=x^{-i}$ , as only the $i$ -th component is changed in each step.
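As a concrete case, the sketch below runs a systematic-scan Gibbs sampler on a bivariate normal with zero means, unit variances and correlation $\rho$, for which the conditionals are known in closed form: $x^{1}|x^{2}\sim N(\rho x^{2},1-\rho^{2})$ and likewise for $x^{2}|x^{1}$. The target and the function names are chosen for illustration.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps, seed=0):
    """Systematic-scan Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - rho * rho)       # standard deviation of each conditional
    x1, x2 = 0.0, 0.0
    chain = []
    for _ in range(n_steps):
        x1 = rng.gauss(rho * x2, s)      # draw x1 from p(x1 | x2)
        x2 = rng.gauss(rho * x1, s)      # draw x2 from p(x2 | x1), using the updated x1
        chain.append((x1, x2))
    return chain

def sample_correlation(chain):
    n = len(chain)
    m1 = sum(c[0] for c in chain) / n
    m2 = sum(c[1] for c in chain) / n
    c11 = sum((c[0] - m1) ** 2 for c in chain)
    c22 = sum((c[1] - m2) ** 2 for c in chain)
    c12 = sum((c[0] - m1) * (c[1] - m2) for c in chain)
    return c12 / math.sqrt(c11 * c22)

if __name__ == "__main__":
    chain = gibbs_bivariate_normal(0.8, 20000)
    print(sample_correlation(chain))
```

Every draw is accepted, which is the sense in which the Gibbs step is an MH step with acceptance ratio one.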

3.4 Metropolis within Gibbs

One problem with the Gibbs sampler is that it requires one to sample from the conditional distributions which can be difficult. In such cases, one can replace the sampling of conditional densities with the MH step. This then becomes the Metropolis within Gibbs (MWG) scheme (see Müller, 1991), which is shown in Algorithm 3 (it is implemented in the code that we provide).

Rather than updating all the variables step by step, one can also choose to update a subset of variables together, leading to block updates. The fact that the full sampling of a complicated distribution can be broken up into a sequence of smaller and easier samplings, is the main strength of the Gibbs sampler and has resulted in its widespread use (e.g. Sale, 2012; Sharma et al., 2014).
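A minimal sketch of the Metropolis within Gibbs idea follows. It is not the implementation distributed with the paper; the target (two independent normals with standard deviations 1 and 5), the per-coordinate proposal widths and the function names are assumptions for illustration. Each coordinate is updated in turn with a one-dimensional random walk MH step, using the full density, which is proportional to the conditional density of that coordinate.

```python
import math
import random

def metropolis_within_gibbs(log_f, x0, widths, n_steps, seed=0):
    """Update one coordinate at a time with a 1-D random walk MH step."""
    rng = random.Random(seed)
    x = list(x0)
    lx = log_f(x)
    chain = []
    for _ in range(n_steps):
        for i in range(len(x)):
            y = list(x)
            y[i] += rng.gauss(0.0, widths[i])      # perturb coordinate i only
            ly = log_f(y)
            # f(y)/f(x) equals the ratio of conditionals since the other coordinates are unchanged
            if rng.random() < math.exp(min(0.0, ly - lx)):
                x, lx = y, ly
        chain.append(tuple(x))
    return chain

def log_target(x):
    return -0.5 * (x[0] ** 2 + (x[1] / 5.0) ** 2)

if __name__ == "__main__":
    chain = metropolis_within_gibbs(log_target, [0.0, 0.0], [2.4, 12.0], 20000)
    for k in range(2):
        vals = [c[k] for c in chain]
        mean = sum(vals) / len(vals)
        print(k, mean, sum((v - mean) ** 2 for v in vals) / len(vals))
```

Because each coordinate has its own proposal width, coordinates with very different scales can be tuned separately.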

3.5 Adaptive Metropolis

The efficiency of the MH algorithm depends crucially upon the proposal distribution. By efficiency we typically mean how independent the samples are. If the samples are not independent then they have high correlation. For Markov chains, the correlation falls off with distance between samples. If the correlation is large, this means the mixing in the chain is slow. If the width of the proposal distribution is too small, the acceptance ratio is high but the chain mixes very slowly. If the width of the proposal distribution is too large, the acceptance ratio is too small and the chain again mixes slowly (see Figure 5 for an illustration of this effect). Gelman, Roberts & Gilks (1996) showed that the optimal covariance matrix $\Sigma$ for the RWMH algorithm using the multivariate normal distribution is $(2.38^{2}/\mathcal{D})\Sigma_{\pi}$, where $\mathcal{D}$ is the dimensionality of the space and $\Sigma_{\pi}$ is the covariance matrix of the target distribution $\pi$. The optimal acceptance ratio $\alpha_{\rm opt}$ is 0.44 for dimension $\mathcal{D}=1$ and then falls off with increasing number of dimensions, reaching an asymptotic value of $0.23$ for $\mathcal{D}\to\infty$. The convergence is quite fast ($\alpha=$ [0.441, 0.352, 0.316, 0.279, 0.275, 0.266] for $\mathcal{D}=$ [1, 2, 3, 4, 5, 6]). The efficiency as compared to independent samples is $0.331/\mathcal{D}$.

These results suggest a possible way to choose the optimal proposal distribution. Estimate the covariance matrix $\Sigma_{\pi}$ by a trial run and then use it for the actual run. Even doing this is cumbersome as it is unclear how long the trial run should be. To circumvent this, Haario, Saksman & Tamminen (2001) proposed an adaptive scheme in which $\Sigma$ is updated on the fly using past values. Naively, any scheme that uses proposals that depend upon the full past history violates the Markovian property, i.e., the future should only depend on the present and should be independent of the past. The trick is to adapt the proposal distribution in such a way that it converges to the optimal one. The resulting chain then also converges to the target distribution. Andrieu & Robert (2001) showed that such a scheme can be described as part of a more general adaptive framework.

At the heart of most adaptive algorithms is the Robbins & Monro (1951) recursion. They proposed an iterative stochastic algorithm to find roots of functions that are stochastic, i.e., their algorithm solves $M(x)=\alpha$ , where instead of $M(x)$ the function available is $N(x)$ , which is stochastic and is such that $\langle N(x)\rangle=M(x)$ . Starting with some initial value $x_{0}$ the algorithm to get the $n+1$ th iterate is

[TABLE]

Here $\gamma_{1},\gamma_{2},...$ is a sequence of positive steps. The $x_{n}$ then converge to the true solution provided the sequence $\gamma_{n}$ satisfies

[TABLE]

The first condition makes sure that irrespective of where we start, the solution can be reached in a finite number of steps. The second condition makes sure that we do converge. A possible choice of $\gamma_{n}$ is $\gamma_{n}=\gamma/n^{\beta}$ where $0<\beta<1$ .
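The recursion can be tried on a toy problem. The sketch below assumes that $M(x)$ is increasing, so the iterate is moved up when the noisy evaluation falls below $\alpha$, i.e. $x_{n+1}=x_{n}+\gamma_{n+1}(\alpha-N(x_{n}))$; this sign convention, the test function $M(x)=2x$ with unit Gaussian noise, and the function names are assumptions for illustration.

```python
import random

def robbins_monro(noisy_fn, alpha, x0, n_steps, gamma=0.5, beta=0.6, seed=0):
    """Solve M(x) = alpha from noisy evaluations N(x) with <N(x)> = M(x).

    Uses steps gamma_n = gamma / n**beta and assumes M is increasing.
    """
    rng = random.Random(seed)
    x = x0
    for n in range(1, n_steps + 1):
        step = gamma / n ** beta
        x = x + step * (alpha - noisy_fn(x, rng))
    return x

def noisy_line(x, rng):
    """M(x) = 2x observed with unit-variance Gaussian noise."""
    return 2.0 * x + rng.gauss(0.0, 1.0)

if __name__ == "__main__":
    # The root of 2x = 3 is x = 1.5
    print(robbins_monro(noisy_line, 3.0, 10.0, 20000))
```

The early large steps carry the iterate towards the root from a distant start, and the shrinking steps then average out the noise.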

A nice description of various adaptive algorithms is given by Andrieu & Thoms (2008). Below we discuss Algorithm 4 from their paper which is quite general and is implemented in the software that we provide.

If $\beta$ is too small, the convergence is too slow; if $\beta$ is too large the convergence is too fast and the simulation can quickly lean towards a wrong solution and will take a long time to get out of it. For adaptive MCMC, we find a choice of $\beta=0.6$ to be satisfactory for most test cases. Figure 5d shows an adaptive MCMC chain obtained using Algorithm 4. The adaptive chain looks very similar to the ideal case shown in Figure 5b, and this demonstrates the usefulness of the adaptive MCMC scheme.
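To make the role of the recursion concrete, here is a simplified, scale-only sketch of adaptive RWMH in one dimension. It is not the full algorithm referred to above: only the logarithm of the proposal width is adapted, with steps $\gamma_{n}=n^{-\beta}$, so that the acceptance probability is driven towards a target value of 0.44. The target density and all names are assumptions for illustration.

```python
import math
import random

def adaptive_rwmh(log_f, x0, n_steps, target_acc=0.44, beta=0.6, sigma0=0.1, seed=0):
    """1-D random walk MH whose proposal width is adapted on the fly.

    Returns the chain and the final proposal width.
    """
    rng = random.Random(seed)
    x, lx = x0, log_f(x0)
    log_sigma = math.log(sigma0)
    chain = []
    for n in range(1, n_steps + 1):
        y = x + rng.gauss(0.0, math.exp(log_sigma))
        ly = log_f(y)
        acc_prob = math.exp(min(0.0, ly - lx))
        if rng.random() < acc_prob:
            x, lx = y, ly
        chain.append(x)
        # Robbins-Monro step: widen the proposal when accepting too often,
        # narrow it when accepting too rarely; the step size decays as n**(-beta)
        log_sigma += (acc_prob - target_acc) / n ** beta
    return chain, math.exp(log_sigma)

def log_std_normal(x):
    return -0.5 * x * x

if __name__ == "__main__":
    chain, sigma = adaptive_rwmh(log_std_normal, 0.0, 20000)
    print(sigma)   # for a unit-variance Gaussian target this should settle near 2.4
```

Starting from a deliberately poor width of 0.1, the adapted width approaches the value expected from the $2.38^{2}/\mathcal{D}$ scaling for a unit-variance target, without a separate trial run.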

3.6 Affine invariant sampling

An elegant solution to the problem of tuning the proposal density is to use the idea of ensemble samplers (Gilks, Roberts & George, 1994). Here multiple chains (walkers) are run in parallel but allowed to interact in such a way that they can adapt their proposal densities. Goodman & Weare (2010) provide a general purpose algorithm to do this, known as the affine invariant sampler (see also Christen, 2010). A python implementation of this (emcee: the MCMC hammer, http://dan.iel.fm/emcee/current/) is provided by Foreman-Mackey et al. (2013) and is widely used in astronomy. We now describe this algorithm.

We saw in the previous section that adapting the proposal density can violate the Markovian property of a chain. The trick lies in using the information available in the ensemble but in a way that does not violate the Markovian property. This is achieved by using the idea of partial resampling, which is a generalized version of the Gibbs sampling procedure. Let us consider an ensemble of walkers $X=(x_{1},x_{2},...,x_{L})$ and a Markov chain that walks on a product space with distribution $\Pi(X)=\pi(x_{1})\pi(x_{2})...\pi(x_{L})$. Then if $x_{i}$ is updated conditional on the other walkers $x_{[-i]}=\{x_{1},...,x_{i-1},x_{i+1},...,x_{L}\}$ (the complementary set of walkers), while satisfying detailed balance $\pi(x_{i})p(y_{i}|x_{i},x_{[-i]})=\pi(y_{i})p(x_{i}|y_{i},x_{[-i]})$, then each walker samples from $\pi(x)$.

One way to do this is to choose a point $x_{j}$ from $x_{[-i]}$ and a scalar $r$ with density $g(r)$ , and propose a new point $y$ as

$$y=x_{j}+r(x_{i}-x_{j}).$$

The inverse transformation is given by $x_{i}=x_{j}+(y-x_{j})/r$ . Now if we want the proposal to be symmetric then $q(y_{i}|x_{i},x_{[-i]})=q(x_{i}|y_{i},x_{[-i]})$ , and this implies $g(1/r)=rg(r)$ . A good choice of such a function is

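$$g(r)\propto\begin{cases}1/\sqrt{r} & {\rm if}\ r\in[1/a,a],\\ 0 & {\rm otherwise},\end{cases}$$

where $a>1$ is an adjustable scale parameter.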

To satisfy detailed balance, the acceptance probability is given by $\textrm{min}\left[1,r^{n-1}\frac{\pi(y)}{\pi(x_{i})}\right]$, where $n$ is the dimensionality of the parameter space. The factor $r^{n-1}$ arises because the proposal is restricted along a line and not the full hypersphere over the actual space. This means an appropriate Jacobian has to be calculated; for details see Gilks, Roberts & George (1994) and Roberts & Gilks (1994) (the proof is much easier when using the reversible jump MCMC formalism of Green (1995)).

Moves other than the stretch move can also be constructed, e.g. a proposal $y=x_{i}+W$, where $W$ has a covariance computed from a subset of walkers in the complementary sample. It is also possible to construct algorithms which use a combination of both the stretch and the walk move. Although the Goodman & Weare (2010) affine invariant algorithm elegantly solves the problem of choosing a suitable proposal distribution, it has one drawback. The computational cost of warm-up scales linearly with the number of walkers. Note that, as with most other MCMC algorithms, multimodal distributions (distributions with many well separated peaks) also pose a problem for this algorithm.
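The stretch move is compact enough to sketch in a few lines. The version below is a serial, standard-library illustration and not the emcee implementation; it assumes the common choice $g(r)\propto 1/\sqrt{r}$ on $[1/a,a]$ with $a=2$, sampled by inverting its cumulative distribution, and a hypothetical two-dimensional Gaussian target whose axes differ in scale by a factor of ten.

```python
import math
import random

def stretch_move_sampler(log_pi, walkers, n_steps, a=2.0, seed=0):
    """Ensemble sampler using the stretch move; returns all walker positions visited."""
    rng = random.Random(seed)
    walkers = [list(w) for w in walkers]
    n_walkers, dim = len(walkers), len(walkers[0])
    logp = [log_pi(w) for w in walkers]
    samples = []
    for _ in range(n_steps):
        for i in range(n_walkers):
            j = rng.randrange(n_walkers - 1)
            if j >= i:
                j += 1                                   # partner drawn from the complementary set
            # r from g(r) proportional to 1/sqrt(r) on [1/a, a], by inverse CDF
            r = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            y = [xj + r * (xi - xj) for xi, xj in zip(walkers[i], walkers[j])]
            ly = log_pi(y)
            # acceptance probability min(1, r^(dim-1) * pi(y) / pi(x_i))
            log_acc = (dim - 1) * math.log(r) + ly - logp[i]
            if rng.random() < math.exp(min(0.0, log_acc)):
                walkers[i], logp[i] = y, ly
        samples.extend(tuple(w) for w in walkers)
    return samples

def log_target(x):
    return -0.5 * (x[0] ** 2 + (x[1] / 10.0) ** 2)

def run_demo(n_walkers=20, n_steps=3000, burn=500, seed=42):
    rng = random.Random(seed)
    start = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n_walkers)]
    samples = stretch_move_sampler(log_target, start, n_steps)
    return samples[n_walkers * burn:]                    # drop the warm-up steps

if __name__ == "__main__":
    kept = run_demo()
    for k in range(2):
        vals = [s[k] for s in kept]
        mean = sum(vals) / len(vals)
        print(k, mean, sum((v - mean) ** 2 for v in vals) / len(vals))
```

No proposal width is supplied for either axis: the spread of the ensemble itself sets the scale of the moves, which is why the badly scaled second coordinate needs no special tuning.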

3.7 Convergence Diagnostics

Having studied MCMC methods for sampling from distributions, we now discuss how to detect convergence, i.e., how long we should run an MCMC chain. Several convergence diagnostics have been proposed in the literature. Cowles & Carlin (1996) provide a good review of 13 convergence diagnostics. Other reviews include Brooks & Gelman (1998) and Robert & Casella (2013). Unfortunately, no method can establish convergence with certainty; we can only detect failure to converge. So convergence diagnostics are necessary conditions but not sufficient. Below we present two schemes to monitor convergence. The first scheme makes use of the correlation length of the chain to compute the effective number of independent samples in a chain. The second scheme makes use of multiple chains to see if they are converging.

3.7.1 Effective sample size

Let us begin by estimating how many independent samples we need to get reliable estimates of the mean and variance of a quantity. For a posterior of some variable $x$ with standard deviation $\sigma_{x}$, the Monte Carlo standard error goes as $\sigma_{x}/\sqrt{N}$ for a sample of size $N$. So to measure the mean of a quantity with about 3% error as compared to the overall uncertainty $\sigma_{x}$ we need $N=1000$. Raftery & Lewis (1992) showed that to measure the $0.025$ quantile to within $\pm 0.005$ with probability 0.95 requires about 4000 independent samples.

However, the MCMC is not an independent sampler. As we have seen, the points in an MCMC chain are correlated. Autocorrelation provides a measure of this. Autocorrelation $\rho_{xx}(t)$ for a sequence is the correlation between two points separated by a fixed distance $t$ ; i.e.

$$\rho_{xx}(t)=\frac{{\rm E}[(x_{i}-\bar{x})(x_{i+t}-\bar{x})]}{{\rm E}[(x_{i}-\bar{x})^{2}]}.$$

An automatic windowing procedure is discussed by Sokal (1997) for the computation of integrated autocorrelation (see also Goodman & Sokal, 1989; Goodman & Weare, 2010). Typically the autocorrelation falls off exponentially as $\sim\exp(-t/\tau_{x})$ and $\tau_{x}$ is known as the correlation time (or correlation length). The integrated autocorrelation is defined as $\tau_{\rm int,x}=(1/2)\sum_{t=-\infty}^{\infty}\rho_{xx}(t)$. The variance of the mean of $x$ for a sample of size $N$ can be shown to be

$${\rm Var}(\bar{x})=\frac{2\tau_{\rm int,x}}{N}\sigma_{x}^{2}.$$

So for correlated samples the variance is $2\tau_{\rm int,x}$ times larger than the variance of independent samples. Using $\tau_{\rm int,x}$ , one can measure the number of effective independent samples in a correlated chain $-$ also known as the effective sample size (ESS) $-$ as $N/(2\tau_{\rm int,x})$ and then use it to decide if we have enough samples (e.g., $1000<\>$ ESS $\><4000$ ).
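The following sketch estimates $\tau_{\rm int,x}$ and the ESS for a chain, using a simple automatic window in the spirit of Sokal (1997): the sum over lags is truncated at the first lag $t$ with $t\geq c\,\tau_{\rm int,x}(t)$. The window constant $c=6$, the AR(1) test chain and the function names are assumptions for illustration. For an AR(1) process with coefficient $\phi$ the exact value is $\tau_{\rm int,x}=(1+\phi)/(2(1-\phi))$, which gives 9.5 for $\phi=0.9$.

```python
import random

def integrated_autocorr_time(x, c=6.0):
    """tau_int = 1/2 + sum_{t>=1} rho(t), truncated with an automatic window."""
    n = len(x)
    mean = sum(x) / n
    d = [xi - mean for xi in x]
    c0 = sum(di * di for di in d) / n              # lag-0 autocovariance
    tau = 0.5
    for t in range(1, n):
        ct = sum(d[i] * d[i + t] for i in range(n - t)) / n
        tau += ct / c0
        if t >= c * tau:                           # window is wide enough; stop summing noise
            break
    return tau

def effective_sample_size(x):
    return len(x) / (2.0 * integrated_autocorr_time(x))

def ar1_chain(phi, n, seed=0):
    """Correlated test chain x_{k+1} = phi * x_k + unit Gaussian noise."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

if __name__ == "__main__":
    chain = ar1_chain(0.9, 40000)
    print(integrated_autocorr_time(chain), effective_sample_size(chain))
```

The same two functions can be applied to each parameter of an MCMC chain to judge whether the run is long enough.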

3.7.2 Variance between chains

The most widely used criterion for studying convergence was first presented by Gelman & Rubin (1992). Let us suppose we have $M$ chains each consisting of $2N$ iterations out of which we use only the last $N$ iterations. For any given scalar parameter of interest $\theta$ , let

[TABLE]

The index $i$ runs over points in a chain, and the index $j$ runs over the chains. Then the between chain variance and the mean within chain variance can be written as

[TABLE]

The total variance $\hat{\sigma}^{2}$ for the estimator $\bar{\theta}$ can be written as a weighted average of $W$ and $B$ , $\hat{\sigma}^{2}=W(n-1)/n+B$ , where $n$ denotes the number of retained iterations per chain (the $N$ introduced above). If we account for the sampling variability of the estimator $\bar{\theta}$ , then this yields a pooled variance of

[TABLE]

for the mixture of chains. If the initial distribution is over-dispersed, then $B>\sigma^{2}$ and $V$ always overestimates the true variance $\sigma^{2}$ . For any finite $n$ , $W$ is expected to be less than $\sigma^{2}$ , as the individual chains will not have had the time to explore the full target distribution. So, initially we expect $V/W>1$ . However, in the limit $n\to\infty$ , the variance $B$ between chains, which is expected to fall off as $1/n$ , goes to 0 and $W$ approaches the true variance $\sigma^{2}$ , making $V/W$ approach 1. Therefore the ratio $\hat{R}=\sqrt{V/W}$ , also known as the potential scale reduction factor, can be used to monitor convergence.
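The diagnostic can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, $B$ is taken to be the sample variance of the chain means and $W$ the mean of the within-chain sample variances, and the pooled variance is assumed to have the form $V=\hat{\sigma}^{2}+B/M$ .

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor R-hat.

    chains: array of shape (M, N) holding the retained last N iterations of M chains.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = chain_means.var(ddof=1)                # between-chain variance of the means
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    sigma2_hat = W * (n - 1) / n + B           # weighted average of W and B
    V = sigma2_hat + B / m                     # assumed form of the pooled variance
    return np.sqrt(V / W)
```

Chains that have mixed give a value close to 1, whereas chains stuck in different regions give a value well above 1.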

3.7.3 Thinning

For making inferences from an MCMC chain, some algorithms use only every $k$ -th iteration of each sequence such that successive draws are approximately independent, a process known as thinning. However, there is no additional advantage of thinning other than savings in storage. Since we are throwing away information, an estimate from a thinned chain can never be better than the original chain (Geyer, 1992; MacEachern & Berliner, 1994). Moreover, it is difficult to choose an appropriate $k$ without studying the autocorrelation of the full chain. So thinning is useful only in situations where the autocorrelation is known a priori and is known to be large. Here again $k$ should be chosen such that it is smaller than the autocorrelation length, to retain as much information as possible.

3.8 Parallel Tempering

Multimodal distributions in general pose problems for all MCMC algorithms. Parallel tempering is one way to address this problem. It is a type of ensemble sampler where multiple chains are simulated in parallel but are allowed to exchange information. Each chain has a target distribution different from the others, controlled by a parameter $T$ known as the temperature. Let $\pi(x)=\exp(-H(x))$ be the actual target distribution, then a ladder of distributions

$$\pi_{i}(x)=\exp(-H(x)/T_{i}),\qquad i=1,...,n$$

is created, controlled via the parameter $T_{i}$ , such that $T_{1}>T_{2}>...>T_{n}$ . $T_{n}$ is set to 1. Hence, $\pi_{n}$ represents the target distribution. The temperature broadens the target distribution and allows a wider exploration of the parameter space, which makes it useful for exploring multimodal distributions. To exchange information between the chains, a state swapping procedure is used. A swap is proposed between a randomly chosen chain $i$ and one of its neighbors, $i-1$ or $i+1$ , with probability $q_{i,i-1}=q_{i,i+1}=0.5$ for the interior chains and $q_{1,2}=q_{n,n-1}=1$ for the two end chains. Naively accepting every swap would violate the detailed balance condition. So the swap proposal is accepted with probability

[TABLE]

which satisfies detailed balance.
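The scheme can be sketched for a one-dimensional bimodal target as follows. This is a minimal illustration, not the implementation in the supplied software. We assume the standard swap acceptance probability $\min\{1,\exp([H(x_{i})-H(x_{j})](1/T_{i}-1/T_{j}))\}$ , which follows from applying the MH rule to the product of the tempered densities $\exp(-H(x)/T_{i})$ ; the target, the temperature ladder, the step size and all function names are hypothetical choices.

```python
import numpy as np

def H(x):
    """Energy of a bimodal target pi(x) = exp(-H(x)): unnormalized mixture of N(-4,1) and N(4,1)."""
    return -np.logaddexp(-0.5 * (x + 4.0) ** 2, -0.5 * (x - 4.0) ** 2)

def swap_accept_prob(h_i, h_j, T_i, T_j):
    """MH probability of exchanging the states of chains i and j."""
    log_r = (h_i - h_j) * (1.0 / T_i - 1.0 / T_j)
    return float(np.exp(min(0.0, log_r)))

def parallel_tempering(temps, n_iter, step=1.0, seed=0):
    """Return the states of the coldest chain (last entry of temps, T = 1)."""
    rng = np.random.default_rng(seed)
    temps = np.asarray(temps, dtype=float)
    n = len(temps)
    x = np.full(n, -4.0)                       # all chains start in the left mode
    cold = np.empty(n_iter)
    for it in range(n_iter):
        # Within-chain Metropolis update targeting exp(-H(x)/T_i).
        for i in range(n):
            prop = x[i] + step * np.sqrt(temps[i]) * rng.normal()
            if np.log(rng.uniform()) < (H(x[i]) - H(prop)) / temps[i]:
                x[i] = prop
        # Propose a swap between a random chain and one of its neighbours.
        if n > 1:
            i = rng.integers(n)
            if i == 0:
                j = 1
            elif i == n - 1:
                j = n - 2
            else:
                j = i + rng.choice([-1, 1])
            if rng.uniform() < swap_accept_prob(H(x[i]), H(x[j]), temps[i], temps[j]):
                x[i], x[j] = x[j], x[i]
        cold[it] = x[-1]
    return cold
```

With a ladder such as `[16.0, 4.0, 1.0]` the cold chain visits both modes, because states that crossed the barrier at high temperature are passed down the ladder through swaps.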

In parallel tempering the temperature ladder needs to be chosen carefully. If the neighboring temperatures are too far apart, the acceptance rate will be diminished, leading to slow mixing. If the neighboring temperatures are too close, a large number of elements in the ladder will be required to explore a wide range in parameter space, and this can increase the computational cost significantly. However, by exploiting trial runs, a suitable ladder can be constructed (Liang, Liu & Carroll, 2011).

The idea of parallel tempering can be generalized to construct evolutionary algorithms that incorporate features of genetic algorithms into the framework of MCMC. The basic idea is to have parallel chains as in parallel tempering and allow exchange of information while satisfying detailed balance on the product space defined by the chains. The exchange of information is based on ideas of mutation and crossover from genetic algorithms (Liang & Wong, 2001a, b).

3.9 Monte Carlo Metropolis Hastings

In MCMC based Bayesian inference, we are concerned with simulating samples from some pdf $p(\theta|x)\propto p(x|\theta)p(\theta)$ . However, there are situations when $p(x|\theta)$ cannot be easily evaluated or is not available in an analytically tractable form. In such situations one can make use of Monte Carlo based techniques to approximately evaluate $p(\theta|x)$ . More generally, the Metropolis Hastings ratio $r=p(\theta^{\prime}|x)/p(\theta|x)$ is used to update an MCMC chain. In such techniques, typically, one generates a set of auxiliary samples $Y=\{y_{1},...,y_{m}\}$ conditioned on $\theta$ and then uses them to compute $\tilde{p}(\theta|x,Y)$ (an approximation of $p(\theta|x)$ ) or $\tilde{r}$ (an approximation of the ratio $r$ ). However, Monte Carlo based estimates are stochastic and special care is needed when working with them in an MCMC scheme. An algorithm to make use of Monte Carlo based estimates inside a Metropolis Hastings algorithm is given in Algorithm 5 (it is implemented in the software that we provide).

There are many variants of Algorithm 5, depending upon how and at what stage the auxiliary sample is generated (see Chapter 4 in Liang, Liu & Carroll, 2011). The invariant stationary distribution of such Markov chains is not necessarily the target density $p(\theta|x)$ . The characteristics of such chains and their convergence properties are discussed by Beaumont (2003) and Andrieu & Roberts (2009). In Algorithm 5, the auxiliary sample is refreshed in each iteration and the same sample $Y$ is used to estimate both $\tilde{p}(\theta^{\prime}|x,Y)$ and $\tilde{p}(\theta|x,Y)$ . This makes Algorithm 5 more robust compared to other similar alternatives. In classical MCMC, one can reuse the previous estimate of $p(\theta|x)$ when computing $r$ . However, when the Metropolis Hastings ratio $\tilde{r}$ is stochastic, if $\tilde{p}(\theta|x,Y)$ is not evaluated in each iteration using a fresh sample of $Y$ , then the MCMC chain tends to get stuck at a stochastic maximum of the estimated likelihood (Sharma et al., 2014). The smaller the size of the auxiliary sample, or the more inaccurate the Monte Carlo estimate of $\tilde{r}$ , the worse this problem is. Using the same sample $Y$ to estimate both $\tilde{p}(\theta^{\prime}|x)$ and $\tilde{p}(\theta|x)$ leads to lower noise in the estimated ratio $\tilde{r}$ . This property was also noticed by McMillan & Binney (2013) in the context of fitting models of the gravitational potential of the Milky Way to spatio-kinematic data of stars orbiting inside it. Two specific cases where the above algorithm can be used are given below.

3.9.1 Unknown normalization constant

In fitting a model to data, we are interested in sampling $p(\theta|x)\propto p(x|\theta)p(\theta)$ . To do this, the function $p(x|\theta)$ should be properly normalized over the data space, in the sense that $\int p(x|\theta)dx=1$ . However, on many occasions, we have

[TABLE]

where $f(x|\theta)$ is known but the normalization constant $Z(\theta)$ is not known. An example is the problem of fitting a density profile $\rho(r|\theta)$ ( $r$ being the Galactocentric distance) to a sample of stars with Galactic latitude $b>30^{\circ}$ , longitude $l>30^{\circ}$ and heliocentric distance $s<50$ kpc. Here we have $Z(\theta)=\int_{\pi/6}^{\pi/2}db\int_{0}^{50}ds\int_{\pi/6}^{2\pi}dl\,\rho(l,b,s|\theta)\,s^{2}\cos(b)$ .

Our aim is to compute the Metropolis Hastings ratio $r=p(\theta^{\prime}|x)/p(\theta|x)=[Z(\theta)/Z(\theta^{\prime})][f(x|\theta^{\prime})/f(x|\theta)]$ that is used to advance an MCMC chain, and it is the ratio $R=Z(\theta)/Z(\theta^{\prime})$ that is unknown. If one can sample exactly from $f(x|\theta)$ , then it is possible to cancel the normalization constant using ingenious algorithms by Møller et al. (2006) and Murray, Ghahramani & MacKay (2006). However exact sampling is not always feasible. In such cases a Monte Carlo estimate of the ratio of the unknown normalization constant $R=Z(\theta)/Z(\theta^{\prime})$ can be done using samples $Y=(y_{1},...,y_{m})$ generated from density $f(y|\theta)$ , such that

[TABLE]

This sampling can be done by various means, e.g., exact sampling, MCMC, and rejection sampling. If $f(y|\theta^{\prime})$ is difficult to sample from, one can use so-called “importance sampling” by drawing samples from a distribution $g(y|\theta)$ that is easy to sample from. The required ratio of normalization constants is then given by

[TABLE]

and the MH ratio is given by $\tilde{r}=\tilde{R}(\theta,\theta^{\prime};Y)[f(x|\theta^{\prime})/f(x|\theta)]$ .
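The importance-sampling route can be illustrated with a toy problem in which $Z(\theta)$ is actually known, so that the estimate can be checked. This is a sketch under stated assumptions rather than Algorithm 5 itself: we assume the estimator $\tilde{R}=[\sum_{i}f(y_{i}|\theta)/g(y_{i})]/[\sum_{i}f(y_{i}|\theta^{\prime})/g(y_{i})]$ with a common sample $y_{i}\sim g$ , a flat prior on $\theta$ , independent data points (so that the normalization ratio enters once per data point), and hypothetical function names.

```python
import numpy as np

def f(x, theta):
    """Unnormalized density; Z(theta) = theta * sqrt(2 pi), used only to check the estimate."""
    return np.exp(-0.5 * (x / theta) ** 2)

def ratio_Z(theta, theta_new, y, g_pdf):
    """Importance-sampling estimate of R = Z(theta) / Z(theta_new) from samples y ~ g.

    The same sample y is used in numerator and denominator, which reduces the noise of the ratio.
    """
    w = 1.0 / g_pdf(y)
    return np.sum(f(y, theta) * w) / np.sum(f(y, theta_new) * w)

def mh_ratio(x, theta, theta_new, y, g_pdf):
    """Stochastic MH ratio for independent data x and a flat prior on theta (assumed)."""
    log_f = np.sum(np.log(f(x, theta_new)) - np.log(f(x, theta)))
    log_R = np.log(ratio_Z(theta, theta_new, y, g_pdf))
    # Each data point contributes one factor Z(theta) / Z(theta_new).
    return np.exp(len(x) * log_R + log_f)
```

In an MCMC run the auxiliary sample `y` would be redrawn at every iteration, as discussed above, rather than reused across iterations.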

3.9.2 Marginal inference

Here we are interested in the marginal density $p(\theta|x)=\int p(\theta,y|x)dy$ , but the integral may not be analytically tractable and may also be difficult to do by deterministic schemes. In such situations, the integration can be done by Monte Carlo importance sampling, using auxiliary samples $Y$ generated from some density $g(y|\theta)$ that is easy to sample from. Thus we have

[TABLE]

3.10 Hamiltonian Monte Carlo

One of the attractive features of MCMC for sampling pdfs is its better performance for higher dimensions. However, for very large dimensions, traditional MCMC algorithms start running into problems. While for lower dimensions a typical set of the posterior (e.g. the region encompassing 99% of the total probability) lies close to the center, for higher dimensions a typical set lies in a shell that has a very large volume. Since a shell cannot be traversed with large step sizes, it takes a long time to explore the posterior.

Hamiltonian Monte Carlo (HMC) tries to address this problem by introducing an auxiliary variable called momentum $u$ for each real variable $x$ called position (Duane et al., 1987; Neal, 1993). The negative log of the posterior (target density) $\pi(x)$ is assumed to define the potential energy $U(x)=-\ln\pi(x)$ , and the momenta define the kinetic energy $K(u)$ . Together they define the Hamiltonian $H(x,u)=U(x)+K(u)$ , where $K(u)=u^{2}/2$ . The distribution to be explored is

$$p(x,u)\propto\exp(-H(x,u))=\pi(x)\exp(-K(u)).$$

Next, principles of Hamiltonian dynamics are used to advance a given point to a new location. The point is then accepted or rejected based on the MH algorithm. The use of Hamiltonian dynamics to advance a given point allows the point to travel to locations which are far from its current location. This allows faster exploration of the parameter space.
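A minimal sketch of one way to implement this is given below. We assume the commonly used leapfrog integrator to simulate the Hamiltonian dynamics, with $K(u)=u^{2}/2$ and a fresh Gaussian momentum at every iteration; the step size, the number of steps and the function names are hypothetical choices, and the gradient of $U$ is supplied by the user.

```python
import numpy as np

def leapfrog(x, u, grad_U, eps, n_steps):
    """Integrate Hamilton's equations for H = U(x) + u^2/2 with the leapfrog scheme."""
    x = np.array(x, dtype=float)                # copies, so the inputs are not modified
    u = np.array(u, dtype=float)
    u -= 0.5 * eps * grad_U(x)                  # initial half step in momentum
    for _ in range(n_steps - 1):
        x += eps * u                            # full step in position
        u -= eps * grad_U(x)                    # full step in momentum
    x += eps * u
    u -= 0.5 * eps * grad_U(x)                  # final half step in momentum
    return x, u

def hmc(U, grad_U, x0, n_samples, eps=0.2, n_steps=20, seed=0):
    """Sample from pi(x) = exp(-U(x)) using HMC with an MH accept/reject step."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    for k in range(n_samples):
        u = rng.normal(size=x.size)             # momentum drawn from exp(-K(u))
        x_new, u_new = leapfrog(x, u, grad_U, eps, n_steps)
        dH = (U(x_new) + 0.5 * u_new @ u_new) - (U(x) + 0.5 * u @ u)
        if np.log(rng.uniform()) < -dH:         # accept with probability min(1, exp(-dH))
            x = x_new
        samples[k] = x
    return samples
```

Because the leapfrog scheme nearly conserves $H$ , the acceptance probability stays high even though the proposed point lies far from the current one.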

There are two major obstacles involved with using HMC, and these have prevented its widespread use. First, it requires the gradient of the target density. Secondly, it requires two extra parameters to be tuned by the user: a step size $\epsilon$ to advance from the current state and the number of steps over which to evolve the Hamiltonian system. Considerable progress has been made to address both these issues.

Automatic (algorithmic) differentiation can be used to accurately compute the derivatives of a given function without any user intervention (Griewank & Walther, 2008). The idea is that any function written as a computer program can be described as a sequence of elementary arithmetic operations, and then by applying the chain rule of derivatives repeatedly on these operations, the derivatives can be computed. Alternatively, one can create analytical functions to approximate the target density and use these to compute the derivatives. This is possible because the exact Hamiltonian is only required when computing the acceptance probability, and this does not require derivatives. For simulating the trajectory, one needs derivatives and here one can use an approximate Hamiltonian (Neal, 2011). Applications of HMC for fitting cosmological parameters are given by Hajian (2007) and Taylor, Ashdown & Hobson (2008). Hoffman & Gelman (2014) provide additional algorithms for automatic tuning of the step size $\epsilon$ and the number of steps $L$ , known as the No-U-Turn Sampler. This is used in the open-source Bayesian inference package Stan (available at http://www.mc-stan.org).

3.11 Population Monte Carlo

Population Monte Carlo is an iterative importance sampling technique that adapts itself at each iteration and produces a sample approximately simulated from the target distribution. The sample along with its importance weights can be used to construct unbiased estimates of quantities integrated over the target distribution. Suppose $h(x)$ is a quantity of interest. One of the major applications of MCMC is to compute integrals like $J=\int h(x)\pi(x)dx$ . In importance sampling, this is replaced by

[TABLE]

where $(x_{1},...,x_{n})$ are sampled from a distribution $q(x)$ which is easier to sample than $\pi(x)$ . The closer the importance function is to the target distribution, the better the quality of the estimate (lower variance). In practice it is difficult to guess a good importance function.
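A short sketch of plain importance sampling is given below for illustration. We assume the estimator $J\approx(1/n)\sum_{i}h(x_{i})\pi(x_{i})/q(x_{i})$ with both $\pi$ and $q$ normalized; the function names and the Gaussian example are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def importance_estimate(h, log_pi, log_q, x):
    """Estimate J = int h(x) pi(x) dx from samples x ~ q; returns (J, weights).

    Both pi and q are assumed to be normalized densities.
    """
    w = np.exp(log_pi(x) - log_q(x))            # importance weights pi(x_i) / q(x_i)
    return np.mean(h(x) * w), w
```

For example, with $\pi=\mathcal{N}(0,1)$ , $q=\mathcal{N}(0,2^{2})$ and $h(x)=x^{2}$ the estimate should be close to 1, and the mean of the weights should also be close to 1.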

The main idea in population Monte Carlo is to start with a reasonable guess of the importance function $q_{0}$ and then iteratively improve $q_{t}$ by making use of the past set of samples $(x_{1}^{t-1},...,x_{N}^{t-1})$ . The importance function can adapt not only in time (with each iteration), but also in space, and can be written in general as $q_{t}(.|x_{i}^{t-1})$ . Suppose $X^{t}=\{x_{1}^{t},...,x_{N}^{t}\}$ are the set of points at iteration $t$ . Let $x_{i}^{t}$ be produced from importance distribution $q_{t}(x|x_{i}^{t-1})$ . An estimate of $J$ is then given by

[TABLE]

Thus the expectation value of any function $h(x)$ computed using importance sampling is unbiased, i.e.

[TABLE]

Here $g$ is the distribution of $X^{t-1}$ and the equality is valid for any $g$ .

A simple choice for the importance function is to set it as a mixture of normal or $t$ -distributions, e.g., $q^{t}(x)=\sum_{d=1}^{D}\alpha_{d}^{t}\mathcal{N}(x|\mu_{d}^{t},\Sigma_{d}^{t})$ (Cappé et al., 2008). This has been used for cosmological parameter estimation (Wraith et al., 2009) and model comparison (Kilbinger et al., 2010).

3.12 Nested Sampling

In Section 2.5, we saw that computing the evidence is computationally challenging. Nested sampling (Skilling, 2006) is designed to ease this computation. To compute the evidence, we are interested in computing quantities like

[TABLE]

Integration is basically chopping up the full space into small volume elements and summing the contribution of the integrand over these cells. We are free to chop up the volume and order or label the cells as we wish. So we divide the space by iso-likelihood contours and define a variable $X$ to label them. A convenient choice is the prior probability mass enclosed by an iso-likelihood contour, i.e.

[TABLE]

If the prior probability is normalized, then $X$ ranges from 0 for the highest likelihood to 1 for the lowest likelihood. Given the above definition, we can also define an inverse function $L(X)$ , which is the likelihood that encloses a probability mass of $X$ . So the integral for $Z$ can now be written as $Z=\int L(X)dX$ .

Suppose we generate $N$ samples uniformly from the prior distribution. Next, we sort them in decreasing sequence of $L$ to give prior mass $X_{i}=i/N$ . Then using the trapezoidal rule, one can easily perform the numerical integration. However, a significant contribution to the integral comes from a region with small prior mass $X$ . So, the integral should be done in equal steps in $\ln(X)$ rather than $X$ . This can be done using an iterative procedure. We start with a set $A$ of $N$ points drawn from the prior. At each iteration $i$ , let $L_{i}$ be the lowest likelihood among the points in $A$ ; we replace the corresponding point in set $A$ with a new point drawn uniformly from the prior but satisfying $L>L_{i}$ . This generates a sequence of $L_{i}$ for which the expected $X_{i}=\exp(-i/N)$ .
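The iterative procedure can be sketched as follows. This is an illustration only and not one of the packages mentioned in this section: the constrained draw is done by plain rejection sampling from the prior, which is practical only for low-dimensional toy problems, the evidence is accumulated with a simple rectangle rule, and the function and argument names are hypothetical.

```python
import numpy as np

def nested_sampling(log_like, prior_sample, n_live=300, n_iter=1500, seed=0):
    """Estimate the evidence Z = int L(X) dX with a basic nested sampling loop.

    log_like(theta) returns ln L; prior_sample(rng) returns one draw from the prior.
    """
    rng = np.random.default_rng(seed)
    live = np.array([prior_sample(rng) for _ in range(n_live)])
    logL = np.array([log_like(p) for p in live])
    Z, X_prev = 0.0, 1.0
    for i in range(1, n_iter + 1):
        worst = np.argmin(logL)                 # live point with the lowest likelihood L_i
        X_i = np.exp(-i / n_live)               # expected prior mass enclosed by L_i
        Z += np.exp(logL[worst]) * (X_prev - X_i)
        X_prev = X_i
        # Replace the worst point by a prior draw satisfying L > L_i (rejection sampling).
        L_min = logL[worst]
        while True:
            cand = prior_sample(rng)
            ll = log_like(cand)
            if ll > L_min:
                break
        live[worst], logL[worst] = cand, ll
    # Add the contribution of the prior mass still enclosed by the live points.
    Z += X_prev * np.mean(np.exp(logL))
    return Z
```

For a uniform prior on $[-5,5]$ and $L(\theta)=\exp(-\theta^{2}/2)$ the result can be compared with the analytic value $Z\approx\sqrt{2\pi}/10$ .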

Nested sampling is widely used for cosmological model selection and parameter estimation. Three publicly available packages based on nested sampling are CosmoNest (Parkinson, Mukherjee & Liddle, 2006; Mukherjee, Parkinson & Liddle, 2006, see https://github.com/dparkins/CosmoNest), MultiNest (Feroz, Hobson & Bridges, 2009, see https://ccpforge.cse.rl.ac.uk/gf/project/multinest) and DNEST (Brewer, Pártay & Csányi, 2011, see https://github.com/eggplantbren/DNest4).

4 Bayesian hierarchical modelling (BHM)

In the simplest setting, we have some observed data $Y$ generated by some model having parameters $\theta$ which can be inferred using the Bayes theorem as

[TABLE]

where $p(\theta)$ denotes our prior knowledge or belief about $\theta$ . If the model parameters $\theta$ depend upon another set of parameters $\phi$ through $p(\theta|\phi)p(\phi)$ , then $\theta$ and $\phi$ can be inferred using

[TABLE]

The variable $\phi$ is known as the hyperparameter and $p(\phi)$ , the distribution of the hyperparameter, as a hyperprior. Alternatively, the observed data $Y$ may depend upon another set of hidden variables $X$ , which in turn depend on $\theta$ . The inference of $\theta$ and $X$ can then be established using

[TABLE]

Such situations lead to hierarchies and Bayesian models of this type are known as hierarchical models. It turns out that hierarchies are quite common in real world applications, often where more than two levels exist, and Bayesian hierarchical modelling provides a framework for capturing this.

Let us consider a simple example (for details see Gelman et al., 2013). Suppose we observe some data $Y$ (a set of measurements of some variable $y$ ) with uncertainty $\sigma$ , and we are interested in the mean $\alpha=\bar{y}$ . Now suppose that the data $Y=\{y_{ij}|1\leq j\leq J,1\leq i\leq n_{j}\}$ are grouped into $J$ independent groups, and we have reason to believe that the group mean $\alpha_{j}$ varies from group to group. For observations within a group $j$ , our model is

[TABLE]

where we denote by $y_{.j}$ an observation belonging to group $j$ . We now compute the group mean $\bar{y}_{.j}$ , instead of global mean $\bar{y}$ , to capture the variation of mean across groups. A global mean is certainly an inaccurate description of data, whenever the group mean is far away from the global mean. However, if the number of data points in a group is very small, e.g., $n_{j}=2$ , then the uncertainty in the group mean is large and it is much better to trust the global mean than the group mean.

Bayesian hierarchical modelling provides a natural way to handle the above problem of group means. It can act like a middle ground between the two extremes, global mean versus group mean. To demonstrate this, we set up the above problem using a Bayesian hierarchical model. Suppose the group means are distributed according to a normal distribution

[TABLE]

where $\mu$ and $\omega$ are unknown parameters of the model. The $\mu$ , $\omega$ and the group means $\alpha=\{\alpha_{1},...,\alpha_{J}\}$ can then be inferred from data $Y$ using

[TABLE]

We generated synthetic data with $\mu=0$ , $\omega=1$ , $J=40$ , $\sigma=1$ and $2<n_{j}<10$ ; we then estimated $\alpha$ , $\mu$ and $\omega$ (assuming flat priors for $\mu$ and $\omega$ ). The results are shown in Figure 6. The BHM based group mean estimates are systematically shifted with respect to standard group mean estimates (computed from the data points in a group). The BHM estimates are closer to the global mean than the standard estimates. The shift between the two estimates is more for cases where the error bars are large. The BHM estimates also have smaller error bars. This is because, when estimating the group mean, in addition to points within a group the BHM model also makes use of information available from other groups.
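The shrinkage of the group means can be reproduced with a short Gibbs sampler. The sketch below is not the code used to produce Figure 6: it assumes a known $\sigma$ , flat priors on $\mu$ and $\omega$ , and the resulting standard conditional distributions (normal for $\alpha_{j}$ and $\mu$ , inverse gamma for $\omega^{2}$ ); the function names, seeds and chain lengths are hypothetical.

```python
import numpy as np

def simulate(J=40, mu=0.0, omega=1.0, sigma=1.0, seed=1):
    """Synthetic grouped data; returns the group means ybar_j and group sizes n_j."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(mu, omega, size=J)
    n = rng.integers(3, 10, size=J)             # group sizes with 2 < n_j < 10
    ybar = np.array([rng.normal(a, sigma, size=k).mean() for a, k in zip(alpha, n)])
    return ybar, n

def gibbs_bhm(ybar, n, sigma=1.0, n_iter=5000, burn=1000, seed=2):
    """Gibbs sampler for (alpha, mu, omega); returns their posterior means."""
    rng = np.random.default_rng(seed)
    J = len(ybar)
    mu, omega2 = ybar.mean(), ybar.var()
    keep = []
    for it in range(n_iter):
        # alpha_j | mu, omega, data: precision-weighted mix of the group mean and mu.
        prec = n / sigma ** 2 + 1.0 / omega2
        alpha = rng.normal((n * ybar / sigma ** 2 + mu / omega2) / prec, 1.0 / np.sqrt(prec))
        # mu | alpha, omega with a flat prior on mu.
        mu = rng.normal(alpha.mean(), np.sqrt(omega2 / J))
        # omega^2 | alpha, mu with a flat prior on omega: inverse gamma, shape (J - 1) / 2.
        S = np.sum((alpha - mu) ** 2)
        omega2 = 0.5 * S / rng.gamma(0.5 * (J - 1))
        if it >= burn:
            keep.append(np.concatenate([alpha, [mu, np.sqrt(omega2)]]))
    keep = np.array(keep)
    return keep[:, :J].mean(axis=0), keep[:, J].mean(), keep[:, J + 1].mean()
```

The posterior means of $\alpha_{j}$ returned by `gibbs_bhm` are less spread out than the raw group means `ybar`, which is the shift towards the global mean described above.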

4.1 Expectation maximization, data augmentation and Gibbs sampling

The easiest way to analyze a Bayesian hierarchical model is via Gibbs sampling, and the motivation for doing this was provided by the expectation maximization (EM) algorithm. In fact, the EM algorithm led to the development of the data augmentation (DA) algorithm, which in turn provided the idea to use Gibbs sampling to solve Bayesian hierarchical models.

Hence we begin by exploring the EM algorithm (Dempster, Laird & Rubin, 1977), which is one of the most influential algorithms in the field of statistics. Let us suppose that we have some observed data $x=\{x_{1},...,x_{N}\}$ generated by some model $p(x|\theta)$ having parameters $\theta$ . We want to compute the most likely parameters of the model given the data, i.e., $\hat{\theta}={\rm argmax}_{\theta}[p(x|\theta)]$ . The full model is specified by $p(x,z|\theta)$ with $p(x,z|\theta)=\prod_{i=1}^{N}p(x_{i},z_{i}|\theta)$ , where $z$ are variables which are either missing or hidden or unobserved. The EM algorithm solves this problem in two steps. It starts with a fiducial value $\theta_{0}$ , then does the following at every iteration $t$ .

• E-step: Compute $Q(\theta|\theta_{t},x)=\int dz\ p(z|\theta_{t},x)\log p(x,z|\theta)$ . In other words, it computes the expectation of the log likelihood $\log p(x,z|\theta)$ with respect to $p(z|\theta_{t},x)$ .

• M-step: Find the value of $\theta$ that maximizes $Q(\theta|\theta_{t},x)$ and set $\theta_{t+1}={\rm argmax}_{\theta}[Q(\theta|\theta_{t},x)]$ .

These steps are repeated iteratively until $\theta_{t+1}\approx\theta_{t}$ . The proof that the EM algorithm increases the likelihood $p(x|\theta)$ at each stage is as follows. The conditional density of the missing data $z$ given the observed data $x$ and the model parameter $\theta$ is given by

[TABLE]

Taking the $\log$ and then the expectation with respect to $p(z|\theta_{t},x)$ , we get

[TABLE]

which is valid for any $\theta$ . Using this result, we can compute the difference

[TABLE]

Due to the M-step, $Q(\theta_{t+1}|\theta_{t},x)-Q(\theta_{t}|\theta_{t},x)\geq 0$ . Also, from Gibbs’ inequality, $S(\theta_{t+1}|\theta_{t})-S(\theta_{t}|\theta_{t})\geq 0$ . This means that each EM iteration is guaranteed not to decrease the marginal likelihood $p(x|\theta)$ . The algorithm therefore converges towards a stationary point of the likelihood, which is not necessarily the global maximum: it can still get stuck at a saddle point or a local maximum.
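The monotonic behaviour of the likelihood can be checked numerically with a standard textbook example, a two-component Gaussian mixture in one dimension, where the hidden variables $z_{i}$ are the component labels. The sketch below is for illustration only; the function name and the choice of model are ours, and both steps have closed forms for this model.

```python
import numpy as np

def em_gmm(x, mu, sigma, w, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns the final (mu, sigma, w) and the log likelihood ln p(x|theta_t) at each iteration.
    """
    x = np.asarray(x, dtype=float)
    mu, sigma, w = (np.array(v, dtype=float) for v in (mu, sigma, w))
    log_like = []
    for _ in range(n_iter):
        # E-step: responsibilities p(z_i = k | x_i, theta_t).
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        log_like.append(np.sum(np.log(dens.sum(axis=1))))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizer of Q(theta | theta_t, x).
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return mu, sigma, w, np.array(log_like)
```

The returned sequence of log likelihoods never decreases, but which maximum is reached depends on the starting values.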

The EM algorithm as presented above is deterministic. In general, it is not always easy to compute the expectation value, as it involves integrals over high dimensions. A general way to compute $Q(\theta|\theta_{t},x)$ would be to draw $m$ random samples of $z$ from the distribution $p(z|\theta_{t},x)$ and average $\log p(x,z|\theta)$ over them. We label this stochastic estimate $Q_{S}(\theta|\theta_{t},x)$ , which in the limit $m\to\infty$ is the same as $Q(\theta|\theta_{t},x)$ . Having computed $Q_{S}$ , the M-step can proceed as usual to maximize it and compute a new $\theta_{t+1}$ . In fact $m$ can be set to 1. This is the stochastic version of EM (SEM) as given by Celeux & Diebolt (1985). Because of stochasticity, one does not get a unique answer but instead a distribution. In fact, SEM generates a Markov chain, which under mild regularity conditions converges to a stationary distribution. The algorithm has an additional advantage in that it is less likely to get stuck at a local maximum.

If we now replace the M-step with a draw of $\theta$ from the $Q_{S}(\theta|\theta_{t},x)$ , this becomes a fully stochastic method; this is, as previously mentioned, the DA algorithm of Tanner & Wong (1987). This is equivalent to a two-step Gibbs Sampler for sampling from

[TABLE]

1. Sample $Z_{t+1}$ from $p(Z|\theta_{t},X)$ .

2. Sample $\theta_{t+1}$ from $p(\theta|Z_{t+1},X)$ .

From the properties of the Gibbs sampler, we know that the sequence of $(\theta_{t},Z_{t})$ forms a Markov chain that samples $p(\theta,Z|X)$ . Although Gibbs sampling requires sampling from the conditional distribution, the inner step can be replaced by MH sampling, leading to the Metropolis-within-Gibbs method as discussed in Section 3.3. This provides a completely general scheme for handling missing data.

Finally, the DA algorithm is not limited to just missing variables of the data, but can also be applied to unknown parameters of the model, e.g., $\alpha$ in

[TABLE]

Such dependencies are common in Bayesian hierarchical modeling. In general, Bayesian hierarchical modeling provides a framework for handling marginalization in Bayesian data analysis, i.e., handling parameters or variables that are either unknown or missing but are necessary to model the data.

4.2 Handling uncertainties in observed data

Marginalization is not limited to handling missing data. It can also be used to handle data $X=\{x_{i}|i=1,...,N\}$ with uncertainty $\sigma_{X}=\{\sigma_{x,i}|i=1,...,N\}$ . Consider

[TABLE]

where $X_{t}=\{x_{i}^{t}|i=1,...,N\}$ are the true values of the observed data $X$ . Here again, instead of doing an integration, one treats the true values as unknowns and samples them using the Gibbs scheme. We demonstrate this with a simple example where $p(x_{i}^{t}|\theta)\sim\mathcal{N}(x_{i}^{t}|\mu,\sigma^{2})$ is the model that generates the data, and $\theta=(\mu,\sigma)$ are the unknowns which we wish to evaluate. The data has uncertainty described by another Gaussian function $p(x_{i}|x_{i}^{t},\sigma_{x,i})\sim\mathcal{N}(x_{i}|x_{i}^{t},\sigma_{x,i}^{2})$ .

For this simple case, the integral in Equation (83) leads to an analytical expression

[TABLE]

We used $(\mu,\sigma)=(0.0,1.0)$ and $\sigma_{x}=0.5$ to generate test data and then estimated $\mu$ and $\sigma$ using two schemes: (1) the DA algorithm, which uses Equation (84) and treats $X_{t}$ as unknown and samples from it, and (2) an explicit integration scheme, which uses Equation (85) where the variable $x_{i}^{t}$ has been integrated out of the equation. The Markov chain was run for 100,000 iterations. Figure 7 $a,b$ shows the pdf of the estimates of the two parameters. Both schemes give identical results. The autocorrelation functions for the two parameters are shown in Figure 7 $c,d$ . The DA algorithm has a slightly higher autocorrelation time $\tau$ as it has to sample an extra parameter for each data point.

5 Case studies in astronomy

In this section, we study a range of cases in astronomy where MCMC based Bayesian analysis is making a significant impact. The emphasis is on showing how to set up a diverse range of problems within the Bayesian framework and how to solve them using MCMC techniques. The examples are intentionally chosen from different areas of astronomy so as to demonstrate the ubiquity of the techniques reviewed here. There is a long history of applying such techniques in the field of cosmology, and excellent reviews and books already exist here: Trotta (2008); Hobson (2010); Parkinson & Liddle (2013).

5.1 Exoplanets and binary systems using radial velocity measurements

The presence of a planet or a companion star results in temporal variations in the radial velocity of the host star. By analyzing the radial velocity data, one can draw inferences about the ratio of masses between the host and the companion, and orbital parameters like the period and eccentricity. We now describe how to set up the above inference problem in a Bayesian framework. We begin by describing the predictive model for the radial velocity of a star in a binary system.

The radial velocity of a star of mass $M$ in a binary system with companion of mass $m$ in an orbit with time period $T$ , inclination $I$ and eccentricity $e$ is given by

[TABLE]

The true anomaly $f$ is a function of time, but depends upon $e$ , $T$ , and $\tau$ via

[TABLE]

An example of radial velocity data is shown in Figure 8, which shows the radial velocity for two binary systems (the green and the red lines) that differ in $e$ but have the same values for all other parameters $\kappa,T,\tau,\omega$ and $v_{0}$ . The figure demonstrates that the radial velocity is sensitive to the eccentricity of the orbit. The remaining symbols are defined as follows: $v_{0}$ is the mean velocity of the center of mass of the binary system; $I$ is the inclination of the orbital plane with respect to the sky (the angle between the orbital angular momentum and the line of sight); $\omega$ is the angle of the pericenter measured from the ascending node (the point where the orbit intersects the plane of the sky); and $\tau$ is the time of passage through the pericenter.

The actual radial velocity data will differ from the perfect relationship given in Equation (86) due to observational uncertainty (variance $\sigma_{v}^{2}$ ) and intrinsic variability of a star (variance $S^{2}$ ) and we can model this by a Gaussian function $\mathcal{N}(.|v,\sigma_{v}^{2}+S^{2})$ . For radial velocity data $D$ defined as a set of radial velocities $\{v_{1},...,v_{M}\}$ at various times $\{t_{1},...,t_{M}\}$ , one can fit and constrain seven parameters, $\theta=(v_{0},\kappa,T,e,\tau,\omega,S)$ , using the Bayes theorem as shown below

[TABLE]

We generated test data using Equation (86) and then, using the above equation, we tried to recover the parameters $\theta$ (available in the supplied software). The posterior distribution $p(\theta|D)$ was sampled using MCMC, and the results are shown in Figure 8. Panel $a$ shows the test data along with the best fit curve. It also shows the radial velocity for the case with $e=0$ . Panel $b$ shows the posterior distribution of the parameters $\kappa,T$ and $e$ .
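For readers who wish to experiment, the forward model and the likelihood can be sketched as below. This is not the code in the supplied software. We assume that Equation (86) has the standard Keplerian form $v=v_{0}+\kappa[\cos(f+\omega)+e\cos\omega]$ with $\kappa$ treated directly as the velocity amplitude, and that the true anomaly follows from Kepler's equation $u-e\sin u=2\pi(t-\tau)/T$ ; the function names are hypothetical and the Newton iteration is adequate only for moderate eccentricities.

```python
import numpy as np

def eccentric_anomaly(M, e, tol=1e-12, max_iter=100):
    """Solve Kepler's equation u - e sin(u) = M for u by Newton iteration."""
    M = np.asarray(M, dtype=float)
    u = M + e * np.sin(M)                       # starting guess
    for _ in range(max_iter):
        du = (u - e * np.sin(u) - M) / (1.0 - e * np.cos(u))
        u = u - du
        if np.max(np.abs(du)) < tol:
            break
    return u

def radial_velocity(t, v0, kappa, T, e, tau, omega):
    """Radial velocity at times t for the assumed Keplerian form of the model."""
    M = 2.0 * np.pi * (np.asarray(t, dtype=float) - tau) / T     # mean anomaly
    u = eccentric_anomaly(M, e)
    # True anomaly from tan(f/2) = sqrt((1+e)/(1-e)) tan(u/2), written with arctan2.
    f = 2.0 * np.arctan2(np.sqrt(1 + e) * np.sin(u / 2), np.sqrt(1 - e) * np.cos(u / 2))
    return v0 + kappa * (np.cos(f + omega) + e * np.cos(omega))

def log_likelihood(theta, t, v, sigma_v):
    """Gaussian log likelihood with variance sigma_v^2 + S^2 for each measurement."""
    v0, kappa, T, e, tau, omega, S = theta
    var = sigma_v ** 2 + S ** 2
    res = v - radial_velocity(t, v0, kappa, T, e, tau, omega)
    return -0.5 * np.sum(res ** 2 / var + np.log(2 * np.pi * var))
```

The function `log_likelihood`, combined with priors on the seven parameters, gives a log posterior that can be passed to any of the samplers discussed in Section 3.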

If we have data for a large number of binary systems, we can use it to explore the distribution of orbital parameters. A naive way to do this would be to get a “maximum a posteriori” (MAP) estimate of the orbital parameters for each star and then study the population distribution by constructing histograms out of it. Such a scheme will give incorrect estimates of the population distribution as the uncertainty associated with the parameter estimates is ignored. In addition to this, as discussed by Hogg, Myers & Bovy (2010), the MAP estimates are in general biased. In the context of radial velocity data, the estimates of $e$ are biased high. The problem is especially acute if the uncertainty associated with the parameters is large, which is often the case with radial velocity data from barycentric motions.

All of these problems can be avoided by setting up the problem of estimation of population distributions as a hierarchical Bayesian model. Let us suppose we have radial velocity data for $N$ binary star systems, and denote by $y_{i}$ the radial velocity data set for the $i$ -th system. Let $x_{i}=(v_{0i},\kappa_{i},T_{i},e_{i},\tau_{i},\omega_{i},S_{i})$ be the orbital parameters for the $i$ -th system. Finally, let $\alpha$ be the set of hyperparameters that govern the population distribution of the parameters $x$ . The problem to determine $\alpha$ can be set up as

[TABLE]

This is a BHM and can be sampled using the Metropolis-within-Gibbs scheme discussed in Section 4.1. The parameters $x_{i}$ can be estimated alongside $\alpha$ , and to get the marginal distribution $p(\alpha|\{y_{i}\})$ , one can simply ignore the computed $x_{i}$ .

However, the above scheme is not well suited to explore a variety of population models, especially if sampling from $p(y_{i}|x_{i})p(x_{i}|\alpha)$ is computationally demanding. We now show a computationally efficient scheme by Hogg, Myers & Bovy (2010) that can in general be applied to BHMs of two levels. The marginal distribution of hyperparameters that we are interested in is given by

[TABLE]

The integral on the right hand side can be estimated using a Monte Carlo integration scheme as follows:

[TABLE]

with $x_{ik}$ sampled from $p(x_{i}|y_{i})\propto p(y_{i}|x_{i})p(x_{i})$ , which can be done by an MCMC scheme.
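The Monte Carlo estimate can be sketched as follows. We assume the estimator has the form $p(\{y_{i}\}|\alpha)\propto\prod_{i}(1/K)\sum_{k}p(x_{ik}|\alpha)/p(x_{ik})$ , where $p(x_{i})$ is the interim prior used when sampling each system individually; the function and argument names are hypothetical, and the sum over $k$ is done in log space for numerical stability.

```python
import numpy as np

def log_marginal_likelihood(alpha, samples, log_pop, log_interim):
    """ln p({y_i} | alpha) up to an additive constant.

    samples: array of shape (N, K) with K posterior draws x_ik for each of N objects,
             obtained under the interim prior p(x_i).
    log_pop(x, alpha): ln p(x | alpha), the population model.
    log_interim(x): ln p(x), the interim prior.
    """
    log_w = log_pop(samples, alpha) - log_interim(samples)      # ln[p(x_ik|alpha) / p(x_ik)]
    # Average the weights over k for each object, then sum the logs over objects.
    m = log_w.max(axis=1, keepdims=True)
    return np.sum(m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1)))
```

The per-object samples need to be generated only once; different population models can then be explored by changing `log_pop` alone, which is what makes the scheme computationally efficient.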

5.2 Data driven approach to estimation of stellar parameters from a spectrum

The spectrum of a star contains information about its properties like temperature, gravity and the abundance of different chemical elements that make up the star. Decoding information about stellar parameters from a stellar spectrum is a problem of great significance for astronomy. With the advent of large spectroscopic stellar surveys having several hundred thousand spectra, the need for fast and accurate methods to analyze the stellar spectra has gained prominence. Let us denote the stellar parameters (e.g., $T_{\rm eff},\log g,{\rm[Fe/H]},$ and ${\rm[X/Fe]}$ ) by label vector ${\bf x}=(x_{1},...,x_{K})$ and the observed spectrum by vector ${\bf y}=\{y_{1},...,y_{L}\}$ , denoting normalized flux at specific wavelengths indexed by ${\bf\lambda}=(1,...,L)$ (see Figure 9). The problem is to find ${\bf x}$ given ${\bf y}$ , which using the Bayes theorem can be written down as

[TABLE]

Here $p({\bf y}|{\bf x},\theta)$ denotes a probabilistic generative model for the data, with $\theta$ being the parameters of the model. If we denote by $f_{\lambda}({\bf x},\theta_{\lambda})$ the flux predicted by the model at wavelength $\lambda$ and by $s_{\lambda}^{2}$ the variance or scatter about this relation (assuming Gaussian noise), then the probabilistic generative model for the full spectrum can be written as

[TABLE]

Traditionally, $f_{\lambda}({\bf x},\theta_{\lambda})$ is calculated from first principles using a physical theory for the formation of spectral lines in a stellar atmosphere specified by stellar parameters ${\bf x}$ . Frequently, $f_{\lambda}({\bf x},\theta_{\lambda})$ is evaluated on a grid defined on $x$ and then interpolation is used to get the spectrum for any arbitrary value of ${\bf x}$ . The $f_{\lambda}({\bf x},\theta_{\lambda})$ can also be computed by interpolating over a library of empirical spectra with predefined stellar parameters. A more refined data driven approach to the problem using machine learning techniques was presented in Ness et al. (2015). In this approach, $f_{\lambda}({\bf x},\theta_{\lambda})$ is approximated by a simple (linear or quadratic) function of label vector ${\bf x}$ . Therefore

[TABLE]

Let us consider a training set of $N$ stars with label vectors $X=\{{\bf x}^{1},...,{\bf x}^{N}\}$ and denote the corresponding set of fluxes at wavelength $\lambda$ by $Y_{\lambda}=\{y_{\lambda}^{1},...,y_{\lambda}^{N}\}$ . One can estimate $\theta_{\lambda}$ by sampling with MCMC such that

[TABLE]

Having obtained the model parameters $\theta=\{\theta_{1},...,\theta_{L},s_{1},...,s_{L}\}$ , one can now estimate stellar parameters ${\bf x}$ of a new star with given spectrum ${\bf y}$ using Equation (92). This is the basis of The Cannon algorithm (Ness et al. 2015), which is already widely used by the stellar community. The ability of the algorithm to model the spectra is demonstrated in Figure 9, which shows the spectra of four stars along with the best-fit spectra for each of them.
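The two-step structure (train $\theta_{\lambda}$ and $s_{\lambda}$ on stars with known labels, then infer labels for a new spectrum) can be illustrated with a small sketch. It is not The Cannon code: it assumes a purely linear model in the labels, replaces MCMC sampling with least-squares point estimates, and uses synthetic data; the names `design`, `train` and `infer_labels` are hypothetical.

```python
import numpy as np

def design(X):
    """Prepend a constant column so the model is theta_0 + theta_1..K . x."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def train(X, Y):
    """Training step. X: (N, K) labels, Y: (N, L) fluxes.

    Returns theta with shape (K + 1, L) and the scatter s with shape (L,).
    Least squares is used here as a stand-in for MCMC sampling.
    """
    A = design(X)
    theta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    resid = Y - A @ theta
    s = np.sqrt(np.mean(resid ** 2, axis=0))  # rms scatter per wavelength
    return theta, s

def infer_labels(y, theta, s):
    """Test step: weighted least squares for the labels of one spectrum y (L,)."""
    w = 1.0 / s
    A = theta[1:].T * w[:, None]   # (L, K), each wavelength weighted by 1/s
    b = (y - theta[0]) * w
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x

# Synthetic training set (assumption: K = 2 labels, L = 50 wavelengths)
rng = np.random.default_rng(42)
N, K, L = 200, 2, 50
theta_true = rng.normal(size=(K + 1, L))
X = rng.normal(size=(N, K))
Y = design(X) @ theta_true + 0.01 * rng.normal(size=(N, L))
theta, s = train(X, Y)

# A new star with known labels, used to check the recovery
x_new = np.array([0.3, -0.5])
y_new = theta_true[0] + x_new @ theta_true[1:] + 0.01 * rng.normal(size=L)
x_hat = infer_labels(y_new, theta, s)
```

With a quadratic model the training step stays linear in $\theta_{\lambda}$, but the test step becomes a nonlinear optimization or sampling problem in ${\bf x}$.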

5.3 Solar-like oscillations in stars

Solar-like oscillations, which are excited and damped in the outer convective envelopes of a star, are seen in stars like the Sun and red giants. With the advent of space-based missions like Kepler and COROT that provide high quality photometric data over a long time series, it has now become feasible to detect solar-like oscillations in tens of thousands of stars (Stello et al., 2013, 2015). Typically, the power spectrum of a star with solar-like oscillations (Figure 10) shows a regular pattern of modes, characterized by a large frequency separation $\Delta\nu$ . The overall amplitude is modulated by a Gaussian envelope and this is characterized by the frequency of maximum oscillation $\nu_{\rm max}$ . Theory suggests that $\Delta\nu$ for a given star is related to its density (Ulrich, 1986), whereas the $\nu_{\rm max}$ is related to its surface gravity and temperature (Brown et al., 1991; Kjeldsen & Bedding, 1995). Using the above two relations, the mass and the radius of a star can be constrained. The mass of a red giant is sensitive to its age and this makes asteroseismology very useful for understanding Galactic evolution (Chaplin et al., 2011; Sharma et al., 2016). For further details on solar type oscillations see the review by Chaplin & Miglio (2013).

Bayesian-MCMC based techniques are increasingly being adopted to extract seismic properties, e.g., $\Delta\nu$ and $\nu_{\rm max}$ , by analyzing the power spectrum generated from the time series photometry of a star (Gruberbauer et al., 2009; Kallinger et al., 2010; Handberg & Campante, 2011). The probability that an observed power spectrum $\mathbf{\Gamma}=\{\Gamma_{1},...,\Gamma_{N}\}$ at frequencies $\mathbf{\nu}=\{\nu_{1},...,\nu_{N}\}$ is produced by a model spectrum $\Gamma(\nu;\theta)$ (specified by a set of parameters $\theta$ ), is given by

[TABLE]

as shown by Duvall & Harvey (1986). This forms the basis for the Bayesian treatment of the problem of estimating the parameters $\theta$ via $p(\theta|\{\Gamma_{i}\})\propto p(\{\Gamma_{i}\}|\theta)p(\theta)$ . The power density is modelled as a sum of super-Lorentzian functions

[TABLE]

To fit the individual modes, one assumes Lorentzian profiles. Spherical harmonics are used to describe the oscillations; the modes are characterized by three wave numbers, $n,l$ and $m$ . In Kallinger et al. (2010), eight main modes are fitted (three each for $l=0$ and $l=2$ , and two for $l=1$ ), parameterized by the mode lifetime $\tau$ , the central frequency $\nu_{0}$ , three spacings $\Delta\nu,\delta\nu_{02}$ and $\delta\nu_{01}$ , and the amplitudes $A_{i},A_{j}$ and $A_{k}$ .

[TABLE]

Figure 10 shows the result of fitting the above model to power spectra of four stars observed by the Kepler mission.
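The likelihood of Duvall & Harvey (1986) follows from the fact that each bin of the power spectrum is distributed as $\chi^{2}$ with two degrees of freedom about the model, so that $\ln p(\{\Gamma_{i}\}|\theta)=-\sum_{i}\left[\ln\Gamma(\nu_{i};\theta)+\Gamma_{i}/\Gamma(\nu_{i};\theta)\right]$ . The sketch below evaluates this expression for a toy model consisting of a single Lorentzian mode on a flat background; the toy model, its parameter values and the function names are assumptions made for illustration and are much simpler than the model used in the works cited above.

```python
import numpy as np

def log_likelihood(power, model_power):
    """ln p({Gamma_i} | theta) for chi-square (2 d.o.f.) distributed power."""
    return -np.sum(np.log(model_power) + power / model_power)

def lorentzian_model(nu, height, nu0, width, background):
    """Toy model spectrum: one Lorentzian mode (FWHM = width) plus white noise."""
    return height / (1.0 + ((nu - nu0) / (0.5 * width)) ** 2) + background

# Simulate an observed spectrum: model times a unit-mean exponential deviate
rng = np.random.default_rng(1)
nu = np.linspace(0.0, 100.0, 2000)
model_true = lorentzian_model(nu, height=20.0, nu0=50.0, width=2.0, background=1.0)
power = model_true * rng.exponential(1.0, size=nu.size)

# The true model should be preferred over one with a displaced mode
logl_true = log_likelihood(power, model_true)
logl_shifted = log_likelihood(
    power, lorentzian_model(nu, height=20.0, nu0=60.0, width=2.0, background=1.0)
)
```

Adding the log of a prior on the model parameters to `log_likelihood` gives the log posterior that an MCMC sampler would explore.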

5.4 Extinction mapping and estimation of intrinsic stellar properties

Given the mass $m$ and initial composition (e.g., metallicity [M/H]) of a star, we can use the theory of stellar evolution to predict its state and composition at a later time (age $\tau$ ). However, the intrinsic parameters like mass $m$ , [M/H] and $\tau$ are not directly observable. For most stars we only have photometric information, i.e., apparent magnitudes in different photometric bands (for example $J,Ks,u,g,r$ and $i$ ). The photometry of a star depends upon temperature $T_{\rm eff}$ , gravity $g$ , [M/H], distance $s$ and extinction $E$ (proportional to the dust density integrated along the line of sight to the location of the star). If we have spectroscopy, then we can get temperature $T_{\rm eff}$ , $g$ and even composition, but with uncertainties. From asteroseismology, we can get average seismic parameters like $\Delta\nu$ and $\nu_{\rm max}$ , which are sensitive to the mass, radius and temperature of a star. Given this state of affairs, a common question is: given a certain set of observables of a star, what are its intrinsic parameters, or even some other set of observables? For example, given the photometry of a star, what are its distance, temperature and gravity; or given photometry and distance, what are its temperature and gravity; or given photometry and spectroscopy, what is the distance? And so on. Knowing the intrinsic parameters of a star is also important for understanding the formation and evolution of the Galaxy, for example, the star formation rate, the age-metallicity relation and the distribution of dust in the Galaxy.

The problem of estimating intrinsic stellar parameters of a star given some observables can be formulated as follows. Let ${\bf y}=(J,J-Ks,J-H,T_{\rm eff},\log g,[M/H]_{\rm obs},l,b)$ be a set of observables associated with a star and $\sigma_{\bf y}$ their uncertainties. Let us denote the intrinsic variables of a star that we are interested in by ${\bf x}=([M/H],\tau,m,s,l,b,E)$ . To specify prior probabilities on ${\bf x}$ we need a Galactic model, and we denote by $\theta$ the parameters of such a model. Typically, real catalogs have selection effects, e.g., stars selected to lie in some apparent magnitude and color range, or a set of stars with parallax error less than 10%, or stars with missing information in certain bands. To specify selection effects, we denote the event that a star exists in a catalog by $S$ . From theoretical isochrones we can predict ${\bf y}$ given ${\bf x}$ ; in other words, a function ${\bf y(x)}$ exists. However, we are interested in the inverse problem of estimating ${\bf x}$ given ${\bf y}$ . A Bayesian introduction to solving such a problem was given by Pont & Eyer (2004) and Jørgensen & Lindegren (2005) in the context of estimating ages. The method was further improved and refined by Burnett & Binney (2010); Burnett et al. (2011) and Binney et al. (2014) in the context of the estimation of distances, with a better treatment of priors and selection effects (see also Sale, 2012, 2015). From the Bayes theorem we have

[TABLE]

We now explain each of the terms in detail.

1. $p({\bf x|y,\sigma_{y}},S,\theta)$ is the posterior distribution of intrinsic parameters given the observables, the selection function and a Galactic model.

2. $p(S|{\bf y,x,\sigma_{y}})$ is the selection function. This gives the probability that a star was observed, given the observables. Typically this can be expressed as $p(S|{\bf y})p(S|{\bf x})$ . The term $p(S|{\bf x})$ enters in situations where the value of an observable $y^{\prime}$ is not known but constraints on it are. Then $p(S|{\bf x})=\int p(S|y^{\prime})p(y^{\prime}|x)dy^{\prime}$ . For example, the parallax of a star may be known to be greater than a certain limit, or the apparent magnitude of a star may be missing in a band because the star is too bright or too faint (Burnett & Binney, 2010; Sale, 2012).

3. $p({\bf y|x,\sigma_{y}})$ is the likelihood of the data given the uncertainty and the intrinsic parameters. This can be described by a Gaussian function $\mathcal{N}(y|y({\bf x}),\sigma_{y}^{2})$ for each $y\in{\bf y}$ .

4. $p({\bf x}|\theta)$ is the prior. This describes the distribution of mass, metallicity, age and spatial distribution of stars in the Galaxy. More specifically it can be written as $p(x|\theta)=\sum_{k}p_{k}(m)p_{k}([M/H])p_{k}(\tau)p_{k}(r)$ , where the sum is over different Galactic components, e.g., thin disc, thick disc, bulge and stellar halo.
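How the Gaussian likelihood and the prior from the list above combine can be shown with a deliberately small example in which the only intrinsic variable is the distance $s$ . Everything specific in this sketch is an assumption made for illustration: the forward model `forward_model` (two apparent magnitudes from fixed absolute magnitudes and the distance modulus, with no extinction) stands in for the isochrone-based ${\bf y(x)}$ , and the prior $p(s)\propto s^{2}$ (constant stellar density) stands in for a Galactic model. Selection effects are ignored.

```python
import numpy as np

def log_likelihood(y_obs, sigma_y, y_model):
    """Sum of independent Gaussian terms N(y | y(x), sigma_y^2), one per observable."""
    y_obs, sigma_y, y_model = (np.asarray(a, dtype=float) for a in (y_obs, sigma_y, y_model))
    return np.sum(-0.5 * ((y_obs - y_model) / sigma_y) ** 2
                  - np.log(sigma_y * np.sqrt(2.0 * np.pi)))

def forward_model(distance_pc, abs_mag=(-1.0, -1.5)):
    """Hypothetical y(x): apparent magnitudes in two bands for a star of known type."""
    distance_modulus = 5.0 * np.log10(distance_pc / 10.0)
    return np.asarray(abs_mag) + distance_modulus

def posterior_on_grid(y_obs, sigma_y, distances):
    """Normalized p(s | y) on a uniform distance grid, with prior p(s) ~ s^2."""
    log_post = np.array([log_likelihood(y_obs, sigma_y, forward_model(s))
                         for s in distances])
    log_post += 2.0 * np.log(distances)
    post = np.exp(log_post - log_post.max())   # subtract the maximum for stability
    step = distances[1] - distances[0]
    return post / (post.sum() * step)

# Noise-free observation of a star at 1000 pc with 0.05 mag uncertainties
distances = np.linspace(800.0, 1200.0, 401)
ds = distances[1] - distances[0]
y_obs = forward_model(1000.0)
sigma_y = np.array([0.05, 0.05])
post = posterior_on_grid(y_obs, sigma_y, distances)
s_map = distances[np.argmax(post)]
```

In a realistic application the grid is replaced by MCMC sampling over all components of ${\bf x}$ , and the distance posterior is obtained by marginalizing over the remaining variables.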

We now focus on the problem of estimating distance and extinction. For simplicity, we ignore the selection effects; for an in depth discussion, see Sale (2015). By marginalizing over stellar parameters $\tau,m$ and $[M/H]$ one obtains $p(s,E|{\bf y,\sigma_{y}},\theta)$ . If we have $N$ stars along a line of sight, we can estimate the distance-extinction relationship $E(s_{i};\alpha)$ parameterized by $\alpha$ as

[TABLE]

The above method is used by Green et al. (2014, 2015) to construct three dimensional maps of interstellar dust reddening using Pan-STARRS 1 and 2MASS photometry (Figure 11). To estimate $p(s,E|{\bf y,\sigma_{y}},\theta)$ , Green et al. (2015) do a kernel density estimate over samples generated by MCMC, while Sale & Magorrian (2015) present a method based on the Gaussian mixture model. As described in Sale (2012), we can also directly estimate $\alpha$ and intrinsic parameters ${\bf x}$ of each star along a line of sight by setting up the problem as a BHM and sampling from the following posterior:

[TABLE]

The Metropolis-within-Gibbs scheme is used to accomplish this sampling.

5.5 Kinematic and dynamical modelling of the Milky Way

Understanding the origin and evolution of the Milky Way has received a significant boost due to the emergence of large data sets that catalog the properties of stars in the Milky Way (Binney, 2011; McMillan & Binney, 2012, 2013; Rix & Bovy, 2013; Binney, 2013; Bland-Hawthorn & Gerhard, 2016). Bayesian methods and MCMC based schemes are now playing a prominent role in the analysis and interpretation of such large and complex data sets from, e.g., the GCS survey (Schönrich, Binney & Dehnen, 2010), the SEGUE survey (Bovy et al., 2012b), the APOGEE survey (Bovy et al., 2012a; Bovy & Rix, 2013), and the RAVE survey (Sharma et al., 2014; Piffl et al., 2014; Sanders & Binney, 2015). We focus on the problem of determining the mass distribution, or equivalently the gravitational potential of the Milky Way, using halo stars (Kafle et al., 2014) and disc masers (McMillan, 2017).

The observational data of stars in the Milky Way is in heliocentric coordinates and is in the form of angular positions on sky (Galactic longitude $\ell$ and latitude $b$ ), heliocentric distance ( $s$ ), heliocentric line of sight velocity ( $v_{\rm los}$ ), and proper motion (tangential motion on the sky, $\mu_{\ell}$ and $\mu_{b}$ ). The velocity of halo stars can be described by a simple Gaussian model of the following form

[TABLE]

for which $\theta_{v}$ is the set of parameters that govern the velocity dispersion profiles $\sigma_{vr},\sigma_{v\theta}$ and $\sigma_{v\phi}$ . The coordinates $(r,\theta,\phi)$ are in the Galactocentric reference frame. The observed heliocentric coordinates can be converted to Galactocentric coordinates using prior estimates of the location and the motion of the sun. For the stellar halo stars, tangential velocities cannot be accurately determined. The distance also has some uncertainty, $\sigma_{s}$ . Hence we marginalize over unknown tangential velocities and true distance $s^{\prime}$ , to obtain

[TABLE]

The parameters $\theta_{v}$ can now be estimated using the data $D$ of multiple stars by

[TABLE]

The marginalization in Equation (103) can be handled in various ways. One can make use of deterministic numerical integration techniques (Gaussian quadrature) or one can achieve marginalization via Monte Carlo schemes making use of importance sampling. For Monte Carlo based integration one can make use of the MCMH algorithm discussed in Section 3.9. Alternatively, one can treat $v_{\ell},v_{b}$ and $s$ as unknowns by setting them up as a BHM and estimate them alongside $\theta$ by making use of the Metropolis-within-Gibbs scheme discussed in Section 4.1. The radial velocity dispersion profile of halo stars computed using blue horizontal branch and red giant stars in the SEGUE survey is shown in Figure 12 (Kafle et al., 2014).

We now proceed to estimating the potential $\Phi$ . Given $\Phi$ , density of halo stars $\rho$ and anisotropy $\beta=1-(\sigma_{v\theta}^{2}+\sigma_{v\phi}^{2})/(2\sigma_{vr}^{2})$ as function of distance $r$ from the Galactic center, one can solve for $\sigma_{vr}(r)$ . Let $\theta$ be the set of parameters used to define the above profiles. So for given $\theta$ , the model makes a prediction for radial velocity dispersion $\sigma_{vr}(r_{i};\theta)$ at a location $r_{i}$ . This can be compared with the $\sigma_{vr}(r_{i})$ estimated from the observed data. The probability of model parameters $\theta$ is then given by

[TABLE]

The posterior distribution for the virial mass and the concentration parameter of the Milky Way halo using BHB and giant stars is shown in Figure 12 (Kafle et al., 2014).

We now discuss ways to incorporate prior information into the analysis. For example, the angular velocity of the Sun with respect to the Galactic Center $\omega$ is well constrained to be within $30.24\pm 0.12\>{\rm km\ s}^{-1}{\rm kpc}^{-1}$ (Reid & Brunthaler, 2004). The vertical force at $1.1$ kpc above the Sun, in terms of surface mass density, is given by $\Sigma_{1.1,\odot}=72\pm 6\>M_{\odot}\,{\rm pc}^{-2}$ (Kuijken & Gilmore, 1991). Let us denote such constraints by $p(g_{j}(\theta)|\theta)$ . Additional data sets $D_{k}$ , constraining a certain subset of parameters, can also exist. For example, the tangent point velocities or terminal velocities as a function of Galactic longitude $v_{\rm term}(\ell)$ help to constrain the shape of the circular velocity curve $v_{\rm circ}(R)=\sqrt{|Rd\Phi/dR|}$ . The additional priors and data all enter as multiplicative factors in the posterior, which is given by

[TABLE]
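Because the constraints enter as multiplicative factors, they become additive terms in the log posterior. The following sketch shows one way to organize this, assuming each constraint is Gaussian with the central value and uncertainty quoted above. The predictor functions $g_{j}(\theta)$ , the parameter layout of `theta` and the flat likelihood used in the example are hypothetical placeholders, and the units of the surface density are an assumption of this sketch.

```python
import numpy as np

def gaussian_logpdf(value, mean, sigma):
    """ln N(value | mean, sigma^2)."""
    return -0.5 * ((value - mean) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Constraint name -> (central value, uncertainty), as quoted in the text
CONSTRAINTS = {
    "omega_sun": (30.24, 0.12),  # km/s/kpc (Reid & Brunthaler, 2004)
    "sigma_1p1": (72.0, 6.0),    # surface density, assumed Msun/pc^2 (Kuijken & Gilmore, 1991)
}

def log_posterior(theta, log_likelihood, log_prior, predictors):
    """ln p(theta | D) up to a constant: prior + likelihood + external constraints."""
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    lp += log_likelihood(theta)
    for name, (mean, sigma) in CONSTRAINTS.items():
        # predictors[name] plays the role of g_j(theta)
        lp += gaussian_logpdf(predictors[name](theta), mean, sigma)
    return lp

# Hypothetical parameterization: theta = (solar velocity in km/s, R0 in kpc, surface density)
predictors = {
    "omega_sun": lambda th: th[0] / th[1],
    "sigma_1p1": lambda th: th[2],
}
flat = lambda th: 0.0  # placeholder for the real likelihood and prior

lp_a = log_posterior((30.24 * 8.0, 8.0, 72.0), flat, flat, predictors)
lp_b = log_posterior((30.24 * 8.0, 8.0, 78.0), flat, flat, predictors)
```

Here `lp_b` is lower than `lp_a` by 0.5, the penalty for a surface density one standard deviation away from the quoted value. Additional data sets $D_{k}$ would be added in the same way, each contributing its own log-likelihood term.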

The halo stars carry little information about the mass distribution close to the center and in the disc of the Milky Way. Galactic masers associated with high mass star forming regions are very good tracers of the Milky Way disc, which makes them excellent candidates for studying the potential of the Milky Way (Reid et al., 2009; McMillan, 2011; Reid et al., 2014; McMillan, 2017). Due to extremely accurate astrometric information using very long baseline interferometry, one has very accurate parallax ( $\varpi$ ) and proper motion measurements. When combined with line of sight velocities from Doppler shift of spectral lines, one ends up with full 6D phase space information for these sources. Maser sources are young and have very little random motion, which means their orbits are highly circularized. The distribution of velocities can be described by a simple three dimensional Gaussian function, i.e.

[TABLE]

Here, ${\bf v}_{\rm M}=(v_{R,{\rm M}},v_{\phi,{\rm M}},v_{z,{\rm M}})$ is any systematic streaming velocity associated with the masers and $\sigma_{v{\rm M}}$ is the velocity dispersion about the mean motion. Now, we have

[TABLE]

The last term is evaluated using Equation (107), by converting from heliocentric coordinates $(\mu_{\alpha}^{\prime},\mu_{\delta}^{\prime},v_{\rm los}^{\prime},\alpha,\delta,\varpi^{\prime})$ to Galactocentric coordinates $(v_{R},v_{\phi},v_{z},R,\phi,z)$ . Let $D_{1}$ denote the full data of $N$ stars then

[TABLE]

Substituting this into Equation (106) gives the posterior distribution of model parameters.

6 Concluding remarks

The power of the Bayesian probability theory lies in the fact that it is mathematically simple, being based on just two elementary rules, and yet it is broadly applicable. However, Bayesian calculations can be computationally demanding, and this has acted as a major bottleneck in the past. But with the increase of computational power, we have witnessed a sharp increase in the adoption of Bayesian techniques. More recently, free availability of black-box computer packages to efficiently sample from Bayesian posterior distributions has further accelerated the adoption of Bayesian techniques in astronomy.

Robust algorithms are now available to sample multidimensional and complex pdfs. The MH algorithm is still the main workhorse of MCMC methods. Good solutions now exist for the issue of application specific tuning of the proposal distribution in the MH algorithm, e.g., adaptive Metropolis schemes and the affine invariant samplers. The MH algorithm when combined with parallel tempering allows one to sample a wide variety of commonly occurring distributions. Situations in which the posterior is not analytically tractable can also now be handled using the Monte Carlo version of the MH algorithm.

Bayesian methods also provide a framework for model comparison via the use of Bayesian evidence. However, efficient computing of evidence still remains a challenge. Various alternate criteria for comparing models exist and importantly these can make use of the computed MCMC chain.

Bayesian hierarchical models further increase the usefulness of the Bayesian framework. They can solve missing data problems, marginalization over variables, convolution with observational uncertainties and so on. This makes a wide class of complex problems suddenly solvable. We showed that the Metropolis-within-Gibbs scheme is ideally suited for sampling posteriors generated by Bayesian hierarchical models and also provide software for doing this.

Multimodal distributions still pose a problem for most MCMC algorithms. Parallel tempering can overcome them but requires more computational time and a careful choice of ladder. If dimensionality of the space being explored is very high and the distribution is complex, efficient exploration is not easy. Techniques are being developed to solve such problems that make use of derivatives of the posterior distribution, e.g., Hamiltonian Monte Carlo. However, more work is required in this area. Efficient exploration of multi-level hierarchical models will play an increasingly important role in future studies.

Communication of Bayesian results is also an area where we anticipate improvements. Traditionally, the estimates are reported by means of confidence intervals. However, there is much more information in the MCMC chain, in particular, the correlation between different variables. Also, there is an increasing need to feed results of one MCMC simulation into another. Such requirements are best addressed by reporting the full pdfs or the thinned samples from them. Other alternatives that are economical in terms of storage space are to approximate the pdf by analytical functions or to employ Gaussian mixture models. We also need better tools to visualize the Bayesian-MCMC output, especially for high dimensional and complex hierarchical models. Such tools will allow us to understand why a model fails and how we should improve it.

There are key topics which we have not addressed here. Non-parametric Bayesian methods are becoming increasingly important, e.g., Gaussian processes (Beaumont, Zhang & Balding, 2002) and Dirichlet process mixture models (Neal, 2000). Magorrian (2014) uses this method to estimate the gravitational potential of the Milky Way.

Astronomy is no longer a data-starved science. With projects like the Large Synoptic Survey Telescope and the Square Kilometre Array, the quality and quantity of data are going to increase dramatically in the coming years. Better quality and larger quantity of data mean that we can expect our data to answer more difficult questions, which in turn means more complex models (e.g. multi-level hierarchies and a higher dimensional parameter space). Given that MCMC is a computationally expensive scheme, there will be an increasing demand for techniques that can make full use of the vast quantity of data on offer and deliver results in an affordable amount of time.

Similarly, MCMC schemes that make use of computing environments with multiple processors and graphics processing units would also be useful. An MCMC chain is serial by nature and it requires special care to parallelize an MCMC algorithm, e.g., use of an ensemble of chains (Foreman-Mackey et al., 2013) or parallelizing the posterior computation by splitting up the data. Relaxing the condition of reversibility can lead to MCMC algorithms with faster mixing properties (Chen, Lovász & Pak, 1999; Diaconis, Holmes & Neal, 2000; Girolami & Calderhead, 2011). Finally, the development of approximate methods, both application specific and general, that can reduce the computational cost without significantly compromising the quality of results also holds great promise for analyzing large data sets. Approximate Bayesian computation is one such framework (Beaumont, Zhang & Balding, 2002); see Bovy (2016) for its use in astronomy to study the chemical homogeneity of stars in open clusters.

DISCLOSURE STATEMENT

The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

ACKNOWLEDGMENTS

I am indebted to my colleague Joss Bland-Hawthorn for suggesting this article and for supervising its development over the past year. I am thankful to James Binney, Jo Bovy, Brendon Brewer, Prajwal Kafle and Prasenjit Saha for numerous suggestions and discussions from which the review has benefited significantly. I am also thankful to David Hogg for words of encouragement on the draft. I acknowledge funding from a University of Sydney Senior Fellowship made possible by the office of the Deputy Vice Chancellor of Research, and partial funding from Bland-Hawthorn’s Laureate Fellowship from the Australian Research Council.

Bibliography

Akaike H. 1974. IEEE Transactions on Automatic Control 19:716–723
Andrieu C, Robert CP. 2001. Controlled MCMC for optimal sampling. Tech. rep., Citeseer http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.2048
Andrieu C, Roberts GO. 2009. The Annals of Statistics 37:697–725
Andrieu C, Thoms J. 2008. Statistics and Computing 18:343–373
Barker A. 1965. Australian Journal of Physics 18:119–134
Bayes M, Price M. 1763. Philosophical Transactions 53:370–418
Beaumont MA. 2003. Genetics 164:1139–1160
Beaumont MA, Zhang W, Balding DJ. 2002. Genetics 162:2025–2035
