TL;DR
This paper reviews the use of Markov Chain Monte Carlo methods for Bayesian data analysis in astronomy, covering foundational theory, various algorithms, and advanced techniques, with software tools provided.
Contribution
It offers a comprehensive overview of MCMC methods in astronomical Bayesian analysis, including new algorithms and practical software implementations.
Findings
Enhanced MCMC algorithms for complex astronomical data
Software tools available for practitioners
Discussion of future directions in Bayesian analysis
Abstract
Markov Chain Monte Carlo based Bayesian data analysis has now become the method of choice for analyzing and interpreting data in almost all disciplines of science. In astronomy, over the last decade, we have also seen a steady increase in the number of papers that employ Monte Carlo based Bayesian analysis. New, efficient Monte Carlo based methods are continuously being developed and explored. In this review, we first explain the basics of Bayesian theory and discuss how to set up data analysis problems within this framework. Next, we provide an overview of various Monte Carlo based methods for performing Bayesian data analysis. Finally, we discuss advanced ideas that enable us to tackle complex problems and thus hold great promise for the future. We also distribute downloadable computer software (available at https://github.com/sanjibs/bmcmc/ ) that implements some of the algorithms…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
\jvol
55 \jyear2017
Markov Chain Monte Carlo Methods for Bayesian Data Analysis in Astronomy
Sanjib Sharma1
Draft version. To appear in Annual Review of Astronomy and Astrophysics.
1Sydney Institute for Astronomy, School of Physics, University of Sydney, NSW 2006, Australia, email: [email protected]
Abstract
Markov Chain Monte Carlo based Bayesian data analysis has now become the method of choice for analyzing and interpreting data in almost all disciplines of science. In astronomy, over the last decade, we have also seen a steady increase in the number of papers that employ Monte Carlo based Bayesian analysis. New, efficient Monte Carlo based methods are continuously being developed and explored. In this review, we first explain the basics of Bayesian theory and discuss how to set up data analysis problems within this framework. Next, we provide an overview of various Monte Carlo based methods for performing Bayesian data analysis. Finally, we discuss advanced ideas that enable us to tackle complex problems and thus hold great promise for the future. We also distribute downloadable computer software (https://github.com/sanjibs/bmcmc/) that implements some of the algorithms and examples discussed here.
doi:
10.1146/((please add article doi))
keywords:
Methods: data analysis, numerical statistical
††journal: Annu. Rev. Astron. Astrophys.
Contents
-
1.1 Rise of MCMC based Bayesian methods in astronomy and science
-
4.1 Expectation maximization, data augmentation and Gibbs sampling
-
5.1 Exoplanets and binary systems using radial velocity measurements
-
5.2 Data driven approach to estimation of stellar parameters from a spectrum
-
5.4 Extinction mapping and estimation of intrinsic stellar properties
1 Introduction
Markov Chain Monte Carlo (MCMC) and Bayesian Statistics are two independent disciplines, the former being a method to sample from a distribution while the latter is a theory to interpret observed data. When these two disciplines are combined together, the effect is so dramatic and powerful that it has revolutionized data analysis in almost all disciplines of science, and astronomy is no exception. This review explores the power of this combination.
What is so special about MCMC based Bayesian data analysis? The usefulness of Bayesian methods in science and astronomy is easy to understand. In many situations, it is easy to predict the outcome given a cause. But in science, most often, we are faced with the opposite question. Given the outcome of an experiment what are the causes, or what is the probability of a cause as compared to some other cause? If we have some prior information, how does that help us? This opposite problem is more difficult to solve. The power of Bayesian theory lies in the fact that it provides a unified framework to quantitatively answer such questions. Hence it has become indispensable for science. As opposed to deductive logic, Bayesian theory provides a framework for plausible reasoning, a concept which is more powerful and general, an idea championed by Jaynes (2003) in his book.
The question now is how does one solve a problem that has been set up using Bayesian theory. This mostly involves computing the probability distribution function (pdf) of some parameters given the data and is written as . Here, need not be a single parameter; in general, it represents a set of parameters. Usually here and elsewhere, such functions do not have analytical solutions and so we need methods to numerically evaluate the distribution. This is where MCMC methods come to the rescue. They provide an efficient and easy way to sample points from any given distribution which is analogous to evaluating the distribution.
Bayesian data analysis (Jeffreys, 1939) and Markov Chain Monte Carlo (Metropolis et al., 1953) techniques have existed for more than 50 years. Their tremendous increase in popularity over the last decade is due to an increase in computational power which has made it affordable to do such computations.
The simplest and the most widely used MCMC algorithm is the “random walk” Metropolis algorithm (Section 3.2). However, the efficiency of this algorithm depends upon the “proposal distribution” which the user has to supply. This means that there is some problem-specific fine tuning to be done by the user. The problem to find a suitable proposal distribution becomes worse as the dimensionality of the space over which the sampling is done increases. Correlations and degeneracies between the coordinates further exacerbate the problem. Many algorithms have been proposed to solve this problem and it remains an active area of research. Some algorithms work better for specific problems and under special conditions but algorithms that work well in general are in high demand. Multimodal distributions pose additional problems for MCMC algorithms. In such situations, an MCMC chain can easily get stuck at a local density maximum. To overcome this, algorithms like simulated tempering and parallel tempering have been proposed (Section 3.8). Hence discussion of efficient MCMC algorithms is one focus of this review.
Given its general applicability, the Bayesian framework can be used in almost any field of astronomy. Hence, it is not possible to discuss all its applications. However, there are many examples where either alternatives do not exist or are inferior. The aim of this review is to specifically discuss such cases where Bayesian-MCMC methods have enjoyed great success. The Bayesian framework by itself is very simple. The difficult part when attempting to solve a problem is to express the problem within this framework and then to choose the appropriate MCMC method to solve it. The best way to master this is by studying a diverse set of applications, and we aim to provide this in our review (Section 5). Finally, we also discuss a few advanced topics like non-parametric models and hierarchical Bayesian models (Section 4) which are not yet main stream in astronomy but are very powerful and allow one to solve complex problems.
To summarize, the review has three main aims. The first is to explain the basics of Bayesian theory using simple familiar problems, e.g., fitting a straight line to a set of data points with errors in both coordinates and in the presence of outliers. This is targeted at readers who are new to the topic. The second goal is to provide a concise overview of recent developments. This will benefit people who are familiar with Bayesian data analysis but are interested in learning more. The final aim is to discuss emerging ideas that hold great promise in future. We also develop and distribute downloadable software (available at https://github.com/sanjibs/bmcmc/ or by running the command pip install bmcmc) implementing some of the algorithms and examples that we discuss.
1.1 Rise of MCMC based Bayesian methods in astronomy and science
The emergence of Bayesian statistics has a long and interesting history dating back to 1763 when Thomas Bayes laid down the basic ideas of his new probability theory (Bayes & Price, 1763, published posthumously by Richard Price). It was rediscovered independently by Laplace (de Laplace, 1774) and used in a wide variety of contexts, e.g., celestial mechanics, population statistics, reliability, and jurisprudence. However, after that it was largely ignored. A few scientists like, Bruno de Finetti and Harold Jeffreys kept the Bayesian theory alive in the first half of the 20th century. Harold Jeffreys published the book Theory of Probability (Jeffreys, 1939), which for a long time remained the main reference for using the Bayes theorem. The Bayes theorem was used in the Second World War at Bletchley Park, United Kingdom, for cracking the German Enigma code, but its use remained classified for many years afterwards. From 1950 onwards, the tide turned towards Bayesian methods. However, the lack of proper tools to do Bayesian inference remained a challenge. The frequentist methods in comparison were simpler to implement which made them more popular. Recent statement by the American Statistical Association, (Wasserstein & Lazar, 2016) warning on the misuse of P values is another example of the superiority of the Bayesian methods of hypothesis testing.
Interestingly, efficient methods like MCMC to sample distributions had been invented by 1954 in the context of solving problems in statistical mechanics (Metropolis et al., 1953). (The brand name Monte Carlo was coined by Metropolis & Ulam (1949) where they discussed a stochastic method making use of random numbers to solve a class of problems in mathematical physics which are difficult to solve due to the large number of dimensions.) Such problems typically involve interacting particles. A single configuration of such a system is fully specified by giving the position and velocity of all the particles; i.e., can be defined by a point in space, also known as the configuration space . The total energy is a function of the configuration . For a system in equilibrium, the probability of a configuration is given by , where is the Boltzmann constant and is the temperature of the system. Computing any thermodynamic property of the system, e.g., pressure or energy typically involves computing integrals of the form
[TABLE]
for which is known as the partition function. The integrals over are in most cases analytically and computationally intractable. The idea of Metropolis and colleagues was to start with an arbitrary configuration of particles and then move each particle by a random walk. If , the move is always accepted, otherwise, it is accepted stochastically with probability , which is the ratio of the probability of the new configuration with respect to the old. The method ends up choosing a configuration sampled from . The method immediately became popular in the statistical physics community.
However, the fact that the same method can be used for sampling an arbitrary pdf by simply replacing with had to wait till the important paper by Hastings (1970). He generalized the work of Metropolis and colleagues and derived the essential condition for the acceptance ratio that a Markov chain ought to satisfy in order to sample the target distribution. The generalized algorithm is now known as the Metropolis-Hastings (MH) algorithm. Later Hastings’s student Peskun showed that, among the available choices, the one by Metropolis and colleagues was the most efficient (Peskun, 1973). Despite its introduction to the statistical community, the ideas remained dormant till 1980.
Around 1980 things suddenly changed and a few influential algorithms appeared. Simulated annealing was presented by Kirkpatrick et al. (1983) to solve combinatorial optimization problems using the MH algorithm in conjunction with ideas of annealing from solid state physics. It is especially useful for situations where we have multiple maxima and applies to any setting where we have to minimize an objective function . This is done by sampling , with progressively decreasing to allow annealing and selection of a globally optimum solution. A year later Geman & Geman (1984) introduced what we currently know as “Gibbs sampling” in the context of image restoration. This was the first proper use of MCMC techniques to solve a problem set up in a Bayesian framework, in the sense that simulating from conditional distributions is the same as simulating from the joint distribution. However, there exists earlier work related to Gibbs sampling; the Hammersley-Clifford theorem which was developed in the early 1970s and the work by Besag (1974).
At about this time, one of the most influential methods of the 20th-century emerged the expectation maximization (EM) algorithm by Dempster, Laird & Rubin (1977). This provided a way to deal with missing data and hidden variables and vastly increased the range of problems that can be addressed by Bayesian methods. The EM algorithm is deterministic and has some sensitivity to the starting configuration. To address this, stochastic versions were developed (Celeux & Diebolt, 1985) quickly followed by the data augmentation (DA) algorithm (Tanner & Wong, 1987).
The watershed moment in the field of statistics is largely credited to the paper by Gelfand & Smith (1990) that unified the ideas of Gibbs sampling, DA and the EM algorithm (Tanner & Wong, 2010; Robert & Casella, 2011). It firmly established that Gibbs sampling and Metropolis-Hastings based MCMC algorithms can be used to solve a wide class of problems that fall into the category of hierarchical Bayesian models. The citation history of the famous Metropolis et al. (1953) paper shown in Figure 1 corroborates the historical narrations on this topic. In physics, the MH algorithm was well known in the period 1970-1990, but this was not so in statistics or astronomy. In astronomy, a watershed moment can be seen in 2002; this is visible more clearly in Figure 2 where we track the usage of the words MCMC and Bayesian.
But prior to 2002, the Bayesian-MCMC technique was not unknown to the astronomy community. We can see its use in Saha & Williams (1994) who applied it to extract galaxy kinematics from absorption line spectra. Further seeds were planted down the line by Christensen & Meyer (1998) while studying gravitational wave radiation, and then by Christensen et al. (2001) and Knox, Christensen & Skordis (2001) in the context of cosmological parameter estimation using cosmic microwave background data. Inspired by these papers, Lewis & Bridle (2002) more than any other paper seems to have galvanized the astronomy community in the use of Bayesian and MCMC techniques. They laid out in detail the Bayesian-MCMC framework, applied it to one of the most important data sets of the time (cosmic background radiation) and used it to address a significant scientific question the fundamental parameters of our universe. Additionally, they made their MCMC code publicly available, which was instrumental in lowering the barrier for new entrants to the field.
2 Bayesian Data Analysis
In this section we briefly review the basics of the Bayesian theory. We start with the Bayes theorem and then use it to set up the problem of fitting a model to data. This is followed by a discussion of the role of priors in Bayesian analysis. Next, the Bayesian solution of fitting a straight line is discussed in detail to illustrate the ideas discussed. Finally, we show how to perform model selection. To further explore the topics discussed here, many excellent resources are available. A stimulating discussion on Bayesian theory can be found in Jaynes (2003). Sivia & Skilling (2006) and Gregory (2005) are excellent textbooks with a good emphasis on applications in science. Hogg, Bovy & Lang (2010) provides lucid tutorial on fitting models to data. A fascinating discussion on Bayesian versus frequentist approaches to solving problems can be found in Loredo (1990). A review with emphasis on cosmology is given by Trotta (2008).
2.1 Bayes’ Theorem
Cox (1946) showed that the rules of Bayesian probability theory can be derived from just two basic rules:
[TABLE]
Here stands for some proposition being true and stands for some other proposition being true, and means the proposition is false. So the sum rule just states that the probability of a proposition being true plus the probability of it being false is unity. The product rule expresses the joint probability of two propositions being true in terms of conditional probabilities, one being true given the other is true. The vertical bar is a conditioning symbol and means ‘given’. denotes relevant background information that is used to construct the probabilities. The product rule leads to the Bayes Theorem
[TABLE]
where we identify with the hypothesis and with the data. The is the probability of observing the data if the hypothesis is true and is known as the likelihood. The quantity is the prior and specifies our prior knowledge of being true. The , known as posterior, expresses our updated belief about the truth of the hypothesis in light of the data . The quantity is a constant and serves the purpose of normalizing to 1. It is known as the evidence.
Another important result that can be derived from the sum rule and the product rule is the marginalization equation,
[TABLE]
First let us write the sum rule in an alternate form. Instead of considering just and , we consider a set of possibilities that are mutually exclusive.
[TABLE]
Now, making use of the product rule and the sum rule we get
[TABLE]
2.2 Fitting a model to data
Typically, we have some data and we want to use it for scientific inference. One of the most effective approaches to dealing with such problems is to develop a model that describes how the data were created. Let be the set of parameters of the model and a data point generated by the model according to . The observed data points can have some measurement errors, described by a parameter . The probability of the observed value is then given by , which could be for Gaussian errors; hereafter, refers to a normal distribution with mean and variance . The probability of observed data point given a model and an error is then
[TABLE]
We have integrated over true values which are unknown.
If we have reason to believe that there are outliers in the data, e.g., a fraction of points are not described by the model, we can supplement a background model with probability and parameters (Press, 1997; Hogg, Bovy & Lang, 2010). The probability of the observed data points can then be written as,
[TABLE]
The total probability for a set of data points is then
[TABLE]
To infer the model parameters, one uses the Bayes theorem and computes
[TABLE]
Here, represents our prior knowledge about the parameters. We discuss this in detail in the next section.
We consider the problem of fitting a straight line with equation to some data points and , with uncertainty on the ordinate. We generated 50 data points with and ; 20% of the data points were set as outliers and were sampled from . To simulate random uncertainty the ordinate was scattered with a Gaussian function having dispersion in range . The data along with the results of our fitting exercise are shown in Figure 3. The image shows the outliers and data sampled from a straight line. We first fitted a simple model without taking the outliers into account (dashed line). Here, and the generative model of the data is
[TABLE]
It can be seen that the “best fit” line is not a good description for the data points that were sampled from a straight line. Next, we extended the model by adding a model for the outliers as
[TABLE]
The full model being
[TABLE]
The best-fit line resulting from this model obtained by sampling the posterior distribution using a Markov Chain Monte Carlo scheme is shown in Figure 3. The best-fit parameters of the model resemble well the true parameters that were used to create the synthetic data set (the example is implemented in the software that we provide).
2.3 Priors
Priors are one of the most important ingredients of the Bayesian framework. Priors express our present state of knowledge about the parameters of interest, which we wish to constrain by analyzing new data. In a multi-dimensional parameter space, it is quite common to have degeneracies among the parameters. Here priors can play a crucial rule in restricting the posterior to a small region of the parameter space as compared to the much larger region allowed by the likelihood function. Priors can be broadly classified into two types, uninformative and informative. Uninformative priors express our state of ignorance and have very little restricting power. They are also known as ignorance prior. Typically their distributions are diffuse. Informative priors on the other hand By contrast, informative priors are typically very restricting. They might come from the analysis of some previous data. They are important when the data alone are not very informative and without external information the data cannot adequately constrain the parameters being investigated.
Ignorance priors are used in cases where we have very little knowledge about the parameters we want to constrain, and we wish to express our ignorance by using uninformative priors. Certainly a prior with sudden jumps or oscillating features is too detailed for expressing ignorance! So smoothness is certainly an important criterion for an ideal uninformative prior. In fact, if the data are informative, almost any prior that is sufficiently smooth in the region of high likelihood will lead to very similar conclusions. Is there a formal and unique way to express our ignorance?
A number of techniques exist for constructing ignorance priors. We here discuss a few simple and commonly used ones; for a detailed review see Kass & Wasserman (1996). The simplest is Laplace’s principle of insufficient reason which assigns equal probability to all possible values of the parameter. If the parameter space consists of a finite set of points, then it is easy to apply this principle. However, for a continuous parameter space, the prior depends upon the chosen partitioning scheme.
Ignorance priors can also be specified using the invariance of the likelihood function, , under the action of a transformation group , e.g., translation, scaling or rotation of coordinates. If the priors are really uninformative, consistency demands that we should make the same Bayesian inference, which implies that the priors should also be invariant to the transformation and satisfy (Jaynes, 2003). For two special types of parameters, this leads to unique choices for expressing ignorance. These are the location parameters and the scale parameters. An example is the mean and dispersion of a normal distribution which are the location and the scale parameters respectively. The likelihood is invariant under transformation , demanding invariance for the prior leads to . Similarly, is also invariant under , which leads to . In general, and are location and scale parameters if likelihood is of the form .
Another commonly used technique to specify ignorance priors is the Jeffreys rule,
[TABLE]
is the Fisher information matrix and a vector of parameters. It is based on the idea that the prior should be invariant to reparameterization of . Applying it to the case where the likelihood is a normal distribution , gives (for a fixed ) and (for a fixed ). However, when applied to both and together, it gives . To avoid this contradiction the rule was modified to
[TABLE]
where are location parameters and is calculated keeping them fixed.
The principle of maximum entropy (Jaynes, 1957) is also helpful for selecting priors. Suppose we are interested in knowing the pdf of a variable, e.g., the probability of a given face of a six-faced die landing up. Suppose we also have some macroscopic constraint available to us, e.g., the mean value obtained when the die is rolled a large number of times. Such a constraint cannot uniquely identify a pdf but can be used to rule out a number of pdfs. The principle says that out of all possible pdfs satisfying the constraint, the most likely one is the one having maximum entropy, where the entropy is defined as . We now use this principle to derive the most likely distribution of a variable for two common cases.
- •
If for a variable we know the expectation value and the fact that it lies in the range then the maximum entropy distribution of is
- •
If and variance are known, then
2.4 Fitting a straight line
We now consider the Bayesian solution for fitting a straight line in detail (see also Jaynes, 1991; Hogg, Bovy & Lang, 2010). We first discuss the solution for the general case where we have uncertainties on both and coordinates and then discuss the case where the uncertainties are unknown. Suppose we have a collection of points , , with uncertainties . Here is the covariance matrix defined as
[TABLE]
We want to fit a line to these data. For the time being, we assume to be a diagonal matrix with . Let be the true values corresponding to the point . Then the probability of measuring the point at is
[TABLE]
Let us consider a generative model for the line. We consider the pdf of a line to be described by a Gaussian with width along a direction perpendicular to the line and width along the line. Here, can be thought of as an intrinsic scatter about the linear relation that we wish to investigate. In the limit , the probability of a point to be sampled from this generative model is
[TABLE]
Hence, the probability of being sampled from the generative model of the line is
[TABLE]
where is the component of the error vector perpendicular to the line, and is the perpendicular distance of the point from the line. For a general matrix , for which is a unit vector perpendicular to the line. For the full sample,
[TABLE]
If we desire to compute and , then
[TABLE]
Henceforth, is a normalization constant which may be different in different equations. The is the prior distribution of parameters of the line. The two common choices for the prior are the uniform (flat) and Jeffreys prior. Neither is appropriate. Given the rotational symmetry in the problem, a sensible choice is to have priors that are symmetric with respect to rotation. Let be the angle made by the line with axis, and be the distance of the line from the origin. A uniform prior on and is symmetric with respect to rotation. This leads to
[TABLE]
In Figure 4, we graphically show how a prior uniform in differs from a prior uniform in . In the left panel, we show straight lines uniformly spaced in . The lines tend to crowd at high value of , and this can bias the estimate of the slope . In the right panel, the lines are uniformly spaced in , and there is no crowding effect.
The log-likelihood of the full solution after taking the prior into account is
[TABLE]
We now study the case where is unknown and . For simplicity, we assume the uncertainty is the same for all data points, i.e., .
[TABLE]
and integrate over using Jeffreys prior to arrive at
[TABLE]
So, if we ignore the prior factor, the best fit line is simply the line that minimizes the sum of the squared perpendicular distances of points from the line.
2.5 Model comparison
When we have multiple models to explain data, we are faced with the question of which model is better. There is no unique definition of better and depending upon what we mean by better we can come up with different criteria to compare models. We have two main schools of thought, a) to compare the probability of the model given the data and b) to compare the expected predictive accuracy of the model for the future data. The former is inherently a Bayesian approach and is known as Bayesian model comparison. The latter is inspired by frequentist ideas but can also be argued from a Bayesian perspective (Vehtari & Ojanen, 2012; Gelman, Hwang & Vehtari, 2014).
2.5.1 Bayesian model comparison
In the Bayesian formulation, the usefulness of a model is indicated by the probability of a model given the data ,
[TABLE]
The prior model probability is generally assumed to be unity. Note in some cases it may not be so, and we might have more reason to believe one model over the other. The is the same for all models, so it is irrelevant when comparing models. Thus the main thing we need to compute is the evidence (also know as marginal likelihood). Hence, for two models and , the odds ratio in favor of compared to is mainly determined by the ratio of their evidences, , also known as the “Bayes factor” (for a review and a guide to interpreting the Bayes factor, see Kass & Raftery, 1995).
[TABLE]
For some given data and a model parameterized by , we have
[TABLE]
The evidence appears as the denominator on the right hand side and can be obtained by integrating both sides of Equation (36) over all . For properly normalized quantities, the left hand side integrates to unity, leading to .
Note, the Bayes factor depends upon the adopted range of the prior which leads to some conceptual difficulties (see the paradox in Lindley, 1957). The range of prior is not an issue for parameter estimation but it is for model selection; we cannot use improper priors. In most cases, we do have a reasonable sense of the range of priors and they are unlikely to extend to infinity. To better understand the role of priors, consider two models and , where has a free parameter , while has no free parameter (with being fixed to ). Let be the characteristic width of the likelihood distribution and the range of a uniform prior which encloses the likelihood peak. The Bayes factor in favor of model as compared to is then
[TABLE]
The first term on the right hand side will in general be greater than one and will favor , as the simpler model is a special case of . However, the second term penalizes if it has a large range in priors.
The conceptual difficulty associated with the dependence of the Bayes factor on the adopted prior range is alleviated if one thinks of hypothesis as a specification of a model as well as the prior on its parameters. A model with a larger range in priors allows for a larger number of possible data sets consistent with the hypothesis as compared to a simpler model with narrow range of prior. Hence , being a normalized probability over possible data sets, will be lower as compared to (MacKay, 2003). Also, is more precise as a hypothesis as compared with .
If we have more free parameters in a model, the penalty term in the Bayes factor will be higher, being of the form . In this sense, the Bayes factor has a built-in safeguard to prevent overfitting (a model with a large number of free parameters will fit a given set of data better but will perform poorly when presented with new data).
{marginnote}\entry
BICBayesian information criterion \entryWBICWidely applicable Bayesian information criterion
Computing the Bayes factor or the Bayesian evidence is computationally challenging. Generally, the likelihood is peaked and confined to a narrow region in the prior range, but has long tails whose contributions cannot be neglected. Some commonly employed numerical techniques are (1) simulated annealing, (2) nested sampling, (3) Laplace’s approximation, (4) Lebesgue integration theory (Weinberg, 2012), and (5) the Savage-Dickey density ratio (Verdinelli & Wasserman, 1995). Two useful approximations of the Bayes free energy are
[TABLE]
Here, denotes expectation taken over the posterior distribution of . The case of corresponds to the Bayesian estimation of the posterior. The posterior can be sampled using an MCMC algorithm Assuming weak priors and that the posterior is asymptotically normal we have . WBIC is an improved version of BIC, which is also applicable for singular statistical models where BIC fails. A model is singular if the Fisher information matrix is not positive definite, which typically occurs when the model contains hierarchical layers or has hidden variables.
2.5.2 Predictive methods for Model comparison
A statistical model can be thought of as an approximation of the true distribution from which the observed data were generated. represents a set of independently observed data points such that . The Bayesian predictive distribution can then be defined as , while the maximum likelihood estimate is given by . Predictive methods judge models by their ability to fit future data , e.g., via the log-likelihood function . Given that we do not have future data, the idea is to measure out-of-sample-prediction error from the sample at hand. Cross validation is a natural way to do this, where we divide the current data set into training and testing samples. But this is computationally costly. Hence, alternate criteria have been developed. We start by computing the training error . However, this is a biased estimator of as the data are used twice, once to estimate the model and once more to compute the log likelihood of the data. If we have more parameters in the model, it will certainly fit the given data better but will also give rise to larger variance in the estimator, and we need to penalize the model for this. This variance, which represents the effective degrees of freedom in the model, can be calculated from the data and the model. A list of some useful information criteria based on the above idea are given below. They can be easily computed using samples of obtained by an MCMC simulation of the posterior . {marginnote} \entryAIC Akaike information criterion \entryDICDeviance information criterion \entryWAICWidely applicable Bayesian information criterion
[TABLE]
Here, denotes variance taken over the posterior distribution of . The first term is a measure of how well the model fits the observed data while the second term is a penalty for the degrees of freedom in the model.
In general, the predictive criteria have a well-defined information-theoretic interpretation (Burnham & Anderson, 2002; Watanabe, 2010). Specifically, the expected value of AIC and WAIC, is equivalent to the expected Kullback-Leibler divergence of the predictive distribution from the true distribution, the expectation is taken over the random realizations of the observed data set , which samples the true distribution . Also, in the asymptotic limit of large sample size, both AIC and WAIC are equivalent to leave-one-out cross-validation (LOOCV).
An extra parameter in a model need not necessarily contribute to extra variance in the predictive density, e.g., if we have informative priors on the parameter, the likelihood has a very weak dependence on the parameter or if the model is hierarchical then multiple parameters might be restricted. The use of AIC can be problematic in such cases. DIC and WAIC overcome this problem by estimating the effective degrees of freedom directly from the likelihood function of the data and samples of obtained from the posterior .
WAIC offers some additional advantages as compared to AIC and DIC. AIC and DIC use a point estimate for when computing predictive density, whereas WAIC uses the Bayesian predictive density. If a model is singular, criteria such as AIC, DIC and BIC do not work well. In contrast, WAIC works for such cases, and in the asymptotic limit of large sample size, WAIC is always equivalent to Bayesian LOOCV.
It is instructive to study the differences between BIC and AIC, as they represent two very different approaches to the problem of model selection (for a detailed discussion, see Burnham & Anderson, 2002). Due to the presence of the term, for the BIC penalizes free parameters more heavily as compared to AIC. So BIC is more parsimonious or cautious when it comes to admitting new parameters in a model. In situations where two models can give rise to the same predictive distribution, BIC will favor the model with fewer degrees of freedom while AIC will treat them equally. An example is nested models, where a simpler model can be considered as a special case of a complex model but with few of its parameters being fixed. Interestingly, AIC can also be argued to be using the approach of Bayes factors, but with a prior whose variance decreases with sample size , whereas BIC would correspond to the choice of a weak prior with fixed variance (Smith & Spiegelhalter, 1980).
To conclude, the Bayesian and the predictive methods both have their strengths and weaknesses. If the choice of priors is well justified, then the methods based on Bayes factor are best suited for model selection. However, if our aim is best predictive accuracy for future data, predictive methods like WAIC are a better choice.
3 Monte Carlo methods for Bayesian computations
Having discussed how to set up problems in the Bayesian framework, we now discuss methods to perform the inference, i.e., how to estimate the pdf of parameters given the data. Except for some simple cases, closed form analytical solutions are in general not available. So one makes use of Monte Carlo based methods to sample from the desired distribution. The most popular method to do this today is the Markov Chain Monte Carlo (MCMC) method. MCMC is a class of methods for sampling a pdf using a Markov chain whose equilibrium distribution is the desired distribution. Once we have a sample distributed according to some desired distribution, we can compute expectation values and integrals of various quantities in a process analogous to Monte Carlo integration. The word Monte Carlo in MCMC comes from the use of random numbers to drive the Markov process and the close analogy to Monte Carlo integration schemes. Note in conventional Monte Carlo integration, the random samples are statistically independent whereas in MCMC they are correlated. We first broach the theory behind Markov chains and then discuss specific MCMC methods based on it.
3.1 Markov Chain
A Markov chain is a sequence of random variables such that, given the present state, the future and past are independent. It is formally written as
[TABLE]
In other words, the conditional distribution of in future, depends only upon the present state . If the probability of transition is independent of , it is a time-homogeneous chain. Such a chain is defined by specifying the probabilities of transitioning from one state to another. To simplify mathematical notation, we sometimes consider the state space to be continuous and sometimes discrete. But the presented results are equally valid for either type of spaces. For a continuous state space where a probability density can be defined we can write the transition probability as
[TABLE]
For a discrete state space the transition probability is a matrix and is written as . On a given state space, a time-homogeneous Markov chain has a stationary distribution (invariant measure) if
[TABLE]
A Markov chain is irreducible if it can go from any state of a discrete state space to any other state in a finite number of steps, i.e., there exists an integer such that . If a chain having a stationary distribution is irreducible, the stationary distribution is unique, and the chain is positive recurrent. For an aperiodic, positive recurrent chain with stationary distribution , the distribution is limiting (equilibrium distribution). This means if we start with any initial distribution (a row vector specifying probability over states of a discrete state space) and apply the transition operator (a matrix) many times, the final distribution will approach the stationary distribution (a row vector),
[TABLE]
For an irreducible Markov chain with a unique stationary distribution , there is a law of large numbers which says that the expectation value of a function over approaches the average taken over the output of a Markov chain,
[TABLE]
This property allows one to compute Monte Carlo estimates of specific quantities of interest from a Markov chain. Techniques that do this are known as Markov chain Monte Carlo or MCMC.
A chain having a stationary distribution is said to be reversible if the chain starting from a stationary distribution looks the same when run forward or backward in time. In other words, if has distribution then the pair has the same joint distribution as .
[TABLE]
For the transition kernel this means
[TABLE]
and is known as the condition of detailed balance. For a Markov chain, it is not necessary to satisfy reversibility in order to have a stationary distribution. However, reversibility guarantees the existence of a stationary distribution, and is thus a stronger condition. This is the reason that most MCMC algorithms are designed to satisfy detailed balance.
3.2 Metropolis Hastings algorithm
The most general MCMC algorithm is the Metropolis-Hastings (MH) algorithm (Metropolis et al., 1953; Hastings, 1970). Suppose we are interested in sampling a distribution on a state space , with . To construct a transition kernel to go from to , MH algorithm uses a two step process:
- •
Specify a proposal distribution .
- •
Accept draws from with acceptance ratio .
So the transition kernel is given by . The full algorithm is as follows
The transition kernel of the MH algorithm is reversible and satisfies detailed balance, . Note the reversibility condition by itself does not lead to a unique form for the acceptance ratio and alternatives exist (Barker, 1965). However, it has been shown that the acceptance ratio of the MH algorithm results in a chain with the fastest mixing rate (Peskun, 1973).
There are multiple ways to construct the proposal distribution each leading to a new version of the MH algorithm.
- •
Symmetric Metropolis: which simplifies the acceptance probability to ; this is the version that was proposed by Metropolis and colleagues.
- •
Random walk Metropolis-Hastings (RWMH): ; the direction and distance of the new point from the current point is independent of the current point. Common choices are and .
- •
Independence sampler: ; i.e., the new state is drawn independent of the current state. The acceptance probability is given by , a generalization of the accept-reject algorithm. The quantity should resemble but with longer tails.
- •
Langevin algorithm: ; this is useful when the gradient is available.
Except when (uniform target density), the mean of the acceptance ratio is always less than unity. Decreasing in the RWMH algorithm increases but lowers the independence of the sampler. Increasing improves the independence but lowers . In the Langevin algorithm, one makes use of the information in the gradient to allow faster mixing of the chain.
3.3 Gibbs sampling
The Gibbs sampler introduced by Geman & Geman (1984) is one of the most popular computational methods for doing Bayesian computations. Suppose we want to sample where . In Gibbs sampling, the transition kernel is split into multiple steps. In each step, one coordinate is advanced based on its conditional density with respect to other coordinates. The algorithm is as follows:
The full transition kernel is written as,
[TABLE]
Similarly one can define a reverse move,
[TABLE]
It can be easily shown that
[TABLE]
Integrating both sides leads to
[TABLE]
Thus, is the stationary distribution of the Markov chain formed by the transition kernel . Note the Gibbs sampler as given above (systematic scan) is not reversible. However, the reversible ones can easily be produced, e.g., at each iteration picking a random component to update (random-scan). The random-scan Gibbs sampler can be viewed as a special case of MH sampler with acceptance ratio . It follows that
[TABLE]
Here, and , as only the -th component is changed in each step.
3.4 Metropolis within Gibbs
One problem with the Gibbs sampler is that it requires one to sample from the conditional distributions which can be difficult. In such cases, one can replace the sampling of conditional densities with the MH step. This then becomes the Metropolis within Gibbs (MWG) scheme (see Müller, 1991), which is shown in Algorithm 3 (it is implemented in the code that we provide).
Rather than updating all the variables step by step, one can also choose to update a subset of variables together, leading to block updates. The fact that the full sampling of a complicated distribution can be broken up into a sequence of smaller and easier samplings, is the main strength of the Gibbs sampler and has resulted in its widespread use (e.g. Sale, 2012; Sharma et al., 2014).
3.5 Adaptive Metropolis
The efficiency of the MH algorithm depends crucially upon the proposal distribution. By efficiency we typically mean how independent are the samples. If the samples are not independent then they have high correlation. For Markov chains, the correlation falls off with distance between samples. If the correlation is large, this means the mixing in the chain is slow. If the width of the proposal distribution is too small, the acceptance ratio is high but the chain mixes very slowly. If the width of the proposal distribution is too large, the acceptance ratio is too small and the chain again mixes slowly (see Figure 5 for an illustration of this effect). Gelman, Roberts & Gilks (1996) showed that optimal covariance matrix for the RWMH algorithm using the multivariate normal distribution is , where is the dimensionality of the space and is the covariance matrix of the target distribution . The optimal acceptance ratio is 0.44 for dimension and then falls off with increasing number of dimensions reaching an asymptotic value of for . The convergence is quite fast ([0.441, 0.352, 0.316, 0.279, 0.275, 0.266] for [1, 2, 3, 4, 5, 6]). The efficiency as compared to independent samples is .
These results suggest a possible way to choose the optimal proposal distribution. Estimate the covariance matrix by a trial run and then use it for the actual run. Even doing this is cumbersome as it is unclear how long the trial run should be. To circumvent this, Haario, Saksman & Tamminen (2001) proposed an adaptive scheme in which is updated on the fly using past values. Naively, any scheme that uses proposals that depend upon the full past history violates the Markovian property, i.e., the future should only depend on the present and should be independent of the past. The trick is to adapt the proposal distribution in such a way that it converges to the optimal one. The resulting chain then also converges to the target distribution. Andrieu & Robert (2001) showed that such a scheme can be described as part of a more general adaptive framework.
At the heart of most adaptive algorithms is the Robbins & Monro (1951) recursion. They proposed an iterative stochastic algorithm to find roots of functions that are stochastic, i.e., their algorithm solves , where instead of the function available is , which is stochastic and is such that . Starting with some initial value the algorithm to get the th iterate is
[TABLE]
Here is a sequence of positive steps. The then converge to the true solution provided the sequence satisfies
[TABLE]
The first condition makes sure that irrespective of where we start, the solution can be reached in a finite number of steps. The second condition makes sure that we do converge. A possible choice of is where .
A nice description of various adaptive algorithms is given by Andrieu & Thoms (2008). Below we discuss Algorithm 4 from their paper which is quite general and is implemented in the software that we provide.
If is too small, the convergence is too slow; if is too large the convergence is too fast and the simulation can quickly lean towards a wrong solution and will take a long time to get out of it. For adaptive MCMC, we find a choice of to be satisfactory for most test cases. Figure 5d shows an adaptive MCMC chain obtained using Algorithm 4. The adaptive chain looks very similar to the ideal case shown in Figure 5b, and this demonstrates the usefulness of the adaptive MCMC scheme.
3.6 Affine invariant sampling
An elegant solution to the problem of tuning the proposal density is to use the idea of ensemble samplers (Gilks, Roberts & George, 1994). Here multiple chains (walkers) are run in parallel but allowed to interact in such a way that they can adapt their proposal densities. Goodman & Weare (2010) provide a general purpose algorithm to do this, known as the affine invariant sampler (see also Christen, 2010). A python implementation of this (emcee: the MCMC hammer, http://dan.iel.fm/emcee/current/) is provided by Foreman-Mackey et al. (2013) and is widely used in astronomy. We now describe this algorithm.
We saw in the previous section that adapting the proposal density can violate the Markovian property of a chain. The trick lies in using the information available in the ensemble but in a way that does not violate the Markovian property. This is achieved by using the idea of partial resampling which is a generalized version of the Gibbs sampling procedure. Let us consider an ensemble of walkers and a Markov chain that walks on a product space with distribution . Then if is updated conditional on other walkers (complementary set of walkers), but satisfying detailed balance , then each walker samples from .
One way to do this is to choose a point from and a scalar with density , and propose a new point as
[TABLE]
The inverse transformation is given by . Now if we want the proposal to be symmetric then , and this implies . A good choice of such a function is
[TABLE]
To satisfy detailed balance, the acceptance probability is given by . The factor is because the proposal is restricted along a line and not the full hypersphere over the actual space. This means an appropriate Jacobian has to be calculated; for details see Gilks, Roberts & George (1994) and Roberts & Gilks (1994) (the proof is much easier when using the reversible jump MCMC formalism of Green (1995)).
Moves other than the stretch move can also be constructed, e.g. a proposal , where has a covariance computed from a subset of walkers in the complementary sample. It is also possible to construct algorithms which use a combination of both the stretch and the walk move. Although the Goodman & Weare (2010) affine invariant algorithm elegantly solves the problem of choosing a suitable proposal distributions, it has one drawback. The computational cost of warm-up scales linearly with the number of walkers. Note, like most other MCMC algorithms, multimodal distributions (distributions with many well separated peaks) also pose a problem for this algorithm.
3.7 Convergence Diagnostics
Having studied MCMC methods in order to sample from distributions, we now discuss how to detect convergence; i.e., how long should we run an MCMC chain. Several convergence diagnostics have been proposed in the literature. Cowles & Carlin (1996) provide a good review of 13 convergence diagnostics. Other reviews include Brooks & Gelman (1998) and Robert & Casella (2013). Unfortunately, because there is no method to detect convergence, we can only detect failure to converge. So convergence diagnostics are necessary conditions but not sufficient. Below we present two schemes to monitor convergence. The first scheme makes use of the correlation length of the chain to compute the effective number of independent samples in a chain. The second scheme makes use of multiple chains to see if they are converging.
3.7.1 Effective sample size
Let us begin by estimating how many independent samples we need to get reliable estimates of mean and variance of a quantity. For a posterior of some variable with standard deviation , the Monte Carlo standard error goes as for sample of size . So to measure the mean of a quantity with about 3% error as compared to the overall uncertainty we need . Raftery & Lewis (1992) showed that to measure quantile to within with probability 0.95 requires about 4000 independent samples.
However, the MCMC is not an independent sampler. As we have seen, the points in an MCMC chain are correlated. Autocorrelation provides a measure of this. Autocorrelation for a sequence is the correlation between two points separated by a fixed distance ; i.e.
[TABLE]
An automatic windowing procedure is discussed by Sokal (1997) for the computation of integrated autocorrelation (see also Goodman & Sokal, 1989; Goodman & Weare, 2010). Typically the autocorrelation falls off exponentially as and is known as the correlation time (or correlation length). The integrated autocorrelation is defined as . The variance of the mean of for a sample of size can be shown to be
[TABLE]
So for correlated samples the variance is times larger than the variance of independent samples. Using , one can measure the number of effective independent samples in a correlated chain also known as the effective sample size (ESS) as and then use it to decide if we have enough samples (e.g., ESS).
3.7.2 Variance between chains
The most widely used criterion for studying convergence was first presented by Gelman & Rubin (1992). Let us suppose we have chains each consisting of iterations out of which we use only the last iterations. For any given scalar parameter of interest , let
[TABLE]
The index runs over points in a chain, and the index runs over the chains. Then the between chain variance and the mean within chain variance can be written as
[TABLE]
The total variance for the estimator can be written as a weighted average of and , . If we account for the sampling variability of the estimator , then this yields a pooled variance of
[TABLE]
for the mixture of chains. If the initial distribution is over-dispersed, then and always overestimates the true variance . For any finite , is expected to be less than , as individual sequences in a chain would not have had the time to explore the full target distribution. So, initially we expect . However, in the limit , the variance between chains, which is expected to fall off as , goes to 0 and will approach the true variance , making approach 1. Therefore the ratio , also known as the potential scale reduction factor, can be used to monitor the convergence.
3.7.3 Thinning
For making inferences from an MCMC chain, some algorithms use only the -th iteration of each sequence such that successive draws are approximately independent, a process known as thinning. However, there is no additional advantage of thinning other than savings in storage. Since we are throwing away information, an estimate from a thinned chain can never be better than the original chain (Geyer, 1992; MacEachern & Berliner, 1994). Moreover, it is difficult to choose an appropriate without studying the autocorrelation of the full chain. So thinning is useful only in situations where the autocorrelation is known a priori and is known to be large. Here again should be chosen such that it is smaller than the autocorrelation length, to retain as much information as possible.
3.8 Parallel Tempering
Multimodal distributions in general pose problems for all MCMC algorithms. Parallel tempering is one way to address this problem. It is a type of ensemble sampler where multiple chains are simulated in parallel but are allowed to exchange information. Each chain has a target distribution different from the other and is controlled by a parameter known as the temperature. Let be the actual target distribution, then a ladder of distributions
[TABLE]
is created, controlled via the parameter , such that . is set to 1. Hence, represents the target distribution. The temperature broadens the target distribution and allows a wider exploration of the parameter space which makes it useful to explore multimodal distributions. To exchange information between the chains, a state swapping procedure is used. A swap is proposed between a randomly chosen chain and its neighbors and with probability and . Naively, accepting the swap will violate the detailed balance condition. So the swap proposal is accepted with probability
[TABLE]
which satisfies detailed balance.
In parallel tempering the temperature ladder needs to be chosen carefully. If the neighboring temperatures are too far apart, the acceptance rate will be diminished leading to slow mixing. If the neighboring temperatures are too close, a large number of elements in the ladder will be required to explore a wide range in parameter space, and this can increase the computational cost significantly. However, by exploiting the trial runs, a suitable ladder can be constructed (Liang, Liu & Carroll, 2011).
The idea of parallel tempering can be generalized to construct evolutionary algorithms that incorporate features of genetic algorithms into the framework of MCMC. The basic idea is to have parallel chains as in parallel tempering and allow exchange of information while satisfying detailed balance on the product space defined by the chains. The exchange of information is based on ideas of mutation and crossover from genetic algorithms (Liang & Wong, 2001a, b).
3.9 Monte Carlo Metropolis Hastings
In MCMC based Bayesian inference, we are concerned with simulating samples from some pdf . However, there are situations when cannot be easily evaluated or is not available in an analytically tractable form. In such situations one can make use of Monte Carlo based techniques to approximately evaluate . More generally, the Metropolis Hastings ratio is used to update an MCMC chain. In such techniques, typically, one generates a set of auxiliary samples conditioned on and then uses them to compute (an approximation of ) or (an approximation of the ratio ). However, Monte Carlo based estimates are stochastic and special care is needed when working with them in an MCMC scheme. An algorithm to make use of Monte Carlo based estimates inside a Metropolis Hastings algorithm is given in Algorithm 5 (it is implemented in the software that we provide).
There are many variants of Algorithm 5, depending upon how and at what stage the auxiliary sample is generated see Chapter 4 in Liang, Liu & Carroll (2011). The invariant stationary distribution of such Markov chains is not necessarily the target density . The characteristics of such chains and their convergence properties are discussed by Beaumont (2003) and Andrieu & Roberts (2009). In Algorithm 5, the auxiliary sample is refreshed in each iteration and the same sample is used to estimate both and . This makes Algorithm 5 more robust compared to other similar alternatives. In classical MCMC, one can reuse the previous estimate of when computing . However, when the Metropolis Hastings ratio is stochastic, if is not evaluated in each iteration using a fresh sample of , then the MCMC chain tends to get stuck at a stochastic maxima of the estimated likelihood (Sharma et al., 2014). The smaller the size of the auxiliary sample, or the more inaccurate the Monte Carlo estimate of , the worse is this problem. Using the same sample to estimate both and leads to lower noise in the estimated ratio of . This property was also noticed by McMillan & Binney (2013) in the context of fitting models of the gravitational potential of the Milky Way to spatio-kinematic data of stars orbiting inside it. Two specific cases where the above algorithm can be used are given below.
3.9.1 Unknown normalization constant
In fitting a model to data, we are interested in sampling . To do this, the function should be properly normalized over the data space, in the sense that . However, on many occasions, we have
[TABLE]
where is known but the normalization constant is not known. An example is the problem of fitting a density profile ( being the Galactocentric distance) to a sample of stars with Galactic latitude , longitude and heliocentric distance kpc. Here we have .
Our aim is to compute the Metropolis Hastings ratio that is used to advance an MCMC chain, and it is the ratio that is unknown. If one can sample exactly from , then it is possible to cancel the normalization constant using ingenious algorithms by Møller et al. (2006) and Murray, Ghahramani & MacKay (2006). However exact sampling is not always feasible. In such cases a Monte Carlo estimate of the ratio of the unknown normalization constant can be done using samples generated from density , such that
[TABLE]
This sampling can be done by various means, e.g., exact sampling, MCMC, and rejection sampling. If is difficult to sample from, one can use so-called “importance sampling” by drawing samples from a distribution that is easy to sample from. The required ratio of normalization constants is then given by
[TABLE]
and the MH ratio is given by .
3.9.2 Marginal inference
Here we are interested in the marginal density , but the integral may not be analytically tractable and may also be difficult to do by deterministic schemes. In such situations, the integration can be done by Monte Carlo importance sampling, using auxiliary samples generated from some density that is easy to sample from. Thus we have
[TABLE]
3.10 Hamiltonian Monte Carlo
One of the attractive features of MCMC for sampling pdfs is its better performance for higher dimensions. However, for very large dimensions, traditional MCMC algorithms start running into problems. While for lower dimensions, a typical set of the posterior (e.g. region encompassing 99% of the total probability) lies close to the center, for higher dimensions, a typical set lies in a shell that has a very large volume. Since, a shell cannot be traversed with large step sizes, it takes a long time to explore the posterior.
Hamiltonian Monte Carlo (HMC) tries to address this problem by introducing an auxiliary variable called momentum for each real variable called position (Duane et al., 1987; Neal, 1993). The log of posterior (target density) is assumed to define the potential energy , and the momenta define the kinetic energy . Together they define the Hamiltonian , where . The distribution to be explored is
[TABLE]
Next, principles of Hamiltonian dynamics are used to advance a given point to a new location. The point is then accepted or rejected based on the MH algorithm. The use of Hamiltonian dynamics to advance a given point allows the point to travel to locations which are far from its current location. This allows faster exploration of the parameter space.
There are two major obstacles involved with using HMC, and this has prevented its widespread use. First, it requires the gradient of the target density. Secondly, it requires two extra parameters to be tuned by the user: a step size to advance from the current state and the number of steps over which to evolve the Hamiltonian system. Considerable progress has been made to address both these issues.
The automatic/algorithmic differentiation can be used to accurately compute the derivatives of a given function without any user intervention (Griewank & Walther, 2008). The idea is that any function written as a computer program can be described as a sequence of elementary arithmetic operations, and then by applying the chain rule of derivatives repeatedly on these operations, the derivatives can be computed. Alternatively, one can create analytical functions to approximate the target density and use these to compute the derivatives. This is because the exact Hamiltonian is only required when computing the acceptance probability and this does not require derivatives. For simulating the trajectory, one needs derivatives and here one can use an approximate Hamiltonian (Neal, 2011). An application of HMC for fitting cosmological parameters is given by Hajian (2007) and Taylor, Ashdown & Hobson (2008). Homan & Gelman (2014) provide additional algorithms for automatic tuning of step size and the number of steps , known as the No-U-Turn Sampler. This is used in the open-source Bayesian inference package Stan (available at http://www.mc-stan.org).
3.11 Population Monte Carlo
Population Monte Carlo is an iterative importance sampling technique that adapts itself at each iteration and produces a sample approximately simulated from the target distribution. The sample along with its importance weights can be used to construct unbiased estimates of quantities integrated over the target distribution. Suppose is a quantity of interest. One of the major applications for MCMC applications is to compute integrals like . In importance sampling, this is replaced by
[TABLE]
where are sampled from a distribution which is easier to sample than . The closer the importance function to the target distribution, the better the quality of the estimate (lower variance). In practise it is difficult to guess a good importance function.
The main idea in population Monte Carlo is to start with a reasonable guess of the importance function and then iteratively improve by making use of the past set of samples . The importance function can adapt not only in time (with each iteration), but also in space, and can be written in general as . Suppose are the set of points at iteration . Let be produced from importance distribution . An estimate of is then given by
[TABLE]
Thus the expectation value of any function computed using importance sampling is unbiased, i.e.
[TABLE]
Here is distribution of and the equality is valid for any .
A simple choice for the importance function is to have set it as a mixture of normal or -distributions, e.g., (Cappé et al., 2008). This has been used for cosmological parameter estimation (Wraith et al., 2009) and model comparison (Kilbinger et al., 2010).
3.12 Nested Sampling
In Section 2.5, we saw that computing the evidence is computationally challenging. Nested sampling (Skilling, 2006) is designed to ease this computation. To compute the evidence, we are interested in computing quantities like
[TABLE]
Integration is basically chopping up the full space into small volume elements and summing the contribution of the integrand over these cells. We are free to chop up the volume and order or label the cells as we wish. So we divide the space by iso-likelihood contours and define a variable to label them. A convenient choice is the prior probability mass enclosed by an iso-likelihood contour, i.e.
[TABLE]
If the the prior probability is normalized, then it ranges from 0 for the highest likelihood, to 1 for the lowest likelihood. Given the above definition, we can also define an inverse function , which is the likelihood that encloses a probability mass of . So the integral for can now be written as .
Suppose we generate samples uniformly from the prior distribution. Next, we sort them in decreasing sequence of to give prior mass . Then using trapezoidal rule, one can easily perform the numerical integration. However, a significant contribution to the integral comes from a region with small prior mass . So, the integral should be done in equal steps in rather than . This can be done using an iterative procedure. We start with a set of points drawn from the prior. At each iteration, let be the point with lowest ; we replace it in set with a new point drawn uniformly from the prior but satisfying . This generates a sequence of for which the expected .
Nested sampling is widely used for cosmological model selection and parameter estimation. Three publicly available packages based on nested sampling are CosmoNest (Parkinson, Mukherjee & Liddle, 2006; Mukherjee, Parkinson & Liddle, 2006, see https://github.com/dparkins/CosmoNest), MultiNest (Feroz, Hobson & Bridges, 2009, see https://ccpforge.cse.rl.ac.uk/gf/project/multinest) and DNEST (Brewer, Pártay & Csányi, 2011, see https://github.com/eggplantbren/DNest4).
4 Bayesian hierarchical modelling (BHM)
In the simplest setting, we have some observed data generated by some model having parameters which can be inferred using the Bayes theorem as
[TABLE]
where denotes our prior knowledge or belief about . If the model parameters depend upon another set of parameters through , then and can be inferred using
[TABLE]
The variable is known as the hyperparameter and , the distribution of the hyperparameter, as a hyperprior. Alternatively, the observed data may depend upon another set of hidden variables , which in turn depend on . The inference of and can then be established using
[TABLE]
Such situations lead to hierarchies and Bayesian models of this type are known as hierarchical models. It turns out that hierarchies are quite common in real world applications, often where more than two levels exist, and Bayesian hierarchical modelling provides a framework for capturing this.
Let us consider a simple example, for details see Gelman et al. (2013). Suppose we observe some data (a set of measurements of some variable ) with uncertainty , and we are interested in the mean . Now suppose that the data are grouped into independent groups, and we have reason to believe that the group mean varies from group to group. For observations within a group , our model is
[TABLE]
where we denote by an observation belonging to group . We now compute the group mean , instead of global mean , to capture the variation of mean across groups. A global mean is certainly an inaccurate description of data, whenever the group mean is far away from the global mean. However, if the number of data points in a group is very small, e.g., , then the uncertainty in the group mean is large and it is much better to trust the global mean than the group mean.
Bayesian hierarchical modelling provides a natural way to handle the above problem of group means. It can act like a middle ground between the two extremes, global mean versus group mean. To demonstrate this, we set up the above problem using a Bayesian hierarchical model. Suppose the group means are distributed according to a normal distribution
[TABLE]
where and are unknown parameters of the model. The , and the group means can then be inferred from data using
[TABLE]
We generated synthetic data with , , , and ; we then estimated , and (assuming flat priors for and ). The results are shown in Figure 6. The BHM based group mean estimates are systematically shifted with respect to standard group mean estimates ( computed from the data points in a group). The BHM estimates are closer to the global mean than the standard estimates. The shift between the two estimates is more for cases where the error bars are large. The BHM estimates also have smaller error bars. This is because , when estimating the group mean, in addition to points within a group the BHM model also makes use of information available from other groups.
4.1 Expectation maximization, data
augmentation and Gibbs sampling
The easiest way to analyze a Bayesian hierarchical model is via Gibbs sampling, and the motivation for doing this was provided by the the expectation maximization (EM) algorithm. In fact, the EM algorithm led to the development of the DA algorithm, which in turn provided the idea to use Gibbs sampling to solve Bayesian hierarchical models.
Hence we begin by exploring the EM algorithm (Dempster, Laird & Rubin, 1977) which is one of the most influential algorithm in the field of statistics. Let us suppose that we have some observed data generated by some model having parameters . We want to compute the most likely parameters of the model given the data, i.e., . The full model is specified by with , where are variables which are either missing or hidden or unobserved. The EM algorithm solves this problem as follows. The algorithm has two steps. It starts with a fiducial value of , then does the following at every iteration .
- •
E-step: Compute . In other words, it computes the expectation of the log likelihood with respect to .
- •
M-step: Find the value of that maximizes and set .
These steps are repeated iteratively until . The proof that the EM algorithm increases the likelihood at each stage is as follows. The conditional density of the missing data given the observed data and the model parameter is given by
[TABLE]
Taking the and then the expectation with respect to , we get
[TABLE]
which is valid for any . Using this result, we can compute the difference
[TABLE]
Due to the M-step, . Also, from Gibbs’ inequality, . This means that each EM iteration is guaranteed to increase the marginal likelihood . This guarantees a convergence towards a maximum, but not necessarily a global maximum. The algorithm can still get stuck at a saddle point, or a local maximum.
The EM algorithm as presented above is deterministic. In general, it is not always easy to compute the expectation value, as it involves integrals over high dimensions. A general way to compute the , would be to draw random samples of from distribution and take its mean. We label this stochastic estimate , which in the limit is same as . Having computed , the M-step can proceed as usual to maximize it and compute a new . In fact can be set to 1. This is the stochastic version of EM (SEM) as given by Celeux & Diebolt (1985). Because of stochasticity, one does not get a unique answer but instead a distribution. In fact, SEM generates a Markov chain, which under mild regularity conditions converges to a stationary distribution. The algorithm has an additional advantage in that it is less likely to get stuck at a local maximum.
If we now replace the M-step with a draw of from the , this becomes a fully stochastic method ; this is, as previously mentioned the DA algorithm of Tanner & Wong (1987). This is equivalent to a two-step Gibbs Sampler for sampling from
[TABLE]
Sample from . 2. 2.
Sample from .
From the properties of the Gibbs sampler, we know that the sequence of forms a Markov chain that samples . Although Gibbs sampling requires sampling from the conditional distribution, the inner step can be replaced by MH sampling, leading to the Metropolis-within-Gibbs method as discussed in Section 3.3. This provides a completely general scheme for handling missing data.
Finally, the DA algorithm is not limited to just missing variables of the data, but can also be applied to unknown parameters of the model, e.g., in
[TABLE]
Such dependencies are common in Bayesian hierarchical modeling. In general, the Bayesian hierarchical modeling provides a framework for handling marginalization in Bayesian data analysis, i.e., handling parameters or variables that are either unknown or missing but are necessary to model the data.
4.2 Handling uncertainties in observed data
Marginalization is not limited to handling missing data. It can also be used to handle data with uncertainty . Consider
[TABLE]
where is the true values of the observed data . Here again, instead of doing an integration, one treats the true values as unknowns and sample them using the Gibbs scheme. We demonstrate this with a simple example where is the model that generates the data, and are the unknowns which we wish to evaluate. The data has uncertainty described by another Gaussian function .
For this simple case, the integral in Equation (83) leads to an analytical expression
[TABLE]
We used and to generate test data and then estimated and using two schemes: (1) DA algorithm which uses Equation (84) and treats as unknown and samples from it, and (2) explicit integration scheme which uses Equation (85) where the variable has been integrated out of the equation. The Markov chain was run for 100,000 iterations. Figure 7 shows the pdf of the estimates of the two parameters. Both schemes give identical results. The autocorrelation function for the two parameters are shown in Figure 7. The DA algorithm has a slightly higher autocorrelation time as it has to sample an extra parameter for each data point.
5 Case studies in astronomy
In this section, we study a range of cases in astronomy where MCMC based Bayesian analysis is making a significant impact. The emphasis is on showing how to set up a diverse range of problems within the Bayesian framework and how to solve them using MCMC techniques. The examples are intentionally chosen from different areas of astronomy so as to demonstrate the ubiquity of the techniques reviewed here. There is a long history of applying such techniques in the field of cosmology, and excellent reviews and books already exist here: Trotta (2008); Hobson (2010); Parkinson & Liddle (2013).
5.1 Exoplanets and binary systems using radial
velocity measurements
The presence of a planet or a companion star results in temporal variations in the radial velocity of the host star. By analyzing the radial velocity data, one can draw inferences about the ratio of masses between the host and the companion, and orbital parameters like the period and eccentricity. We now describe how to set up the above inference problem in a Bayesian framework. We begin by describing the predictive model for the radial velocity of a star in a binary system.
The radial velocity of a star of mass in a binary system with companion of mass in an orbit with time period , inclination and eccentricity is given by
[TABLE]
The true anomaly is a function of time, but depends upon , , and via,
[TABLE]
An example of radial velocity data is shown in Figure 8 which shows the radial velocity for two binary systems (the green and the red line) that differ in but have same values for all other parameters and . The figure demonstrates that the radial velocity is sensitive to the eccentricity of the orbit. {marginnote} \entrythe mean velocity of the center of mass of the binary system \entrythe inclination of the orbital plane with respect to the sky (angle between orbital angular momentum and line of sight) \entry the angle of the pericenter measured from the ascending node (the point where the orbit intersects the plane of the sky) \entry time of passage through the pericentre
The actual radial velocity data will differ from the perfect relationship given in Equation (86) due to observational uncertainty (variance ) and intrinsic variability of a star (variance ) and we can model this by a Gaussian function . For radial velocity data defined as a set of radial velocities at various times , one can fit and constrain seven parameters, , using the Bayes theorem as shown below
[TABLE]
We generated test data using Equation (86) and then, using the above equation, we tried to recover the parameters (available in the supplied software). The posterior distribution was sampled using MCMC, and the results are shown in Figure 8. Panel shows the test data along with the best fit curve. It also shows the radial velocity for the case with . Panel shows the posterior distribution of the parameters and .
If we have data for a large number of binary systems, we can use it to explore the distribution of orbital parameters. A naive way to do this would be to get a “maximum a posteriori” (MAP) estimate of the orbital parameters for each star and then study the population distribution by constructing histograms out of it. Such a scheme will give incorrect estimates of the population distribution as the uncertainty associated with the parameter estimates is ignored. In addition to this, as discussed by Hogg, Myers & Bovy (2010), the MAP estimates are in general biased. In the context of radial velocity data, the estimates of are biased high. The problem is especially acute if the uncertainty associated with the parameters is large, which is often the case with radial velocity data from barycentric motions.
All of these problems can be avoided by setting up the problem of estimation of population distributions as a hierarchical Bayesian model. Let us suppose we have radial velocity data for binary star systems, and denote by the radial velocity data set for the -th system. Let be the orbital parameters for the -th system. Finally, let be the set of hyperparameters that govern the population distribution of the parameters . The problem to determine can be set up as
[TABLE]
This is a BHM and can be sampled using the Metropolis-within-Gibbs scheme discussed in Section 4.1. The parameters can be estimated alongside , and to get the marginal distribution , one can simply ignore the computed .
However, the above scheme is not well suited to explore a variety of population models, especially if sampling from is computationally demanding. We now show a computationally efficient scheme by Hogg, Myers & Bovy (2010) that can in general be applied to BHMs of two levels. The marginal distribution of hyperparameters that we are interested in is given by
[TABLE]
The integral on the right hand side can be estimated using a Monte Carlo integration scheme as follows:
[TABLE]
with sampled from , which can be done by an MCMC scheme.
5.2 Data driven approach to estimation of stellar parameters from a spectrum
The spectrum of a star contains information about its properties like temperature, gravity and the abundance of different chemical elements that make up the star. Decoding information about stellar parameters from a stellar spectrum is a problem of great significance for astronomy. With the advent of large spectroscopic stellar surveys having several hundred thousand spectra, the need for fast and accurate methods to analyze the stellar spectra has gained prominence. Let us denote the stellar parameters (e.g., and ) by label vector and the observed spectrum by vector , denoting normalized flux at specific wavelengths indexed by (see Figure 9). The problem is to find given , which using the Bayes theorem can be written down as
[TABLE]
Here denotes a probabilistic generative model for the data, with being the parameters of the model. If we denote by the flux predicted by the model at wavelength and by the variance or scatter about this relation (assuming Gaussian noise), then the probabilistic generative model for the full spectrum can be written as
[TABLE]
Traditionally, is calculated from first principles using a physical theory for the formation of spectral lines in a stellar atmosphere specified by stellar parameters . Frequently, is evaluated on a grid defined on and then interpolation is used to get the spectrum for any arbitrary value of . The can also be computed by interpolating over a library of empirical spectra with predefined stellar parameters. A more refined data driven approach to the problem using machine learning techniques was presented in Ness et al. (2015). In this approach, is approximated by a simple (linear or quadratic) function of label vector . Therefore
[TABLE]
Let us consider a training set of stars with label vectors and corresponding set of fluxes at wavelength by . One can estimate by sampling within MCMC such that
[TABLE]
Having obtained the model parameters , one can now estimate stellar parameters of a new star with given spectrum using Equation (92). This is the basis of The Cannon algorithm (Ness et al. 2015) which is already widely used by the stellar community. The ability of the algorithm to model the spectra is demonstrated in Figure 9 which shows the spectra of fours stars along with the best-fit spectra for each of them.
5.3 Solar-like oscillations in stars
Solar-like oscillations, which are excited and damped in the outer convective envelopes of a star, are seen in stars like the Sun and red giants. With the advent of space-based missions like Kepler and COROT that provide high quality photometric data over a long time series, it has now becomes feasible to detect solar-like oscillations in tens of thousands of stars (Stello et al., 2013, 2015). Typically, the power spectrum of a star with solar-like oscillations (Figure 10) shows a regular pattern of modes, characterized by a large frequency separation . The overall amplitude is modulated by a Gaussian envelope and this is characterized by the frequency of maximum oscillation . Theory suggests that for a given star is related to its density (Ulrich, 1986), whereas the is related to its surface gravity and temperature (Brown et al., 1991; Kjeldsen & Bedding, 1995). Using the above two relations, the mass and the radius of a star can be constrained. The mass of a red giant is sensitive to its age and this makes asteroseismology very useful for understanding Galactic evolution (Chaplin et al., 2011; Sharma et al., 2016). For further details on solar type oscillations see review by Chaplin & Miglio (2013).
Bayesian-MCMC based techniques are increasingly being adopted to extract seismic properties, e.g., and , by analyzing the power spectrum generated from the time series photometry of a star (Gruberbauer et al., 2009; Kallinger et al., 2010; Handberg & Campante, 2011). The probability that an observed power spectrum at frequencies is produced by a model spectrum (specified by a set of parameters ), is given by
[TABLE]
as shown by Duvall & Harvey (1986). This forms the basis for the Bayesian treatment of the problem of estimation of parameters by . The power density is modelled as a sum of super-Lorentzian functions
[TABLE]
To fit the individual modes, one assumes Lorentzian profiles. Spherical harmonics are used to describe the oscillations; the modes are characterized by three wave numbers, and . In Kallinger et al. (2010), eight main modes are fitted (three and and two ), parameterized by the mode lifetime , the central frequency , three spacings and , and the amplitudes and .
[TABLE]
Figure 10 shows the result of fitting the above model to power spectra of fours stars observed by the Kepler mission.
5.4 Extinction mapping and estimation of intrinsic
stellar properties
Given the mass and initial composition (e.g., metallicity [M/H]) of a star, we can use the theory of stellar evolution to predict its state and composition at a later time (age ). However, the intrinsic parameters like mass , [M/H] and are not directly observable. For most stars we only have photometric information, apparent magnitudes in different photometric bands (for example and ). The photometry of a star depends upon temperature , gravity , [M/H], distance and extinction (proportional to the dust density integrated along the line of sight to the location of the star). If we have spectroscopy, then we can get temperature , and even composition, but with uncertainties. From asteroseismology, we can get average seismic parameters like and , which are sensitive to the mass, radius and temperature of a star. Given this state of affairs, it is quite common to ask the question that, given a certain set of observables of a star, what are the intrinsic parameters of a star or even some other set of observables. For example, given the photometry of a star, what is the distance, temperature and gravity of a star; or given photometry and distance, what is the temperature and gravity of a star; or given photometry and spectroscopy, what is the distance? And so on. Knowing the intrinsic parameters of a star is also important for understanding the formation and evolution of the Galaxy, for example, the star formation rate, the age-metallicity relation and the distribution of dust in the Galaxy.
The problem of estimating intrinsic stellar parameters of a star given some observables can be formulated as follows. Let be a set of observables associated with a star and their uncertainties. Let us denote the intrinsic variable of a star that we are interested in by . To specify prior probabilities on we need a Galactic model, and we denote by the parameters of such a model. Typically, real catalogs have selection effects, e.g., stars selected to lie in some apparent magnitude and color range, or a set of stars with parallax error less than 10%, or stars with missing information in certain bands. To specify selection effects, we denote the event that a star exists in a catalog by . From theoretical isochrones we can predict given , in other words a function exists. However, we are interested in the inverse problem of estimating given . A Bayesian introduction to solving such a problem was given by Pont & Eyer (2004) and Jørgensen & Lindegren (2005) in the context of estimating ages. The method was further improved and refined by Burnett & Binney (2010); Burnett et al. (2011) and Binney et al. (2014) in the context of the estimation of distances, with a better treatment of priors and selection effects (see also Sale, 2012, 2015). From the Bayes theorem we have
[TABLE]
We now explain each of the terms in detail.
is the posterior distribution of intrinsic parameters given the observables, the selection function and a Galactic model. 2. 2.
is the selection function. This says given the observables what is probability that a star was observed. Typically this can be expressed as . The term enters in situations where the value of an observable is not known but constraints on it are. Then . For example, a parallax of a star is known to be greater than a certain limit, or the apparent magnitude of a star may be missing in a band because the star is too bright or faint (Burnett & Binney, 2010; Sale, 2012). 3. 3.
is the likelihood of the data given the uncertainty and the intrinsic parameters. This can be described by a Gaussian function for each . 4. 4.
is the prior. This describes the distribution of mass, metallicity, age and spatial distribution of stars in the Galaxy. More specifically it can be written as , where the sum is over different Galactic components, e.g., thin disc, thick disc, bulge and stellar halo.
We now focus on the problem of estimating distance and extinction. For simplicity, we ignore the selection effects; for an in depth discussion, see Sale (2015). By marginalizing over stellar parameters and one obtains . If we have stars along a line of sight, we can estimate the distance-extinction relationship parameterized by as
[TABLE]
The above method is used by Green et al. (2014, 2015), to construct three dimensional maps of interstellar dust reddening using Pan-STARRS 1 and 2MASS photometry (Figure 11). To estimate , Green et al. (2015) do a kernel density estimate over samples generated by MCMC, while Sale & Magorrian (2015) present a method based on the Gaussian mixture model. As described in Sale (2012), we can also directly estimate and intrinsic parameters of each star along a line of sight by setting up the problem as a BHM and sampling from the following posterior:
[TABLE]
The Metropolis-within-Gibbs scheme is used to accomplish this sampling.
5.5 Kinematic and dynamical modelling of the Milky Way
Understanding the origin and evolution of the Milky Way has received significant boost due to the emergence of large data sets that catalog the properties of stars in the Milky Way (Binney, 2011; McMillan & Binney, 2012, 2013; Rix & Bovy, 2013; Binney, 2013; Bland-Hawthorn & Gerhard, 2016). Bayesian methods and MCMC based schemes are now playing a prominent role in the analysis and interpretation of such large and complex data sets from, e.g., the GCS survey (Schönrich, Binney & Dehnen, 2010), the SEGUE survey (Bovy et al., 2012b), the APOGEE survey (Bovy et al., 2012a; Bovy & Rix, 2013), and the RAVE survey (Sharma et al., 2014; Piffl et al., 2014; Sanders & Binney, 2015). We focus on the problem of determining the mass distribution, or equivalently the gravitational potential of the Milky Way, using halo stars (Kafle et al., 2014) and disc masers (McMillan, 2017).
The observational data of stars in the Milky Way is in heliocentric coordinates and is in the form of angular positions on sky (Galactic longitude and latitude ), heliocentric distance (), heliocentric line of sight velocity (), and proper motion (tangential motion on the sky, and ). The velocity of halo stars can be described by a simple Gaussian model of the following form
[TABLE]
for which is the set of parameters that govern the velocity dispersion profiles and . The coordinates are in the Galactocentric reference frame. The observed heliocentric coordinates can be converted to Galactocentric coordinates using prior estimates of the location and the motion of the sun. For the stellar halo stars, tangential velocities cannot be accurately determined. The distance also has some uncertainty, . Hence we marginalize over unknown tangential velocities and true distance , to obtain
[TABLE]
The parameters can now be estimated using the data of multiple stars by
[TABLE]
The marginalization in Equation (103) can be handled in various ways. One can make use of deterministic numerical integration techniques (Gaussian quadrature) or one can achieve marginalization via Monte Carlo schemes making use of importance sampling. For Monte Carlo based integration one can make use of the MCMH algorithm discussed in Section 3.9. Alternatively, one can treat and as unknowns by setting them up as a BHM and estimate them alongside by making use of the Metropolis-within-Gibbs scheme discussed in Section 4.1. The radial velocity dispersion profile of halo stars computed using blue horizontal branch and red giant stars in the SEGUE survey is shown in Figure 12 (Kafle et al., 2014).
We now proceed to estimating the potential . Given , density of halo stars and anisotropy as function of distance from the Galactic center, one can solve for . Let be the set of parameters used to define the above profiles. So for given , the model makes a prediction for radial velocity dispersion at a location . This can be compared with the estimated from the observed data. The probability of model parameters is then given by
[TABLE]
The posterior distribution for the virial mass and the concentration parameter of the Milky Way halo using BHB and giant stars is shown in Figure 12 (Kafle et al., 2014).
We now discuss ways to incorporate prior information into the analysis. For example, the angular velocity of the Sun with respect to the Galactic Center is well constrained to be within (Reid & Brunthaler, 2004). The vertical force at kpc above the Sun, in terms of surface mass density, is given by (Kuijken & Gilmore, 1991). Let us denote such constraints by . Additional data sets , constraining a certain subset of parameters can also exist. For example, the tangent point velocities or terminal velocities as a function of Galactic longitude help to constrain the shape of the circular velocity curve . The additional priors and data all enter as multiplicative factors in the posterior, which is given by
[TABLE]
The halo stars carry little information about the mass distribution close to the center and in the disc of the Milky Way. Galactic masers associated with high mass star forming regions are very good tracers of the Milky Way disc which makes them excellent candidates for studying the potential of the Milky Way (Reid et al., 2009; McMillan, 2011; Reid et al., 2014; McMillan, 2017). Due to extremely accurate astrometric information using very long baseline interferometry, one has very accurate parallax () and proper motion measurements. When combined with line of sight velocities from Doppler shift of spectral lines, one ends up with full 6D phase space information for these sources. Maser sources, are young and have very little random motion which means their orbits are highly circularized. The distribution of velocities can be described by a simple three dimensional Gaussian function, i.e.
[TABLE]
Here, is any systematic streaming velocity associated with the masers and is the velocity dispersion about the mean motion. Now, we have
[TABLE]
The last term is evaluated using Equation (107), by converting from heliocentric coordinates to Galactocentric coordinates . Let denote the full data of stars then
[TABLE]
This when put in Equation (106) gives the posterior distribution of model parameters.
6 Concluding remarks
The power of the Bayesian probability theory lies in the fact that it is mathematically simple, being based on just two elementary rules, and yet it is broadly applicable. However, Bayesian calculations can be computationally demanding, and this has acted as a major bottleneck in the past. But with the increase of computational power, we have witnessed a sharp increase in the adoption of Bayesian techniques. More recently, free availability of black-box computer packages to efficiently sample from Bayesian posterior distributions has further accelerated the adoption of Bayesian techniques in astronomy.
Robust algorithms are now available to sample multidimensional and complex pdfs. The MH algorithm is still the main workhorse of MCMC methods. Good solutions now exist for the issue of application specific tuning of the proposal distribution in the MH algorithm, e.g., adaptive Metropolis schemes and the affine invariant samplers. The MH algorithm when combined with parallel tempering allows one to sample a wide variety of commonly occurring distributions. Situations, in which the posterior is not analytically tractable, can also now be solved using the Monte Carlo version of the MH algorithm.
Bayesian methods also provide a framework for model comparison via the use of Bayesian evidence. However, efficient computing of evidence still remains a challenge. Various alternate criteria for comparing models exist and importantly these can make use of the computed MCMC chain.
Bayesian hierarchical models further increase the usefulness of the Bayesian framework. They can solve missing data problems, marginalization over variables, convolution with observational uncertainties and so on. This makes a wide class of complex problems suddenly solvable. We showed that the Metropolis-within-Gibbs scheme is ideally suited for sampling posteriors generated by Bayesian hierarchical models and also provide a software for doing this.
Multimodal distributions still pose a problem for most MCMC algorithms. Parallel tempering can overcome them but requires more computational time and a careful choice of ladder. If dimensionality of the space being explored is very high and the distribution is complex, efficient exploration is not easy. Techniques are being developed to solve such problems that make use of derivatives of the posterior distribution, e.g., Hamiltonian Monte Carlo. However, more work is required in this area. Efficient exploration of multi-level hierarchical models will play an increasingly important role in future studies.
Communication of Bayesian results is also an area where we anticipate improvements. Traditionally, the estimates are reported by means of confidence intervals. However, there is much more information in the MCMC chain, in particular, the correlation between different variables. Also, there is an increasing need to feed results of one MCMC simulation into another. Such requirements are best addressed by reporting the full pdfs or the thinned samples from it. Other alternatives that are economical in terms storage space are to approximate the pdf by analytical functions or to employ Gaussian mixture models. We also need better tools to visualize the Bayesian-MCMC output, specially for high dimensional and complex hierarchical models. Such tools will allow us to understand as to why a model fails and how we should improve it.
There are key topics which we have not addressed here. Non-parametric Bayesian methods are becoming increasingly important, e.g., Gaussian processes (Beaumont, Zhang & Balding, 2002) and Dirichlet process mixture models (Neal, 2000). Magorrian (2014) uses this method to estimate the gravitational potential of the Milky Way.
Astronomy is no longer a data-starved science. With projects like the Large Synoptic Survey Telescope and the Square Kilometre Array, the quality and quantity of data are going to increase dramatically in the coming years. Better quality and larger quantity of data means that we can expect our data to answer more difficult questions, which in turn means more complex models (e.g. multi-level hierarchies and a higher dimensional parameter space). Given that MCMC is a computationally expensive scheme, there will be an increasing demand for such techniques that can make full use of the vast quantity of data on offer and deliver results in an affordable amount of time.
Equivalently, MCMC schemes that make use of computing environments with multiple processor and graphic processor units would also be useful. An MCMC chain is serial by nature and it requires special care to parallelize an MCMC algorithm, e.g., use of an ensemble of chains (Foreman-Mackey et al., 2013) or parallelizing the posterior computation by splitting up the data. Relaxing the condition of reversibility can lead to MCMC algorithms with faster mixing properties (Chen, Lovász & Pak, 1999; Diaconis, Holmes & Neal, 2000; Girolami & Calderhead, 2011). Finally, the development of approximate methods, both application specific and general, that can reduce the computational cost without significantly compromising the quality of results also hold great promise for analyzing large data sets. Approximate Bayesian computation is one such framework (Beaumont, Zhang & Balding, 2002); see Bovy (2016) for its use in astronomy to study the chemical homogeneity of stars in open clusters.
DISCLOSURE STATEMENT
The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
ACKNOWLEDGMENTS
I am indebted to my colleague Joss Bland-Hawthorn for suggesting this article and for supervising its development over the past year. I am thankful to James Binney, Jo Bovy, Brendon Brewer, Prajwal Kafle and Prasenjit Saha for numerous suggestions and discussions from which the review has benefited significantly. I am also thankful to David Hogg for words of encouragement on the draft. I acknowledge funding from a University of Sydney Senior Fellowship made possible by the office of the Deputy Vice Chancellor of Research, and partial funding from Bland-Hawthorn’s Laureate Fellowship from the Australian Research Council.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Akaike (1974) Akaike H. 1974. IEEE transactions on automatic control 19:716–723
- 2Andrieu & Robert (2001) Andrieu C, Robert CP. 2001. Controlled mcmc for optimal sampling. Tech. rep., Citeseer http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.2048
- 3Andrieu & Roberts (2009) Andrieu C, Roberts GO. 2009. The Annals of Statistics 37:697–725
- 4Andrieu & Thoms (2008) Andrieu C, Thoms J. 2008. Statistics and Computing 18:343–373
- 5Barker (1965) Barker A. 1965. Australian Journal of Physics 18:119–134
- 6Bayes & Price (1763) Bayes M, Price M. 1763. Philosophical Transactions 53:370–418
- 7Beaumont (2003) Beaumont MA. 2003. Genetics 164:1139–1160
- 8Beaumont, Zhang & Balding (2002) Beaumont MA, Zhang W, Balding DJ. 2002. Genetics 162:2025–2035
