Molecular simulations minimally restrained by experimental data
Huafeng Xu

TL;DR
This paper derives equations for restrained molecular simulations that minimally perturb unbiased distributions while fitting experimental data, improving the integration of experimental constraints into simulations.
Contribution
It introduces a new method to restrain simulations, ensuring minimal deviation from unbiased distributions while matching experimental observable values.
Findings
Provides equations for equilibrium distributions under restraints
Suggests a method for minimally perturbing simulations
Enables accurate reproduction of experimental data within uncertainties
Abstract
One popular approach to incorporating experimental data into molecular simulations is to restrain the ensemble average of observables to their experimental values. Here I derive equations for the equilibrium distributions generated by restrained ensemble simulations and the corresponding expected values of observables. My results suggest a method to restrain simulations so that they generate distributions that are minimally perturbed from the unbiased distributions while reproducing the experimental values of the observables within their measurement uncertainties.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Molecular simulations minimally restrained by experimental data
Huafeng Xu
Silicon Therapeutics LLC, Boston, USA
Abstract
One popular approach to incorporating experimental data into molecular simulations is to restrain the ensemble average of observables to their experimental values. Here I derive equations for the equilibrium distributions generated by restrained ensemble simulations and the corresponding expected values of observables. My results suggest a method to restrain simulations so that they generate distributions that are minimally perturbed from the unbiased distributions while reproducing the experimental values of the observables within their measurement uncertainties.
I Introduction
Molecular simulations are routinely used to generate models of molecular structures and their equilibrium distributions Dror et al. (2012). They provide structural and dynamic information at atomistic resolution that are critical to the interpretation of experimental data Bottaro and Lindorff-Larsen (2018). Inaccuracy in the empirical force fields used to compute the potential energy of the biomolecules in the simulations, however, can cause the simulations to generate inaccurate equilibrium distributions of molecular structures, manifested by the disparity between the experimentally measured values of observables and the predictions by the simulations Gunsteren, Dolenc, and Mark (2008). Biasing the simulations toward reproducing the experimental values of the observables is a promising approach to improving the accuracy of the simulations Boomsma et al. (2014); Perez et al. (2016). How to introduce minimal bias, however, remains an open question, and no such method exists that takes into account the measurement uncertainties associated with the experimental data.
Because many experimental techniques, such as nuclear magnetic resonance (NMR) and small-angle X-ray scattering (SAXS), measure the ensemble average of observables, a natural and popular method to incorporate such experimental data is to simultaneously simulate many replicas of the molecular system while restraining the average of the instantaneous values of any observable of interest to be close to its experimental value Lindorff-Larsen et al. (2005); Vendruscolo (2007); Roux and Islam (2013); Hummer and Köfinger (2015). I will refer to such replica-based restrained simulations simply as restrained ensemble simulations. The equilibrium distribution generated by restrained ensemble simulations depends on the number of replicas and the strength of the restraints, the latter of which should reflect the measurement uncertainties of the observables Hummer and Köfinger (2015). There lacks, however, a clear theoretical underpinning for the choice of the number of replicas Olsson and Cavalli (2015).
The restrained simulations should be no more biased by the restraints than is necessary, which is a principle that motivated the alternative maximum entropy method, which seeks a distribution that reproduces the experimental values of the observables and is minimally perturbed from the unbiased distribution Boomsma, Ferkinghoff-Borg, and Lindorff-Larsen (2014); Różycki, Kim, and Hummer (2011); Groth et al. (1999); Massad et al. (2007). It has been established that in some limit of infinite number of replicas and infinitely strong restraints, restrained ensemble simulations generate a statistically equivalent distribution as the maximum entropy method Pitera and Chodera (2012); Roux and Weare (2013); Cavalli, Camilloni, and Vendruscolo (2013); Olsson and Cavalli (2015). In practice, however, the restraining strength is finite, in which case the two methods are not equivalent. In contrast to restrained ensemble simulations, the maximum entropy method constrains every observable to its exact experimental value, thus it does not take into account the measurement uncertainties, which are ubiquitous in all experiments.
Here I present theoretical results that enable the determination of the largest number of replicas to use in restrained ensemble simulations with fixed strength of restraints so that they generate a distribution minimally perturbed from the unbiased distribution while reproducing the experimental values of the observables within their measurement uncertainties. As I will demonstrate below, this is equivalent to the determination of the weakest strength of restraints to use at fixed number of replicas, provided that the latter is sufficiently large. The key result is an equation (Eq. LABEL:eqn:D-expected) that predicts the expected value of any–restrained and unrestrained–observable in restrained ensemble simulations for different numbers of replicas, thus permitting the selection of the appropriate number without computationally expensive trial and error. In addition, I derive an equation (Eq. 12) for the distribution generated by restrained ensemble simulations with a finite number of replicas and a finite restraining strength, which can be used to perform reweighted simulations that take into account of the experimental data with their attendant measurement uncertainties. The validity of these theoretical results are demonstrated numerically by Monte Carlo simulations. I illustrate the application of these results using the molecular dynamics (MD) simulations of the polyalanine peptide Ala5 restrained by experimental nuclear magnetic resonance (NMR) data.
II Theory
Consider an ensemble consisting of replicas of the molecular system, each with the atomic coordinates . Experiments measure the ensemble average of the individual contributions of the observables: , where are the instantaneous values of the vector of observables for the ’th replica. In an unbiased simulation, may deviate from experimental values . The posterior distribution given the experimental data Rieping, Habeck, and Nilges (2005); Hummer and Köfinger (2015) is
[TABLE]
where is the Boltzmann distribution of the molecular configurations under the potential energy function . Assuming a normal distribution in each experimental value, , where is the inverse of the statistical variances of the experimental measurements, becomes
[TABLE]
which is the distribution in the restrained ensemble simulations. The strength of the restraints, , is thus related to the measurement uncertainties associated with the experimental data. In the restrained ensemble simulations, replicas of the molecular system are simulated with the same energy function and at the same temperature, and they interact with one another only through the quadratic term in the exponent in Eq. 2.
A similar equation was derived by Hummer and Köfinger (Eq. 26 of Ref. Hummer and Köfinger, 2015), with the difference that they scaled the restraining term by and introduced a parameter () expressing the level of confidence in the unbiased reference distribution. My results below show that these two equations are equivalent at large .
For a fixed , the marginal distribution for a single replica, , tends to the unbiased distribution with increasing . Larger thus corresponds to higher confidence in the unbiased distribution. A reasonable choice of is to take the largest value such that the expected values of the observables in the restrained ensemble simulations remain within the measurement uncertainties of the experimental values: this generates a distribution minimally perturbed from the unbiased distribution while still reproducing the experimental data.
I will analyze the behavior of restrained ensemble simulations at sufficiently large numbers of replicas, i.e. . Let
[TABLE]
be a biased distribution for an arbitrary vector , which becomes the distribution generated by the maximum entropy method if is the corresponding Lagrangian multiplier Roux and Weare (2013). Dropping the subscript from the constant vector , the partition function for the restrained ensemble of replicas is
[TABLE]
where signifies the ensemble average according to the probability distribution of independent replicas (i.e., ), each sampled according to the biased distribution , and is the corresponding probability distribution of the random variable . In this ensemble of independent replicas, are independent and identically distributed (i.i.d.) random variables. Acccording to the central limit theorem, the random variable , for sufficiently large , tends to the normal distribution , where are the expected values of the observables in the biased distribution (I will use to denote the expected values in the biased distribution with parameter ), is the corresponding covariance matrix, and is the determinant of matrix . If and are close, in the sense that
[TABLE]
for a sufficiently small and some value of , such that the normal distribution is a good approximation to for all around , can be integrated out, and the partition function in Eq. 4 becomes
[TABLE]
where is the identity matrix. Eq. 6 forms the basis for deriving the key results of this work.
My first key result is that the expected value of any observable, restrained or unrestrained, in the restrained ensemble simulations can be predicted from the biased simulation of . Let be the vector of restrained and unrestrained observables, where are the restrained observables and are the unrestrained observables. Introducing
[TABLE]
, and , the expected values of the observables in the restrained ensemble simulations are given by
[TABLE]
where is the covariance between all the observables in the biased simulation.
Writing in the block form
[TABLE]
where is the covariance matrix for the unrestrained observables and is the covariance matrix between the unrestrained and the restrained observables, the expected values of the restrained observables and the unrestrained observables are, respectively,
[TABLE]
and
[TABLE]
Eq. LABEL:eqn:restrained-D-expected enables the determination of the largest such that the expected values of the restrained observables remain within the measurement uncertainties of their experimental values.
My second key result is that the marginal distribution for a single replica generated by the restrained ensemble simulations is
[TABLE]
where I used Eq. 6 for both and , and for large .
For , , and Eq. 12 reduces to
[TABLE]
which depends on through the product . This suggests that a restrained ensemble simulation with replicas and the strength of restraints is equivalent to another one with replicas and the strength of restraints . This equivalency, to my knowledge, has not been previously demonstrated except for the harmonic potentials Roux and Weare (2013), and it mathematically justifies the scaling by of the restraining term proposed by Hummer and Köfinger Hummer and Köfinger (2015).
Eq. 12 suggests that the restrained ensemble simulations generate the same distribution as the maximum entropy method if the number of replicas satisfies the following inequality
[TABLE]
This condition is the generalization of the condition derived for the special case of harmonic potentials Roux and Weare (2013). Under this condition, , the corresponding single-replica distribution becomes
[TABLE]
which is the same as if takes the value of the Lagrangian multiplier in the maximum entropy method such that . Under this condition, Eq. LABEL:eqn:restrained-D-expected becomes .
The Kullback-Leibler divergence from the distribution in the restrained ensemble simulations (Eq. 12) to that of is
[TABLE]
where
[TABLE]
Eq. 16 enables the quantitative estimation of the difference between the distributions generated by the restrained ensemble simulations and by the maximum entropy method. Minimization of with respect to yields the number of replicas to use such that the restrained ensemble simulations generate a distribution closest to that of the maximum entropy method.
III Results and Discussions
I demonstrate the validity of Eq. LABEL:eqn:D-expected and Eq. 16 using the Monte Carlo simulations of a 2-dimensional potential that generates a distribution of mixed Gaussians:
[TABLE]
where and are the mixing coefficients constrained by . The parameters used in the simulations are , , and , . The restrained ensemble simulations (Eq. 2) and the maximum entropy method are applied to a potential with , restraining two observables, and , to reference values that are equal to the observables’ mean values in a reference distribution that has . The restraining strength is .
For sufficiently large , Eq. LABEL:eqn:D-expected predicts the expected values of both the restrained and unrestrained observables in quantitative agreement with their mean values in the restrained ensemble simulations (Fig. 1). As increases, the distribution generated by the restrained ensemble simulations tends to the unbiased distribution (). Notably, is relatively insensitive to around the largest value of for which the expected values of the restrained observables remain close to their reference values .
My results suggest the following method of setting so that the simulations are minimally restrained by experimental data: Iteratively perform a series of biased simulations of , updating by either
[TABLE]
or other update schemes Stinis (2005) such that . For each iteration, compute using Eq. LABEL:eqn:restrained-D-expected and select the largest so that the expected values of all the restrained observables are within the measurement uncertainties from their experimental values. Continue until no longer changes.
I illustrate this method of setting by MD simulations of the polyalanine peptide Ala5 restrained by the couplings measured by NMR experiments Graf et al. (2007); Wickstrom, Okur, and Simmerling (2009). All MD simulations were performed using the OpenMM software Eastman et al. (2013), with Amber99SB-ILDN force field Lindorff-Larsen et al. (2010) and a generalized Born implcit solvent model Onufriev, Bashford, and Case (2004). Langevin integrator Leimkuhler and Matthews (2015) was used to maintain the temperature of the molecular system at 300 K. When Ala5 was simulated without any restraints, there was substantial discrepancy between the predictions by the unbiased simulation and the experimental values for the couplings between H and Hα (denoted as for ), which have been used to characterize the four backbone torsions of Ala5 Vögeli et al. (2007) (Fig. 2). Below, were used as the restrained observables.
The same molecular system was then simulated with a series of biasing potentials (Eq. 3), in which in the ’th simulation was determined by Eq. 19. Eq. LABEL:eqn:restrained-D-expected was used to predict the expected values of the restrained observables () in the restrained ensemble simulations for different values of , using the values of and the corresponding and in different biased simulations as inputs. The largest () for which the predicted values of all the restrained observables are within the statistical uncertainty from the experimental values is determined for each . In this case, only 1 iteration of biased simulation was necessary to have a converged estimate of (Fig. 2).
In practice, it is easier to select the number of replicas based on the available computing resources (e.g., the number of CPU cores). Once has been determined by the above procedure, one can perform the restrained ensemble simulations using replicas and the scaled strength of restraints , because the restrained ensemble simulations with the parameters and with the parameters generate equivalent marginal distributions (Eq. LABEL:eqn:marginal-func-of-NiK).
My results suggest that the restrained ensemble simulations at large are equivalent to a single-replica simulation of the reweighted distribution in Eq. 12 (which becomes Eq. LABEL:eqn:marginal-func-of-NiK as ). In this case, can be regarded as a parameter that tunes the strength of the restraints, which are proportional to the inverse of the covariance of the experimental measurements. As tends to infinity, the resulting distribution becomes unbiased. To demonstrate this reweighting approach, the Ala5 system was simulated according to the marginal distribution in Eq. 12 for different values of , using different values of and the corresponding and as inputs. For , the expected values of the restrained () and unrestrained ( and the radius of gyration ) observables in these reweighted simulations are in good agreement with the corresponding values predicted by Eq. LABEL:eqn:D-expected (Fig. 2). Such reweighted simulations may offer a simple alternative to replica-based restrained ensemble simulations as a way to incorporate into MD simulations experimental data with their associated measurement uncertainties.
IV Acknowledgments
H. X. thanks Woody Sherman for helpful suggestions on the manuscript.
V Notes
The author declares no competing financial interest.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Dror et al. (2012) R. O. Dror, R. M. Dirks, J. Grossman, H. Xu, and D. E. Shaw, “Biomolecular Simulation: A Computational Microscope for Molecular Biology,” Annual Review of Biophysics 41 , 429–452 (2012) . · doi ↗
- 2Bottaro and Lindorff-Larsen (2018) S. Bottaro and K. Lindorff-Larsen, “Biophysical experiments and biomolecular simulations: A perfect match?” Science 361 , 355–360 (2018) . · doi ↗
- 3Gunsteren, Dolenc, and Mark (2008) W. F. v. Gunsteren, J. Dolenc, and A. E. Mark, “Molecular simulation as an aid to experimentalists,” Current Opinion in Structural Biology 18 , 149–153 (2008) . · doi ↗
- 4Boomsma et al. (2014) W. Boomsma, P. Tian, J. Frellsen, J. Ferkinghoff-Borg, T. Hamelryck, K. Lindorff-Larsen, and M. Vendruscolo, “Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts,” Proceedings of the National Academy of Sciences 111 , 13852–13857 (2014) . · doi ↗
- 5Perez et al. (2016) A. Perez, J. A. Morrone, E. Brini, J. L. Mac Callum, and K. A. Dill, “Blind protein structure prediction using accelerated free-energy simulations,” Science Advances 2 , e 1601274 (2016) . · doi ↗
- 6Lindorff-Larsen et al. (2005) K. Lindorff-Larsen, R. B. Best, M. A. De Pristo, C. M. Dobson, and M. Vendruscolo, “Simultaneous determination of protein structure and dynamics,” Nature 433 , 128 (2005) . · doi ↗
- 7Vendruscolo (2007) M. Vendruscolo, “Determination of conformationally heterogeneous states of proteins,” Current Opinion in Structural Biology 17 , 15–20 (2007) . · doi ↗
- 8Roux and Islam (2013) B. Roux and S. M. Islam, “Restrained-Ensemble Molecular Dynamics Simulations Based on Distance Histograms from Double Electron–Electron Resonance Spectroscopy,” The Journal of Physical Chemistry B 117 , 4733–4739 (2013) . · doi ↗
