Self Consistent Path Sampling: Making Accurate All-Atom Protein Folding Simulations Possible on Small Computer Clusters

S. Orioli; S. A Beccara; P. Faccioli

arXiv:1705.02180·physics.bio-ph·May 8, 2017


TL;DR

This paper presents a novel iterative algorithm for accurate all-atom protein folding simulations that significantly reduces computational costs, enabling such simulations on small computer clusters.

Contribution

The authors introduce a new path sampling algorithm based on the path integral formalism that efficiently computes protein folding pathways with realistic force fields.

Findings

01

Validated on a fast folding protein, matching ultra-long MD simulations.

02

Reduces computational cost from supercomputer to desktop level.

03

Provides a stochastic estimate of the reaction coordinate.

Abstract

We introduce a powerful iterative algorithm to compute protein folding pathways, with realistic all-atom force fields. Using the path integral formalism, we explicitly derive a modified Langevin equation which samples directly the ensemble of reactive pathways, exponentially reducing the cost of simulating thermally activated transitions. The algorithm also yields a rigorous stochastic estimate of the reaction coordinate. After illustrating this approach on a simple toy model, we successfully validate it against the results of ultra-long plain MD protein folding simulations for a fast folding protein (Fip35), which were performed on the Anton supercomputer. Using our algorithm, computing a folding trajectory for this protein requires only 1000 core hours, a computational load which could even be carried out on a desktop workstation.


Tables

Table 1: Parameters used in the definition of the two-dimensional asymmetric funnelled energy surface, Eq. (IV.1).

$A_{1}$	$A_{2}$	$A_{3}$	$s_{1}$	$s_{2}$	$s_{3}$	$w$	$y_{m}$	$x_{m}$
30	20	6	1	2	0.5	0.03	0	1.5

Equations

$m_{i}\ddot{x}_{i}$

$$ p(X_{N},t|X_{U}) = \int_{X_{U}}^{X_{N}} \mathcal{D}X\, e^{-S[X]}, $$

$$ S[X] \equiv \sum_{i=1}^{N} \Gamma_{i} \int_{0}^{t} d\tau\, \left( m_{i}\ddot{x}_{i} + m_{i}\gamma_{i}\dot{x}_{i} + \nabla_{i}U \right)^{2} $$

$P_{U\to N}(t)$

$F_{i}(X,z_{m})$

$z(X)$

$$ C_{ij}(X) = \frac{1 - \left( |x_{i}-x_{j}|/r_{0} \right)^{6}}{1 - \left( |x_{i}-x_{j}|/r_{0} \right)^{10}} $$

$$ p_{\textrm{rMD}}(X_{N},t|X_{U}) \equiv \int_{X_{U}}^{X_{N}} \mathcal{D}X \int_{z(X_{U})} \mathcal{D}z_{m}\; e^{-S_{\textrm{rMD}}[X,z_{m}]} \cdot \delta\!\left[ \dot{z}_{m}(\tau) - \dot{z}(X)\,\theta(-\dot{z}(X))\,\theta(z_{m}(\tau)-z(X)) \right] $$

$S_{\textrm{rMD}}$

$$ T[X] = \sum_{i=1}^{N} \frac{1}{m_{i}\gamma_{i}} \int_{0}^{t} d\tau\, \left| F_{i}[X(\tau)] \right|^{2} $$

$$ p(X_{N},t|X_{U}) = \int_{X_{U}}^{X_{N}} \mathcal{D}X\; e^{-S[X]} \int_{\bar{s}(0)} \mathcal{D}s_{m} \int_{\bar{w}(0)} \mathcal{D}w_{m} \cdot \delta\!\left[ \dot{w}_{m} - \dot{\bar{w}}\,\theta(-\dot{\bar{w}})\,\theta(w_{m}-\bar{w}) \right] \delta\!\left[ \dot{s}_{m} - \dot{\bar{s}}\,\theta(-\dot{\bar{s}})\,\theta(s_{m}-\bar{s}) \right], $$

$\bar{s}(\tau)$

$\bar{w}(\tau)$

$s_{\lambda}[X,\tau]$

$w_{\lambda}[X,\tau]$

$$ ||C_{ij}(\tau) - C_{ij}(t^{\prime})||^{2} \equiv \frac{\sum_{|i-j|>35}^{N} \left[ C_{ij}[X(\tau)] - C_{ij}[X(t^{\prime})] \right]^{2}}{\sum_{|i-j|>35}^{N} \left[ C_{ij}^{0} \right]^{2}} $$

$$ p(X_{N},t|X_{U}) = \lim_{\lambda\to\infty} p_{\lambda}(X_{N},t|X_{U}), $$

$$ p_{\lambda}(X_{N},t|X_{U}) \equiv \int_{X_{U}}^{X_{N}} \mathcal{D}X \int \mathcal{D}s_{m} \int \mathcal{D}w_{m}\; e^{-S_{\lambda}[X,s_{m},w_{m}]}\, \delta\!\left[ \dot{s}_{m} - \dot{s}_{\lambda}\,\theta(-\dot{s}_{\lambda})\,\theta(s_{m}-s_{\lambda}) \right] \delta\!\left[ \dot{w}_{m} - \dot{w}_{\lambda}\,\theta(-\dot{w}_{\lambda})\,\theta(w_{m}-w_{\lambda}) \right]. $$

$$ S_{\lambda} \equiv \int_{0}^{t} d\tau \sum_{i=1}^{N} \Gamma_{i} \left[ m_{i}\ddot{x}_{i} + m_{i}\gamma_{i}\dot{x}_{i} + \nabla_{i}U - F_{i}^{w} - F_{i}^{s} \right]^{2} $$

$$ F_{i}^{w}(X,w_{m}) = -k_{w}\,\nabla_{i}w_{\lambda}\,(w_{\lambda}-w_{m})\,\theta(w_{\lambda}-w_{m}) $$

$$ F_{i}^{s}(X,s_{m}) = -k_{s}\,\nabla_{i}s_{\lambda}\,(s_{\lambda}-s_{m})\,\theta(s_{\lambda}-s_{m}) $$

$$ (w_{m}-w_{\lambda}) \to 0 \quad \textrm{and} \quad (s_{m}-s_{\lambda}) \to 0. $$

$$ \langle C_{ij}(t^{\prime})\rangle = \frac{\int_{X_{U}}^{X_{N}} \mathcal{D}X\; C_{ij}[X(t^{\prime})]\, e^{-S[X]}}{\int_{X_{U}}^{X_{N}} \mathcal{D}X\; e^{-S[X]}} $$

$$ s_{\lambda}[X(\tau)] \simeq 1 - \frac{\frac{1}{t}\int_{0}^{t} dt^{\prime}\, t^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}}{\int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}} $$

$$ w_{\lambda}[X(\tau)] \simeq w_{0} - \log \int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}. $$

$$ T[X] = \sum_{i=1}^{N} \Gamma_{i} \int_{0}^{t} d\tau\, \left| F_{i}^{w}[X(\tau)] + F_{i}^{s}[X(\tau)] \right|^{2} $$

$$ \sigma(X) \equiv \frac{1}{N_{U}\,t} \sum_{k=1}^{N_{U}} \frac{\int_{0}^{t} dt^{\prime}\, t^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle_{\lambda}^{k}||^{2}}}{\int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle_{\lambda}^{k}||^{2}}}. $$

$U(x,y)$

$$ F^{w}(X,w_{m}) = -k_{w}\,\nabla w_{\lambda}\,(w_{\lambda}-w_{m})\,\theta(w_{\lambda}-w_{m}) $$


Taxonomy

Topics: Protein Structure and Dynamics · RNA and protein synthesis mechanisms · Enzyme Structure and Function

Full text

Self Consistent Path Sampling: Making Accurate All-Atom Protein Folding Simulations Possible on Small Computer Clusters

S. Orioli

Dipartimento di Fisica, Università degli Studi di Trento, Via Sommarive 14, Povo (Trento), I-38123 Italy

INFN-TIFPA Via Sommarive 14, Povo (Trento), I-38123 Italy

S. a Beccara

Dipartimento di Fisica, Università degli Studi di Trento, Via Sommarive 14, Povo (Trento), I-38123 Italy

INFN-TIFPA Via Sommarive 14, Povo (Trento), I-38123 Italy

P. Faccioli (corresponding author: [email protected])

Dipartimento di Fisica, Università degli Studi di Trento, Via Sommarive 14, Povo (Trento), I-38123 Italy

INFN-TIFPA Via Sommarive 14, Povo (Trento), I-38123 Italy

Abstract

We introduce a powerful iterative algorithm to compute protein folding pathways, with realistic all-atom force fields. Using the path integral formalism, we explicitly derive a modified Langevin equation which samples directly the ensemble of reactive pathways, exponentially reducing the cost of simulating thermally activated transitions. The algorithm also yields a rigorous stochastic estimate of the reaction coordinate. After illustrating this approach on a simple toy model, we successfully validate it against the results of ultra-long plain MD protein folding simulations for a fast folding protein (Fip35), which were performed on the Anton supercomputer. Using our algorithm, computing a folding trajectory for this protein requires only $\sim 10^{3}$ core hours, a computational load which could even be carried out on a desktop workstation.


I Introduction

The protein folding pathway problem consists in clarifying the pattern of structural changes through which a given denatured protein reaches its native structure Dill review; PNAS review. Its solution would shed light on the main forces guiding the folding reaction and provide valuable insight into the origin of pathogenic misfolding events.

Even using the most powerful special-purpose supercomputer, plain Molecular Dynamics (MD) simulations of protein folding are feasible only for small chains (consisting of up to $\sim$ 100 amino acids), with folding times within the ms time scale Anton2. On the other hand, most proteins involved in biologically relevant folding or misfolding reactions contain several hundred amino acids and have folding times which can be as long as seconds, or even minutes.

To overcome the computational limitations of plain MD simulations, more advanced algorithms have been proposed in the literature; see e.g. Refs. TPS; milestoning; MSM1; metadynamics; TAMD; Tuckerman. Some of these techniques were successfully applied to investigate the kinetics or thermodynamics of structural reactions involving polypeptide chains, including protein-ligand binding and even the folding of small protein fragments. However, the very slow folding reactions of complex proteins are still far beyond the reach of any of these techniques.

To our knowledge, the only reaction path sampling approach which has been successfully applied to characterise in full atomistic detail the folding reactions of large and topologically complex proteins is the so-called Bias Functional (BF) approach BFA. For example, this method was recently used to investigate the folding and misfolding of several serpin proteins, which are made of nearly 400 amino acids and have folding times as long as tens of minutes. It was shown that the BF method not only agrees with all existing experimental information on the folding mechanism, but also correctly predicts the effect of point mutations on the protein misfolding propensity Serpinfolding. In Ref. PNAS2 a preliminary version of this algorithm PNAS1 was used to study a large conformational transition of the same proteins, which occurs over about one hour. In Ref. IM7IM9 it was used to explain the puzzle of the different folding kinetics of two structurally identical proteins, while in Ref. DRPknots it was applied to explore the folding mechanism of a protein with a knotted native state.

The BF method exploits a rigorous variational principle to select the most reliable folding trajectory within a set of trial pathways, previously generated by means of a specific type of biased dynamics, called ratchet-and-pawl MD (rMD) rMD1; rMD2. In an rMD simulation, no bias is applied to the protein as long as it spontaneously progresses towards the native state. A harmonic history-dependent force is introduced only to discourage spontaneous backtracking towards the reactant.

Clearly, if this biasing force were defined in terms of a good reaction coordinate (for example, the direction orthogonal to the iso-committor hyper-surfaces in the protein configuration space), then the rMD scheme would provide the correct description of the folding mechanism. In practice, however, rMD simulations of protein folding are biased along the direction set by a specific collective coordinate rMD2, closely related to the instantaneous fraction of native contacts. Even though the BF variational condition is expected to improve on the results of plain rMD simulations, a sub-optimal choice of biasing coordinate may give rise to systematic errors which are hard to estimate a priori.

In this work, we introduce a reaction path sampling algorithm which makes it possible to generate protein folding trajectories without relying on any model-dependent choice of biasing coordinate. Instead, the reaction coordinate is derived self-consistently and represents an output of the calculation, providing insight into the folding mechanism.

This new scheme is not heuristically postulated; rather, it follows directly from the Langevin dynamics, with no additional approximation other than a mean-field estimate of some auxiliary variable. Direct comparison with the results of ultra-long plain MD simulations performed on the Anton supercomputer shows that it provides a realistic representation of the reactive dynamics. The computational cost of this new scheme is only a factor of 2-3 larger than that of standard BF simulations, and thus still extremely low.

In the next section, we review the path integral representation of the Langevin dynamics and briefly discuss the standard rMD scheme and the BF approach for protein folding simulations. Section III contains the main results of this work, providing the mathematical derivation of our new self-consistent algorithm. In the subsequent section, we first illustrate its implementation on a very simple toy model and then apply it to perform realistic all-atom protein folding calculations, benchmarking the results against those obtained from ultra-long plain MD simulations. The main results are summarized in the conclusion section.

II Theoretical Setup

Throughout this paper we shall assume that the atoms in the protein obey the Langevin equation

$$ m_{i}\ddot{\bf x}_{i} = -m_{i}\gamma_{i}\dot{\bf x}_{i} - \nabla_{i}U(X) + \eta_{i}(t), \tag{1} $$

where $X=({\bf x}_{1},\ldots,{\bf x}_{N})$ denotes the collection of all atomic coordinates, $-\nabla_{i}U(X)$ is an atomistic force field, $\eta_{i}$ is a delta-correlated white noise obeying the standard fluctuation-dissipation relationship, and $m_{i}$ and $\gamma_{i}$ denote the atomic masses and viscosities, respectively. Note that, for the sake of notational simplicity, we are considering here models with an implicit solvent description. However, all the results of the present work also hold for an explicit solvent description, as long as the dynamics of the solvent molecules is described by a Langevin equation.

Within the stochastic dynamics defined by Eq. (1), the conditional probability density $p(X_{N},t|X_{U})$ for the protein to perform a transition from an arbitrary denatured configuration $X_{U}$ to a native configuration $X_{N}$ in a time interval $t$ can be written in path integral representation:

$$ p(X_{N},t|X_{U}) = \int_{X_{U}}^{X_{N}} \mathcal{D}X\, e^{-S[X]}, \tag{2} $$

where $S[X]$ is the so-called Onsager-Machlup (OM) action:

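$$ S[X] \equiv \sum_{i=1}^{N} \Gamma_{i} \int_{0}^{t} d\tau\, \left( m_{i}\ddot{\bf x}_{i} + m_{i}\gamma_{i}\dot{\bf x}_{i} + \nabla_{i}U \right)^{2}, \tag{3} $$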

where $\Gamma_{i}=\frac{1}{4\gamma_{i}m_{i}k_{B}T}$ .
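As an illustration of how the OM action weights a path, the following is a minimal sketch that evaluates a discretised version of Eq. (3) on a stored trajectory. The central finite-difference scheme, the function name `om_action` and the callable `grad_u` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def om_action(x, dt, mass, gamma, kT, grad_u):
    """Discretised Onsager-Machlup action of a trajectory.

    x      : array of shape (n_frames, n_atoms, dim), positions at uniform time step dt
    mass   : per-atom masses m_i
    gamma  : per-atom viscosities gamma_i
    kT     : thermal energy k_B T
    grad_u : hypothetical callable, frame (n_atoms, dim) -> gradient of U, same shape
    """
    x = np.asarray(x, dtype=float)
    mass = np.asarray(mass, dtype=float)[None, :, None]
    gamma = np.asarray(gamma, dtype=float)[None, :, None]
    # Central finite differences, defined on the interior frames only
    acc = (x[2:] - 2.0 * x[1:-1] + x[:-2]) / dt**2
    vel = (x[2:] - x[:-2]) / (2.0 * dt)
    grad = np.stack([grad_u(frame) for frame in x[1:-1]])
    # Residual of the noiseless Langevin equation: m a + m gamma v + grad U
    residual = mass * acc + mass * gamma * vel + grad
    big_gamma = 1.0 / (4.0 * gamma * mass * kT)   # Gamma_i of the text
    return float(np.sum(big_gamma * residual**2) * dt)
```

A path that satisfies the noiseless equation of motion has zero action; any deviation must be paid for by the noise, and is penalised quadratically.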

The probability for the protein to be in the folded state at time $t$ , provided it was unfolded at the initial time is obtained by integrating the point-to-point conditional probability (2) over the final configurations in the native state and averaging over initial conditions in the unfolded state, i.e.

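$$ P_{U\to N}(t) = \frac{1}{Z}\int dX_{N}\, h_{N}(X_{N}) \int dX_{U}\, h_{U}(X_{U})\, e^{-U(X_{U})/k_{B}T}\, p(X_{N},t|X_{U}), \tag{4} $$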

where $h_{U}(X)$ and $h_{N}(X)$ are the characteristic functions of the unfolded and native state, respectively, and $Z$ is the system’s partition function. Clearly, this probability is exponentially small for time intervals $t$ much smaller than the inverse folding rate $1/k_{f}$ .

II.1 Ratchet-and-Pawl MD and Bias Functional

In order to set the stage for introducing our self-consistent path sampling algorithm, it is instructive to first review the standard rMD formalism and the related BF approach. rMD is an algorithm which makes it possible to generate folding trajectories in time intervals $t$ much smaller than the inverse folding rate. In this dynamics, an unphysical biasing force is introduced to discourage backtracking towards the unfolded state rMD1; rMD2:

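$$ {\bf F}_{i}(X,z_{m}) = -k_{R}\,\nabla_{i}z(X)\,\left(z(X)-z_{m}\right)\,\theta\!\left(z(X)-z_{m}\right). \tag{5} $$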

Here, $z(X)$ is a collective coordinate defined as

$$ z(X) \equiv \sum_{|i-j|>35} \left[ C_{ij}(X) - C_{ij}^{0} \right]^{2}, \tag{6} $$

which measures a Frobenius-type distance between the instantaneous contact map $C_{ij}(X)$ and the native contact map $C_{ij}^{0}=C_{ij}(X_{N})$ , with continuous entries defined by

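$$ C_{ij}(X) = \frac{1 - \left( |{\bf x}_{i}-{\bf x}_{j}|/r_{0} \right)^{6}}{1 - \left( |{\bf x}_{i}-{\bf x}_{j}|/r_{0} \right)^{10}}, \tag{7} $$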

where $r_{0}$ is an arbitrary reference distance, typically set to $7.5$ Å. Note that the constraint $|i-j|>35$ is introduced in order to exclude topologically close atoms, whose relative distance is restrained by covalent bonds. Furthermore, in order to enforce a linear scaling of the computational cost with the number of atoms, a cut-off is usually introduced that sets to 0 the entries $C_{ij}(r_{ij})$ for atomic distances larger than a threshold, $r_{ij}>r_{c}$. A typical value is $r_{c}\simeq 1.2$ nm.
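The contact map and the collective coordinate $z(X)$ described above can be sketched in a few lines of NumPy. This is only an illustration: distances are assumed to be in Å (so $r_{c}=12$ Å), the function names are hypothetical, and the dense pairwise distance matrix used here does not reproduce the linear scaling that a neighbour-list implementation would give. Note that the switching function has a removable singularity at $r_{ij}=r_{0}$, where its limit is $6/10$.

```python
import numpy as np

def contact_map(coords, r0=7.5, r_cut=12.0, min_sep=35):
    """Continuous contact map for coordinates of shape (n_atoms, 3), in Angstrom."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    r = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    x = r / r0
    # (1 - x^6) / (1 - x^10) is 0/0 at x = 1; its limit there is 6/10
    near_one = np.isclose(x, 1.0)
    safe_x = np.where(near_one, 0.5, x)
    c = (1.0 - safe_x**6) / (1.0 - safe_x**10)
    c = np.where(near_one, 0.6, c)
    c = np.where(r > r_cut, 0.0, c)            # distance cut-off
    idx = np.arange(n)
    far_in_sequence = np.abs(idx[:, None] - idx[None, :]) > min_sep
    return np.where(far_in_sequence, c, 0.0)   # keep only |i - j| > min_sep

def z_coordinate(coords, native_map, **kwargs):
    """Squared Frobenius-type distance between the current and native contact maps."""
    return float(np.sum((contact_map(coords, **kwargs) - native_map) ** 2))
```

In this sketch the sum runs over ordered pairs $(i,j)$, so each contact is counted twice; this only rescales $z$ by a constant factor.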

In Eq. (5) $z_{m}(\tau)$ is the minimum value assumed by the collective variable $z$ along the rMD trajectory, up to time $\tau$ . Note that the biasing force (5) is not active whenever the chain spontaneously evolves towards more native-like configurations ( $z(t+\Delta t)<z_{m}$ ). It sets in only to discourage backtracking towards the unfolded state, i.e. for $z(t+\Delta t)>z_{m}$ .
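The history dependence of the ratchet can be made concrete with a small sketch of a single rMD update, assuming the harmonic form of the bias described above with spring constant $k_{R}$. The function name `ratchet_step` and its signature are hypothetical.

```python
import numpy as np

def ratchet_step(z, grad_z, z_min, k_r):
    """One ratchet-and-pawl update.

    z      : current value of the collective coordinate
    grad_z : gradient of z with respect to the coordinates (any array shape)
    z_min  : minimum value of z reached so far along the trajectory
    k_r    : ratchet spring constant
    Returns the biasing force and the updated running minimum.
    """
    grad_z = np.asarray(grad_z, dtype=float)
    z_min = min(z_min, z)      # spontaneous progress simply moves the pawl
    excess = z - z_min         # positive only when the system backtracks
    if excess > 0.0:
        force = -k_r * grad_z * excess
    else:
        force = np.zeros_like(grad_z)
    return force, z_min
```

For the sequence $z = 5, 4, 4.5, 3$ the force is non-zero only at the third step, where the coordinate has increased above its running minimum of 4.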

The path integral representation of the conditional probability to perform a transition from $X_{U}$ to $X_{N}$ in time $t$ in rMD performed within the Langevin dynamics was introduced in Ref. BFA and reads:

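$$ p_{\textrm{rMD}}(X_{N},t|X_{U}) \equiv \int_{X_{U}}^{X_{N}} \mathcal{D}X \int_{z(X_{U})} \mathcal{D}z_{m}\; e^{-S_{\textrm{rMD}}[X,z_{m}]} \cdot \delta\!\left[ \dot{z}_{m}(\tau) - \dot{z}(X)\,\theta(-\dot{z}(X))\,\theta(z_{m}(\tau)-z(X)) \right]. \tag{II.1} $$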

(Throughout this paper, the Heaviside functions are conventionally defined in order to satisfy $\theta(x)=1$ for $x=0$ . )

The expression (II.1) contains the path integral over an auxiliary time-dependent variable $z_{m}(\tau)$ . We note that the dynamics of such a variable is frozen any time $z_{m}$ becomes smaller than $z(X)$ and any time the collective coordinate $z(X)$ is increasing. Its time derivative is otherwise set equal to $\dot{z}(X)$ . Therefore, by choosing the initial conditions $z_{m}(0)=z(X(0))$ , $z_{m}(\tau)$ is identically set equal to the minimum value attained by the collective coordinate $z$ until time $\tau$ (see left panel of Fig.1).

The functional $S_{\small{rMD}}[X,z_{m}]$ in the exponent of Eq. (II.1) coincides with an OM action with the addition of the unphysical biasing force ${\bf F}_{i}$ :

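$$ S_{\textrm{rMD}}[X,z_{m}] \equiv \sum_{i=1}^{N} \Gamma_{i} \int_{0}^{t} d\tau\, \left( m_{i}\ddot{\bf x}_{i} + m_{i}\gamma_{i}\dot{\bf x}_{i} + \nabla_{i}U - {\bf F}_{i}(X,z_{m}) \right)^{2}. $$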

The contribution of the bias to the OM action exponentially enhances the weight of short folding pathways, making the folding probability $P^{\textrm{rMD}}_{U\to N}(t)\sim 1$ for time intervals $t$ much shorter than the inverse folding rate, $t\ll 1/k_{f}$. The price to pay for such computational efficiency is the introduction of uncontrolled systematic errors, which arise because the biasing coordinate $z$ may not be an optimal reaction coordinate. Furthermore, the structure of the biasing force explicitly breaks microscopic reversibility, making it impossible for rMD to directly access thermal equilibrium.

The systematic errors introduced by the rMD biasing scheme can be kept to a minimum by applying the variational principle which defines the BF approach BFA. Indeed, it was shown that the trajectories generated by rMD which have the largest probability to be realized in an unbiased Langevin simulation (i.e. for ${\bf F}_{i}=0$) are those with the least value of the so-called Bias Functional

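$$ T[X] = \sum_{i=1}^{N} \frac{1}{m_{i}\gamma_{i}} \int_{0}^{t} d\tau\, \left| {\bf F}_{i}[X(\tau)] \right|^{2}. $$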

Thus, in the BF approach, one generates many trial folding trajectories by rMD and then uses this variational scheme to identify the least biased pathway, which represents the variational prediction.
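The selection step can be sketched as follows, assuming the biasing forces were stored along each trial trajectory at a uniform time step. The time integral is approximated by a Riemann sum, and the names `bias_functional` and `least_biased` are hypothetical.

```python
import numpy as np

def bias_functional(forces, dt, mass, gamma):
    """T[X] = sum_i 1/(m_i gamma_i) * integral of |F_i|^2 over time.

    forces : array of shape (n_frames, n_atoms, dim), biasing force on each atom
    """
    forces = np.asarray(forces, dtype=float)
    weight = 1.0 / (np.asarray(mass, dtype=float) * np.asarray(gamma, dtype=float))
    sq_norm = np.sum(forces**2, axis=-1)             # |F_i|^2, shape (n_frames, n_atoms)
    return float(np.sum(sq_norm * weight[None, :]) * dt)

def least_biased(trajectory_forces, dt, mass, gamma):
    """Return the index of the trial trajectory with the smallest T[X], and all scores."""
    scores = [bias_functional(f, dt, mass, gamma) for f in trajectory_forces]
    return int(np.argmin(scores)), scores
```

A trajectory along which the ratchet never had to act has $T[X]=0$ and would always be preferred by this criterion.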

III Self Consistent Path Sampling

Let us now introduce our new algorithm, which provides a major improvement with respect to the rMD and BF schemes discussed in the previous section. Indeed, it follows directly from the unbiased Langevin equation and allows us to remove the systematic errors associated with the choice of biasing coordinate.

Our starting point is the path integral representation of the unbiased Langevin dynamics (2). We introduce two dummy auxiliary variables $w_{m}(\tau)$ and $s_{m}(\tau)$ into this path integral by means of appropriate functional Dirac deltas:

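$$ p(X_{N},t|X_{U}) = \int_{X_{U}}^{X_{N}} \mathcal{D}X\; e^{-S[X]} \int_{\bar{s}(0)} \mathcal{D}s_{m} \int_{\bar{w}(0)} \mathcal{D}w_{m} \cdot \delta\!\left[ \dot{w}_{m} - \dot{\bar{w}}\,\theta(-\dot{\bar{w}})\,\theta(w_{m}-\bar{w}) \right] \delta\!\left[ \dot{s}_{m} - \dot{\bar{s}}\,\theta(-\dot{\bar{s}})\,\theta(s_{m}-\bar{s}) \right], \tag{III} $$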

where $\bar{s}(\tau)$ and $\bar{w}(\tau)$ are two external time-dependent functions to be defined below. In analogy with the path integral representation of the rMD, the auxiliary variables $s_{m}(\tau)$ and $w_{m}(\tau)$ are identically equal to the minimum value attained by $\bar{s}$ and $\bar{w}$ , until time $\tau$ . On the other hand, we stress that the dynamics described by the path integral (III) is still unbiased, for any choice of $\bar{s}$ and $\bar{w}$ .

Let us now specialize, and define the external functions as follows

[TABLE]

Since $\bar{s}$ and $\bar{w}$ are never increasing, it follows that $s_{m}(\tau)=\bar{s}(\tau)$ and $w_{m}(\tau)=\bar{w}(\tau)$ for all times in the interval $\tau\in[0,t]$ (see right panel of Fig.1).

We now observe that Eq.s (12) and (13) can be equivalently written as follows

[TABLE]

where $s_{\lambda}$ and $w_{\lambda}$ are two functionals of the path $X(\tau)$ which depend also explicitly on time $\tau$ :

[TABLE]

In these expressions $C_{ij}(\tau)$ is the instantaneous $ij$ entry of the contact map matrix, Eq. (7). The symbol $||\ldots||$ denotes a normalized Frobenius-type distance:

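$$ ||C_{ij}(\tau) - C_{ij}(t^{\prime})||^{2} \equiv \frac{\sum_{|i-j|>35}^{N} \left[ C_{ij}[X(\tau)] - C_{ij}[X(t^{\prime})] \right]^{2}}{\sum_{|i-j|>35}^{N} \left[ C_{ij}^{0} \right]^{2}}, \tag{18} $$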

where $C^{0}_{ij}$ is the contact map calculated on the native structure of the protein. The identities (14) and (15) are explicitly proven in appendix A, by first discretizing the time integrals in Eqs. (16) and (17) and then noticing that, in the large $\lambda$ limit, the contribution of all time slices with $t^{\prime}\neq\tau$ is suppressed.

Using such equalities, the original conditional probability density (2) can be exactly re-written as the following limit:

$$ p(X_{N},t|X_{U}) = \lim_{\lambda\to\infty} p_{\lambda}(X_{N},t|X_{U}), \tag{19} $$

where

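$$ p_{\lambda}(X_{N},t|X_{U}) \equiv \int_{X_{U}}^{X_{N}} \mathcal{D}X \int \mathcal{D}s_{m} \int \mathcal{D}w_{m}\; e^{-S_{\lambda}[X,s_{m},w_{m}]}\, \delta\!\left[ \dot{s}_{m} - \dot{s}_{\lambda}\,\theta(-\dot{s}_{\lambda})\,\theta(s_{m}-s_{\lambda}) \right] \delta\!\left[ \dot{w}_{m} - \dot{w}_{\lambda}\,\theta(-\dot{w}_{\lambda})\,\theta(w_{m}-w_{\lambda}) \right]. \tag{20} $$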

Note that the exponent in the second equation contains a new functional $S_{\lambda}[X,s_{m},w_{m}]$ . This is defined in order to maintain the structure of an OM action, but includes two additional “forces” ${\bf F}^{w}_{i}$ and ${\bf F}^{s}_{i}$ :

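$$ S_{\lambda} \equiv \int_{0}^{t} d\tau \sum_{i=1}^{N} \Gamma_{i} \left[ m_{i}\ddot{\bf x}_{i} + m_{i}\gamma_{i}\dot{\bf x}_{i} + \nabla_{i}U - {\bf F}_{i}^{w} - {\bf F}_{i}^{s} \right]^{2}. \tag{21} $$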

These forces depend explicitly on $w_{m}$ and $s_{m}$ , respectively and implicitly on the instantaneous configuration $X$ , through the collective variables $s_{\lambda}$ and $w_{\lambda}$ :

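$$ {\bf F}_{i}^{w}(X,w_{m}) = -k_{w}\,\nabla_{i}w_{\lambda}\,(w_{\lambda}-w_{m})\,\theta(w_{\lambda}-w_{m}), \tag{22} $$

$$ {\bf F}_{i}^{s}(X,s_{m}) = -k_{s}\,\nabla_{i}s_{\lambda}\,(s_{\lambda}-s_{m})\,\theta(s_{\lambda}-s_{m}). \tag{23} $$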

Their definition closely resembles that of the rMD force, cf. Eq. (5). However, we recall that in the large $\lambda$ limit

$$ (w_{m}-w_{\lambda}) \to 0 \quad \textrm{and} \quad (s_{m}-s_{\lambda}) \to 0. \tag{24} $$

Thus, for sufficiently large $\lambda$ , the two forces ${\bf F}_{i}^{w}$ and ${\bf F}_{i}^{s}$ are in fact always negligible, and $S_{\lambda}[X,s_{m},w_{m}]$ reduces to the standard OM action $S[X]$ , proving the equivalence between Eq. (19) and the original Langevin conditional probability (2).

We now introduce our only approximation to the Langevin dynamics (1): it consists in replacing the instantaneous value of the contact map $C_{ij}[X(t^{\prime})]$ in the exponents in Eqs. (16) and (17) with the average value $\langle C_{ij}(t^{\prime})\rangle$,

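$$ \langle C_{ij}(t^{\prime})\rangle = \frac{\int_{X_{U}}^{X_{N}} \mathcal{D}X\; C_{ij}[X(t^{\prime})]\, e^{-S[X]}}{\int_{X_{U}}^{X_{N}} \mathcal{D}X\; e^{-S[X]}}, \tag{25} $$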

leading to

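$$ s_{\lambda}[X(\tau)] \simeq 1 - \frac{\frac{1}{t}\int_{0}^{t} dt^{\prime}\, t^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}}{\int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}}, \tag{26} $$

$$ w_{\lambda}[X(\tau)] \simeq w_{0} - \log \int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle||^{2}}. \tag{27} $$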

Within this approximation, $w_{\lambda}$ and $s_{\lambda}$ stop depending functionally on the entire path $X(\tau)$ and reduce to ordinary collective coordinates, i.e. functions of the instantaneous configuration $X(\tau)$. They represent a specific realization of the tube variables introduced in Ref. tubevar, and their geometric interpretation is illustrated in Fig. 2: $s_{\lambda}$ measures the progress of the reaction using as reference the self-consistently calculated average folding path, represented in contact map space. Similarly, $w_{\lambda}$ measures in the same space the distance of a configuration from the average folding pathway. We note that the original tube variables introduced in Ref. tubevar were defined in terms of a fixed external path and involved the Euclidean distance in configuration space, instead of a distance in contact map space. Using such a norm, however, would not allow us to define a computationally viable path sampling algorithm.
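A minimal sketch of how the two tube variables could be evaluated for a single configuration, given a reference (average) path stored on a uniform time grid. The time integrals are replaced by Riemann sums, the constant $w_{0}$ is set to zero, and the function name and argument layout are hypothetical; none of these choices is taken from the paper.

```python
import numpy as np

def tube_variables(c_now, c_ref, times, lam, norm):
    """Mean-field tube variables (s, w) of one configuration.

    c_now : flattened contact map of the current configuration, shape (n_entries,)
    c_ref : average contact maps along the reference path, shape (n_frames, n_entries)
    times : uniform time grid from 0 to t, shape (n_frames,)
    lam   : the large parameter lambda
    norm  : normalisation of the distance (sum of squared native contact-map entries)
    """
    d2 = np.sum((c_ref - c_now[None, :]) ** 2, axis=1) / norm
    expo = -lam * d2
    shift = expo.max()                    # log-sum-exp shift for numerical stability
    weights = np.exp(expo - shift)
    t_total = times[-1]
    # s: one minus the lambda-weighted mean time, in units of t (the time step cancels)
    s = 1.0 - np.sum(times * weights) / (t_total * np.sum(weights))
    # w: minus the log of the time integral, with w_0 = 0 (assumption)
    dt = times[1] - times[0]
    w = -(shift + np.log(np.sum(weights) * dt))
    return float(s), float(w)
```

For a configuration lying on the reference path at frame $t^{\prime}$, a large $\lambda$ gives $s\simeq 1-t^{\prime}/t$, while moving away from the path at fixed progress leaves $s$ essentially unchanged and increases $w$.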

After making the mean-field replacement $C_{ij}[X(\tau)]\to\langle C_{ij}(\tau)\rangle$, the auxiliary variables $s_{m}$ and $w_{m}$ are no longer identically equal to the collective variables $s_{\lambda}$ and $w_{\lambda}$, thus the forces ${\bf F}^{w}_{i}$ and ${\bf F}^{s}_{i}$ do not vanish (see Eqs. (22) and (23)). As a consequence, the path integral (19) defines a new type of rMD, with biasing forces setting in only when the collective coordinates $s_{\lambda}$ and $w_{\lambda}$ exceed their minimum value attained along the trajectory. However, unlike in the standard rMD, the biasing coordinates $s_{\lambda}$ and $w_{\lambda}$ are not arbitrarily defined a priori. Instead, they are determined self-consistently from the reactive pathways and encode the information on the average protein folding pathways in contact map space. In this sense, in this self-consistent type of rMD, the biasing forces act along a good reaction coordinate, thus removing the systematic uncertainties of the standard rMD.

The systematic errors introduced by the mean-field approximation $C_{ij}[X(\tau)]\to\langle C_{ij}(\tau)\rangle$ can be kept to a minimum by applying the variational principle of the BF approach: among the paths generated within this approximation, those with largest probability to occur in the absence of any bias (i.e. after completely relaxing the mean-field approximation) are the ones for which the functional

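$$ T[X] = \sum_{i=1}^{N} \Gamma_{i} \int_{0}^{t} d\tau\, \left| {\bf F}_{i}^{w}[X(\tau)] + {\bf F}_{i}^{s}[X(\tau)] \right|^{2} \tag{28} $$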

is least.

Based on this new approximate path integral representation of the Langevin dynamics, protein folding pathways can be sampled by means of the following iterative reaction path sampling algorithm, which we shall name Self-Consistent Path Sampling (SCPS):

1. An initial denatured condition $X_{U}$ is generated, for example through a thermal unfolding MD simulation started from the native structure $X_{N}$;

2. By running several standard rMD simulations starting from $X_{U}$, an ensemble of trial folding pathways reaching the native state within a given time interval $t$ is generated;

3. Using the trajectories generated in the previous step, the average contact map $\langle C_{ij}(\tau)\rangle$ is computed for many intermediate times using Eq. (25), and the collective variables $s_{\lambda}$ and $w_{\lambda}$ are obtained from Eqs. (16) and (17), using a large value of $\lambda$;

4. A new ensemble of trial folding pathways starting from the initial configuration $X_{U}$ is obtained by performing simulations in the new type of rMD, i.e. introducing the biasing forces ${\bf F}_{i}^{w}$ and ${\bf F}^{s}_{i}$, based on the collective coordinates $s_{\lambda}$ and $w_{\lambda}$ evaluated at Step 3;

5. Steps 3 and 4 are iterated until convergence is reached (a criterion to assess it is discussed in the next section);

6. The set of folding trajectories generated at the last iteration is scored according to the bias functional $T[X]$ using Eq. (28), and the least biased path is retained.

Repeating the calculation starting from different unfolded initial conditions $X_{U}^{1},\ldots,X_{U}^{N_{U}}$ leads to an ensemble of folding pathways.
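The control flow of the six steps listed above can be summarised in a short driver routine. This is only a skeleton under stated assumptions: every simulation and analysis stage is passed in as a hypothetical callable (`run_rmd`, `run_sc_rmd`, `mean_path`, `bias_functional`, `has_converged`), and the default values of `n_traj` and `max_iter` are arbitrary.

```python
import numpy as np

def scps(x_u, run_rmd, run_sc_rmd, mean_path, bias_functional, has_converged,
         n_traj=20, max_iter=10):
    """Skeleton of the SCPS iteration from one unfolded configuration x_u.

    run_rmd(x_u)            -> one standard rMD trial trajectory        (Step 2)
    mean_path(trajs)        -> reference path from an ensemble          (Step 3)
    run_sc_rmd(x_u, ref)    -> one trajectory biased along s and w      (Step 4)
    has_converged(old, new) -> True when the reference path is stable   (Step 5)
    bias_functional(traj)   -> value of T[X] for a trajectory           (Step 6)
    """
    trajs = [run_rmd(x_u) for _ in range(n_traj)]
    ref = mean_path(trajs)
    for _ in range(max_iter):
        trajs = [run_sc_rmd(x_u, ref) for _ in range(n_traj)]
        new_ref = mean_path(trajs)
        done = has_converged(ref, new_ref)
        ref = new_ref
        if done:
            break
    # Retain the least biased trajectory of the last iteration
    scores = [bias_functional(tr) for tr in trajs]
    return trajs[int(np.argmin(scores))], ref
```

Calling this routine once per unfolded configuration $X_{U}^{k}$ would yield the $N_{U}$ reference paths from which a global progress coordinate can be built.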

From the results of these $N_{U}$ independent calculations it is possible to define a global collective coordinate $\sigma(X)$ which measures the overall progress of the folding reaction. To this end, we combine the $N_{U}$ tube variables $s_{\lambda}^{1},\ldots,s_{\lambda}^{N_{U}}$ , calculated from the folding pathways started from different initial conditions (see right panel of Fig.1):

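$$ \sigma(X) \equiv \frac{1}{N_{U}\,t} \sum_{k=1}^{N_{U}} \frac{\int_{0}^{t} dt^{\prime}\, t^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle_{\lambda}^{k}||^{2}}}{\int_{0}^{t} dt^{\prime}\, e^{-\lambda ||C_{ij}[X(\tau)] - \langle C_{ij}(t^{\prime})\rangle_{\lambda}^{k}||^{2}}}. \tag{29} $$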

In this equation, $\langle C_{ij}(\tau)\rangle_{\lambda}^{k}$ is the average contact map in the calculation started from the initial condition $X_{U}^{k}$ .
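A sketch of the global coordinate, assuming the $N_{U}$ average paths are stored as arrays of flattened contact maps on a common uniform time grid, with the time integrals replaced by sums. The function name `sigma` and its arguments are hypothetical.

```python
import numpy as np

def sigma(c_now, ref_paths, times, lam, norm):
    """Global progress coordinate: lambda-weighted mean time, averaged over reference paths.

    c_now     : flattened contact map of the configuration, shape (n_entries,)
    ref_paths : list of N_U arrays of shape (n_frames, n_entries)
    times     : uniform time grid from 0 to t, shape (n_frames,)
    """
    t_total = times[-1]
    mean_times = []
    for c_ref in ref_paths:
        d2 = np.sum((c_ref - c_now[None, :]) ** 2, axis=1) / norm
        expo = -lam * d2
        weights = np.exp(expo - expo.max())     # the common shift cancels in the ratio
        mean_times.append(np.sum(times * weights) / np.sum(weights))
    return float(np.mean(mean_times) / t_total)
```

With this definition $\sigma$ grows from 0 to 1 as the configuration advances along the average folding paths, i.e. it increases with progress whereas each $s_{\lambda}^{k}$ decreases.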

IV Illustration and Validation

IV.1 Diffusion in a 2-dimensional asymmetric funnel

In order to illustrate how our self-consistent sampling scheme works, it is instructive to first apply it to a simple toy model. In particular, we study the diffusion of a particle on the two-dimensional energy surface introduced in the Supplementary Material (SM) of Ref. BFA and defined by

[TABLE]

Using the parameters reported in Table 1, this function generates the asymmetric funnelled energy landscape shown in the upper panel of Fig 3.

At low temperature (we choose $k_{B}T=0.3$) the transition across the barrier is thermally activated and the small size of the gate provides an entropic barrier. Consequently, all reactive trajectories generated by integrating the standard overdamped Langevin equation with $\gamma\Delta t=0.02$ and initiated from an initial condition in the external ring $(x_{i}=0,y_{i}=5)$ spend an exponentially long time before reaching the bottom of the funnel by passing through the asymmetric gate, as shown in the upper panel of Fig. 3.
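For reference, an unbiased overdamped Langevin trajectory of this kind can be generated with a simple Euler-Maruyama scheme. The sketch below is generic: the energy surface is supplied through a hypothetical callable `grad_u` (the funnel potential itself is not reproduced here), and `step` stands for the combination $\Delta t/(m\gamma)$, whose default value of 0.02 is an assumption about how the quoted integration parameter maps onto this scheme.

```python
import numpy as np

def overdamped_langevin(x0, grad_u, n_steps, step=0.02, kT=0.3, rng=None):
    """Euler-Maruyama integration of overdamped Langevin dynamics.

    x0     : initial position, e.g. [0.0, 5.0]
    grad_u : hypothetical callable returning the gradient of the potential at x
    step   : Delta t / (m gamma), in the units of the model
    Returns the trajectory as an array of shape (n_steps + 1, dim).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    traj = np.empty((n_steps + 1,) + x.shape)
    traj[0] = x
    noise_amp = np.sqrt(2.0 * kT * step)      # fluctuation-dissipation relation
    for n in range(n_steps):
        x = x - step * grad_u(x) + noise_amp * rng.standard_normal(x.shape)
        traj[n + 1] = x
    return traj
```

At zero temperature the noise term vanishes and the scheme reduces to steepest descent, which gives a quick sanity check of the drift term.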

Let us now discuss the results obtained simulating the same transition using the SCPS algorithm. We began by performing 1000 standard rMD simulations, using a harmonic biasing force with $k_{\textrm{R}}=70$ (see Eq. (5)) acting along the direction set by the Euclidean distance of the particle from the center of the funnel, $z_{\textrm{rMD}}(x,y)\equiv\sqrt{x^{2}+y^{2}}$. We emphasize that we deliberately chose to work in a worst-case scenario, i.e. we applied a very strong rMD bias (with strength comparable to that of the physical force) and used a very bad reaction coordinate, which ignores the existence of the gate through which the physical reaction pathways reach the bottom.

The results of these rMD simulations are shown in the right panel of Fig. 3. As expected, a significant fraction of the rMD reactive trajectories reaches the bottom of the funnel by directly crossing the barrier, thus providing a poor description of the reaction mechanism. However, the relative majority of such rMD trajectories still manages to find the gate. As a result, the average pathway in configuration space $\langle X(\tau)\rangle$ (which in this toy model plays the role of the average contact map $\langle C_{ij}(X)\rangle$) displays a small bend towards the direction of the gate.

In all subsequent self-consistent iterations, we performed rMD simulations with the two biasing forces

[TABLE]

where $k_{s}=3$ and $k_{w}=3$ and the tube variables $w_{\lambda}$ and $s_{\lambda}$ are calculated according to

[TABLE]

where $||\ldots||$ denotes the Euclidean norm in configuration space. We checked that, choosing $\lambda=0.3$ , the exponents in the definition of path variables are $\gg 1$ for most time frames.

The results shown in the lower panels of Fig. 3 illustrate how, already after the first iteration, the results are significantly improved with respect to plain rMD simulations. Indeed, all reactive trajectories reach the bottom by passing through the gate. The second iteration leads to results consistent with the previous one, indicating that convergence has been attained.

IV.2 Realistic all-atom protein folding simulations

Let us now assess the accuracy of the SCPS approach in a realistic protein folding simulation. In particular, we study the folding of Fip35, the WW protein domain shown in Fig. 4, which represents a standard benchmark for protein folding simulations. Indeed, for this system, ultra-long plain MD trajectories displaying several unfolding/refolding events have been made available by D. E. Shaw Research Antonfip35.

Force field: We used the AMBER99FS-ILDN force field Amber with the implicit solvent model implemented in GROMACS 4.6.5 GRO4 with PLUMED 2.0.2 PLUMED . In such an approach, the Born radii are calculated according to the Onufriev-Bashford-Case algorithm OBC . The hydrophobic tendency of non-polar residues is taken into account through an interaction term proportional to the atomic solvent accessible surface area. The solvent-exposed surface of the different atoms is calculated from the Born radii, according to the approximation developed by Schaefer, Bartels and Karplus in BornRadii .

SCPS Implementation: Details concerning the implementation of the steps of the SCPS algorithm introduced in the main text are given in order.

1. Generation of initial conditions: 5 independent initial conditions were generated via thermal unfolding, i.e. by running 100 ps of standard MD at T=800 K starting from the energy-minimized crystal structure.

2. Preliminary rMD simulations: From each denatured condition, 20 independent 500-ps long rMD trajectories were generated using a ratchet spring constant $k_{R}=10^{-4}$ kJ/mol. The temperature was set to T=350 K, which is a reasonable value for protein folding studies, see e.g. Antonfip35. The integration time step was set to $\Delta t=1$ fs and frames were saved every $0.5$ ps.

3. Self-consistent definition of collective variables: For each of the 5 initial conditions, the average atomic contact map $\langle C_{ij}(\tau)\rangle$ was computed every 7 ps, using the rMD trajectories which correctly reached the folded state, defined by a Root-Mean-Square-Deviation (RMSD) to the native state of less than 4 Å. We have checked that using a much larger number of time frames does not significantly alter the results, yet considerably increases the computational cost of the calculation. The $s_{\lambda}$ and $w_{\lambda}$ collective variables defined by

[TABLE]

were calculated from the average contact maps $\langle C_{ij}(t)\rangle$, using $\lambda=13.5$. If much smaller values of $\lambda$ are used, then many time frames simultaneously contribute to the time integral, signalling that the large $\lambda$ condition is not fulfilled. Conversely, much larger values of $\lambda$ lead to lower computational efficiency.

4. Self-consistent rMD simulations: For each initial condition, 20 folding pathways were generated using the biasing forces:

[TABLE]

where $k_{s}=2.5$ kJ/mol, $k_{w}=2.5\times 10^{-4}$ kJ/mol, while $w_{m}(\tau)$ and $s_{m}(\tau)$ denote the minimum value attained by the collective coordinates $w_{\lambda}$ and $s_{\lambda}$ until time $\tau$ (see discussion in the main text).

5. Iteration: Steps 3 and 4 have been repeated for two iterations in 4 out of the 5 independent simulations (corresponding to different initial conditions) and for three iterations in the remaining simulation.

6. Variational correction: For each independent simulation, we selected the minimally biased trajectory among the ensemble of folding pathways generated at the last iteration by ranking them according to their bias functional

[TABLE]
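The ratchet logic used in the self-consistent rMD step can be sketched as follows. The running minima $s_{m}(\tau)$ and $w_{m}(\tau)$ are simply cumulative minima of the collective variables along a trajectory. The harmonic penalty in `ratchet_energy` is an assumed functional form (the usual ratchet-and-pawl choice), since the biasing forces are not reproduced here, and the sample values of `s_traj` are invented for illustration.

```python
import numpy as np

def running_minimum(cv):
    """Minimum value attained by a collective variable up to each frame."""
    return np.minimum.accumulate(np.asarray(cv, dtype=float))

def ratchet_energy(cv, k):
    """Harmonic ratchet penalty per frame (ASSUMED functional form).

    The penalty vanishes whenever the variable sits at its running minimum,
    i.e. while the system is progressing, and grows quadratically with the
    amount of backtracking otherwise.
    """
    cv = np.asarray(cv, dtype=float)
    excess = cv - running_minimum(cv)
    return 0.5 * k * excess ** 2

# Hypothetical values of s along a short trajectory.
s_traj = [1.0, 0.8, 0.9, 0.7, 0.75]
s_min = running_minimum(s_traj)          # 1.0, 0.8, 0.8, 0.7, 0.7
bias = ratchet_energy(s_traj, k=2.5)     # non-zero only at frames 2 and 4
```

In this picture the bias does no work on trajectories that spontaneously progress along the tube, which is why ranking the pathways by their accumulated bias singles out the least perturbed one.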

Convergence analysis: The density plots shown in Fig. 5 illustrate the evolution of the ensemble of folding pathways generated at different SCPS iterations, for all 5 initial conditions. In order to assess the convergence of the iterative algorithm, we need to quantify how the folding pathways change from one iteration to the next. To this goal, we have devised the following heuristic procedure. Let $p^{(I)}(x,y)$ be a density distribution representing how many times, in the folding trajectories generated at the $I$-th iteration, the RMSD to native of the two hairpins assumed values $x$ and $y$ respectively. In particular, in order to smear out local fluctuations, we have divided the plane identified by the RMSD to native of the two hairpins into $20\times 20$ cells of length 0.8 Å. In the following, the index pair $ij$ is used to label the different cells, thus $p^{(I)}(x,y)\to p^{(I)}_{ij}$.

To identify the region visited on this plane by the folding pathways at the $I$-th iteration, we consider the binary matrix:

[TABLE]

which we further normalized to unit Frobenius norm. Finally, to compare $M_{ij}$ matrices obtained at different iterations $I$ and $J$ , we computed their Frobenius distance,

[TABLE]

When convergence has not yet been attained, we expect $D\left(M^{(I)},M^{(I+1)}\right)$ to decrease strongly when adding a new iteration, i.e. $D\left(M^{(I)},M^{(I+1)}\right)\gg D\left(M^{(I+1)},M^{(I+2)}\right)$. In contrast, when convergence has been reached, we expect $D\left(M^{(I)},M^{(I+1)}\right)\sim D\left(M^{(I+1)},M^{(I+2)}\right)$. Note that some difference between results obtained at different iterations is expected to persist even after many iterations, due to the intrinsic stochastic character of folding trajectories. We considered the self-consistent iterative procedure to have converged if $D\left(M^{(I)},M^{(I+1)}\right)$ varies by less than $10\%$ from one iteration to the next.
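The heuristic above can be condensed into a few lines of NumPy. This is a minimal sketch under stated assumptions: the binary matrix is taken to be 1 in every cell visited at least once (the exact threshold is not reproduced here), the grid starts at 0 Å, and the function names and the sample RMSD values are hypothetical.

```python
import numpy as np

def visited_region(x, y, n_cells=20, cell=0.8):
    """Binary map of visited cells, normalised to unit Frobenius norm.

    x, y: RMSD to native (in Angstrom) of hairpin 1 and hairpin 2 over all
    frames of all trajectories of one iteration. Values outside the grid
    (here above 16 Angstrom) are ignored by the histogram.
    """
    edges = np.arange(n_cells + 1) * cell
    counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    m = (counts > 0).astype(float)   # ASSUMPTION: a single visit marks a cell
    norm = np.linalg.norm(m)         # Frobenius norm for a 2-D array
    return m / norm if norm > 0 else m

def frobenius_distance(m_a, m_b):
    """Frobenius distance between two normalised region matrices."""
    return np.linalg.norm(m_a - m_b)

def is_converged(d_prev, d_next, tol=0.10):
    """True if the distance changes by less than tol (relative) between iterations."""
    return abs(d_next - d_prev) <= tol * d_prev

# Hypothetical data: two iterations visiting the same cells, one visiting another.
m1 = visited_region([0.1, 1.0], [0.1, 1.0])
m2 = visited_region([0.2, 1.1], [0.3, 0.9])
m3 = visited_region([5.0], [5.0])
d_same = frobenius_distance(m1, m2)
d_disjoint = frobenius_distance(m1, m3)
```

Because each matrix has unit norm, the distance ranges from 0 (identical visited regions) to $\sqrt{2}$ (disjoint regions), which gives a natural scale for the $10\%$ criterion.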

The plot in Fig. 6 shows the behaviour of $D\left(M^{(I)},M^{(I+1)}\right)$ for all the $5$ initial conditions employed in the self-consistent calculation. From these results it is possible to infer that convergence has been reached in the simulations corresponding to the initial conditions 1, 2, 3 and 5, while it has not yet been completely attained in the simulation associated with the initial condition 4. Note also that reaching convergence starting from the initial condition 3 required an additional iteration with respect to conditions 1, 2 and 5.

Comparison with MD results: According to the results of plain MD simulations, in the main folding pathway of this protein, hairpin 1 is completely formed before hairpin 2 begins to fold. A less frequent alternate route is one in which the folding of the two hairpins occurs in reversed order krivov; Pande. These two folding pathways are evident also in Fig. 4, which reports the folding pathways generated by MD, projected onto the plane defined by the Root-Mean-Square-Deviation (RMSD) from their native structure of the two hairpins (dashed lines). The 5 folding pathways calculated with the SCPS algorithm are shown as solid lines and display the same behavior, qualitatively demonstrating the accuracy of this algorithm after only a few self-consistent iterations.

In order to quantify the degree of agreement between the results of plain MD simulations and our SCPS simulations, we adopted the path similarity analysis developed in BFA . A matrix $\hat{M}$ is defined in order to describe the order in which the native contacts are formed rMD2 . Namely, let $i,j$ be two indexes running over all native contacts between $C_{\alpha}$ atoms, and let $t_{i}(k)$ and $t_{j}(k)$ be the times at which they are formed, i.e.

[TABLE]

A quantitative measure of the difference in the folding mechanisms followed by two given trajectories $k$ and $k^{\prime}$ is provided by their path similarity $s(k,k^{\prime})$ , defined as

[TABLE]

Notice that $s(k,k^{\prime})=1$ if all native contacts are formed in the same order in $k$ and in $k^{\prime}$ , and is [math] if they are formed in a completely different order. For comparison, we note that if $k$ and $k^{\prime}$ are two random sequences of native contact formation, then $s(k,k^{\prime})\sim 1/3$ .
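A compact way to see how such a similarity behaves is the sketch below. Since the defining equations are not reproduced here, it assumes a common form of the analysis: the matrix entry for a pair of contacts is 1 if contact $i$ forms before contact $j$, 0 if it forms after, and 1/2 if the two form at the same time, and $s(k,k^{\prime})$ is the fraction of ordered pairs $i\neq j$ on which the two matrices agree. The function names and the formation times are hypothetical.

```python
import numpy as np

def order_matrix(times):
    """Pairwise order-of-formation matrix for one trajectory (ASSUMED form).

    Entry (i, j) is 1 if contact i forms before contact j, 0 if it forms
    later, and 0.5 if both form at the same time.
    """
    t = np.asarray(times, dtype=float)
    diff = t[:, None] - t[None, :]           # diff[i, j] = t_i - t_j
    return np.where(diff < 0, 1.0, np.where(diff > 0, 0.0, 0.5))

def path_similarity(times_k, times_kp):
    """Fraction of contact pairs (i != j) formed in the same order in both trajectories."""
    m_k = order_matrix(times_k)
    m_kp = order_matrix(times_kp)
    n = len(m_k)
    off_diagonal = ~np.eye(n, dtype=bool)
    agree = (m_k == m_kp) & off_diagonal
    return agree.sum() / (n * (n - 1))

# Hypothetical formation times (arbitrary units) of three native contacts.
s_same = path_similarity([1, 2, 3], [10, 20, 30])   # same order
s_reversed = path_similarity([1, 2, 3], [3, 2, 1])  # fully reversed order
s_partial = path_similarity([1, 2, 3], [1, 3, 2])   # one pair swapped
```

Only the ordering of the formation times enters, so trajectories of different duration can be compared directly.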

We first computed the self-similarity distribution, i.e. the distribution of values of $s(k,k^{\prime})$, where both $k$ and $k^{\prime}$ run within the ensemble of MD folding trajectories. This step is required in order to quantify the intrinsic degree of heterogeneity of the folding mechanism. Next, we computed the cross-similarity between the MD and the SCPS folding pathways, i.e. $s(k,k^{\prime})$ with $k$ and $k^{\prime}$ running over MD and SCPS trajectories, respectively. The overlap of the two distributions shown in Fig. 7 indicates that the average difference between the folding mechanisms obtained with the two methods lies within the intrinsic statistical fluctuations. Therefore, we can conclude that the two methods give completely consistent folding mechanisms.

It is interesting to note that the self-consistent iterations significantly improve on the initial trial guess for reactive pathways, based on standard rMD. Indeed, in the first trial simulations based on rMD, a significant fraction of trajectories follow a folding mechanism in which the two hairpins form simultaneously, i.e. drawing lines close to the diagonal in the plane defined by the RMSD to native of the two hairpins. In the subsequent iterations, however, the statistical weight of such highly cooperative pathways is suppressed, leaving only pathways in which the hairpins form in order ("$L$"- and "$\Gamma$"-shaped lines), thus improving the agreement with plain MD simulations.

V Conclusions

Using the path integral representation of the Langevin dynamics, we have explicitly shown that protein folding pathways can be directly sampled through a self-consistently defined type of ratchet-and-pawl MD.

Unlike other enhanced path sampling methods, the SCPS algorithm yields results which, at convergence, do not depend on any model-dependent choice of collective coordinate. Instead, the algorithm yields a rigorous stochastic estimate of the reaction coordinate.

We have assessed the accuracy of our algorithm by simulating the folding of a WW protein domain using a state-of-the-art atomistic force field, showing that it yields a folding mechanism completely consistent with the one obtained by means of ultra-long plain MD simulations on the Anton supercomputer.

An important note concerns the computational efficiency of this method. Each SCPS iteration requires computing a few tens of very short ($<1$ ns long) trial trajectories. The computational cost of this scheme is therefore only a few times larger than that of the BF approach, which has been used in the past to simulate large and complex folding processes with small computer clusters.

Finally, we emphasize that even though in this work we chose to focus on the prototypical protein folding pathway problem, the SCPS algorithm may be applied to a much larger class of conformational reactions, for which structural information on the product state is experimentally available.

Acknowledgements.

The idea of developing a self-consistent formulation of rMD arose during a joint discussion with G. Tiana and C. Camilloni, who suggested considering tube variables. We also thank DE Shaw Research for making available their MD trajectories for Fip35 folding. Part of the calculations were performed on Tier-0 Marconi at the CINECA supercomputing facility.

Appendix A A Mathematical Identity

In the following we provide the explicit proof of the identities (14) and (15).

We begin by discretising the time integration in Eqs. (16) and (17), with a time step $\Delta t=t/N$. Labelling the intermediate times with $\tau=l\Delta t$ we obtain

[TABLE]

where we redefined the initial condition as $w^{\prime}_{0}=w_{0}-\frac{1}{\lambda}\log\Delta t$. In the large $\lambda$ limit, all terms in these sums with $l\neq k$ are suppressed, thus

[TABLE]

After restoring the continuous notation, we recover the definitions (12) and (13) thus proving the identities (14) and (15).
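The suppression of all but the dominant term at large $\lambda$ can be checked numerically. The sketch below assumes that the discretised tube variable has the log-sum form $-\frac{1}{\lambda}\log\sum_{l}\Delta t\,e^{-\lambda d_{l}}$, which is the form suggested by the shift $-\frac{1}{\lambda}\log\Delta t$ in the redefined initial condition. The function name `soft_min` and the distances in `d2` are illustrative.

```python
import numpy as np

def soft_min(d2, lam, dt=1.0):
    """-(1/lam) * log( sum_l dt * exp(-lam * d2_l) ), evaluated stably.

    ASSUMPTION: this log-sum form stands in for the discretised time
    integral; d2 holds the squared distances to each time frame.
    """
    d2 = np.asarray(d2, dtype=float)
    d2_min = d2.min()
    # Factor out the dominant term so the exponentials never overflow.
    total = dt * np.sum(np.exp(-lam * (d2 - d2_min)))
    return d2_min - np.log(total) / lam

# Hypothetical squared distances to four frames of the reference path.
d2 = [0.9, 0.2, 0.5, 1.4]
values = {lam: soft_min(d2, lam) for lam in (1.0, 10.0, 100.0)}
```

As `lam` grows, `values[lam]` approaches the smallest entry of `d2`: the sum collapses onto the single closest frame, which is the content of the large-$\lambda$ step in the proof.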

Bibliography

(1) S.W. Englander and L. Mayne, Proc. Natl. Acad. Sci. USA 111, 15873 (2014).
(2) K.A. Dill, S.B. Ozkan, M.S. Shell, and T.R. Weikl, Annu. Rev. Biophys. 37, 289 (2008).
(3) K. Lindorff-Larsen, S. Piana-Agostinetti, R.O. Dror, and D.E. Shaw, Science 334, 517 (2011).
(4) P.G. Bolhuis, D. Chandler, C. Dellago, and P.L. Geissler, Annu. Rev. Phys. Chem. 53, 291 (2002).
(5) J.M. Bello-Rivas and R. Elber, J. Chem. Phys. 142, 094102 (2015).
(6) G.R. Bowman, V.J. Pande, and F. Noé, Advances in Experimental Medicine and Biology, vol. 797, Springer (2013).
(7) A. Barducci, G. Bussi, and M. Parrinello, Phys. Rev. Lett. 100, 020603 (2008).
(8) L. Maragliano, A. Fischer, E. Vanden Eijnden, and G. Ciccotti, J. Chem. Phys. 125, 024106 (2006).