Optimal reaction coordinates and kinetic rates from the projected dynamics of transition paths

Line Mouaffac; Karen Palacio-Rodriguez; Fabio Pietrucci

arXiv:2302.12497·cond-mat.stat-mech·October 30, 2023

TL;DR

This paper presents an algorithm that automatically identifies optimal reaction coordinates and accurately predicts kinetic rates from limited transition path data, improving molecular simulation analysis.

Contribution

The authors introduce a Monte Carlo-based method that generates a sequence of reaction coordinates to find the optimal one minimizing the projected kinetic rate.

Findings

01

The method accurately approximates kinetic rates in a double-well system.

02

It was successfully applied to complex atomistic systems involving carbon nanoparticles in water.

Abstract

Finding optimal reaction coordinates and predicting accurate kinetic rates for activated processes are two of the foremost challenges of molecular simulations. We introduce an algorithm that tackles the two problems at once: starting from a limited number of reactive molecular dynamics trajectories (transition paths), we automatically generate with a Monte Carlo approach a sequence of different reaction coordinates that progressively reduce the kinetic rate of their projected effective dynamics. Based on a variational principle, the minimal rate accurately approximates the exact one, and it corresponds to the optimal reaction coordinate. After benchmarking the method on an analytic double-well system, we apply it to complex atomistic systems: the interaction of carbon nanoparticles of different sizes in water.

Equations

$$k_{A\to B}\leq\tilde{k}_{\tilde{A}\to\tilde{B}}$$

$$\dot{q}=-\beta D(q)\frac{\partial F(q)}{\partial q}+\frac{\partial D(q)}{\partial q}+\sqrt{2D(q)}\,\eta(t)$$

$$k^{-1}=\int_{q_{0}}^{b}dx\,\frac{e^{\beta F(x)}}{D(x)}\int_{a}^{x}dy\,e^{-\beta F(y)}$$

$$p(q^{\prime},t+\tau\mid q,t)\approx\frac{1}{\sqrt{2\pi\mu}}\,e^{-(q^{\prime}-q-\phi)^{2}/2\mu}$$

$$\phi=a\tau+\frac{1}{2}(aa^{\prime}+Da^{\prime\prime})\tau^{2},\qquad\mu=2D\tau+(aD^{\prime}+2a^{\prime}D+DD^{\prime\prime})\tau^{2}$$

$$-\log\mathcal{L}(\theta)=\sum_{k=1}^{M-1}\left\{\frac{1}{2}\log[2\pi\mu_{k}(\tau)]+\frac{[q_{k+1}-q_{k}-\phi_{k}(\tau)]^{2}}{2\mu_{k}(\tau)}\right\}$$

$$P=\min\left[1\,,\,\left(\frac{k_{\text{old}}}{k_{\text{new}}}\right)^{\alpha}\times\frac{\tau^{\text{noise}}_{\text{old}}}{\tau^{\text{noise}}_{\text{new}}}\right]$$

$$G_{k}=\frac{q_{k+1}-q_{k}+[\beta D(q_{k})F^{\prime}(q_{k})-D^{\prime}(q_{k})]\tau}{\sqrt{2D(q_{k})\tau}}$$

$$F(x,y)=-C\left[e^{-\frac{(x-x_{0})^{2}}{2\sigma_{x}^{2}}}e^{-\frac{(y-y_{0})^{2}}{2\sigma_{y}^{2}}}+e^{-\frac{(x-x_{1})^{2}}{2\sigma_{x}^{2}}}e^{-\frac{(y-y_{1})^{2}}{2\sigma_{y}^{2}}}\right]$$

$$cc=\sum_{i\in S_{1}}\sum_{j\in S_{2}}C_{ij}\,,\qquad C_{ij}=\frac{1-\left(\frac{r_{ij}}{r_{0}}\right)^{n}}{1-\left(\frac{r_{ij}}{r_{0}}\right)^{m}}$$

$$sc=-2\pi\rho k_{B}\int_{0}^{r_{\mathrm{max}}}[g(r)\ln g(r)-g(r)+1]\,r^{2}\,dr$$

$$g(r)=\frac{1}{4\pi N\rho r^{2}}\sum_{i\neq j}\frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-(r-r_{ij})^{2}/(2\sigma^{2})}$$

$$e^{-\beta F(q)}=\int dx\int dy\,e^{-\beta F(x,y)}\,\delta[q-q(x,y)]$$

Taxonomy

Topics: Machine Learning in Materials Science · Advanced Chemical Physics Studies · Molecular Junctions and Nanostructures

Full text

Optimal reaction coordinates and kinetic rates from the projected dynamics of transition paths

Line Mouaffac

Sorbonne Université, Musée National d’Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Materiaux et de Cosmochimie, IMPMC, F-75005 Paris, France

Karen Palacio-Rodriguez

Sorbonne Université, Musée National d’Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Materiaux et de Cosmochimie, IMPMC, F-75005 Paris, France

Fabio Pietrucci

Sorbonne Université, Musée National d’Histoire Naturelle, UMR CNRS 7590, Institut de Minéralogie, de Physique des Materiaux et de Cosmochimie, IMPMC, F-75005 Paris, France

1 Introduction

Physico-chemical transformations such as phase transitions, chemical reactions, and biomolecular conformational changes are characterized by metastable states separated by free-energy barriers, so that transitions between states are rare events. Atomistic computer simulations of rare events – especially molecular dynamics (MD) – can play an important role, complementary to experiments, by predicting mechanisms, free-energy landscapes, and kinetic rates.

However, MD simulations face two prominent challenges. First, a gap – often of many orders of magnitude – between the short time scale of atomic motion (femtoseconds) and the long time scale of rare events hampers direct, brute-force simulations. Second, the high-dimensional nature of configuration space makes the analysis of the atomistic trajectories and the extraction of relevant information intrinsically difficult. [1, 2]

To overcome or at least alleviate these challenges and gain insights into the transformation processes, including quantitative thermodynamic and kinetic properties, it is necessary to reduce the dimensionality of the problem by projecting on an appropriate low-dimensional space. To this aim, collective variables (CVs) – generic functions of atomic coordinates able to track interesting structural changes – are often introduced, heuristically or by machine learning. [3]

Among all possible CVs describing a transition, the optimal CV, or “reaction coordinate” (RC), is widely considered to be the so-called committor function: in MD simulations, this function is defined as the probability $p_{B}(x)$ that a system will reach state $B$ before state $A$ starting from atomic positions $x$ in $3N$-dimensional configuration space, with initial momenta randomly drawn from the equilibrium distribution [4, 5, 6, 7, 8, 9, 10].

The committor is considered optimal since i) it allows predicting the fate of atomic configurations towards reactants or products, and ii) it preserves the kinetics when employed to build a reduced model of the dynamics (vide infra). The concept dates back to the 1930’s, when Onsager studied the recombination of a pair of ions in the presence of a uniform electric field [11].

The committor can be used to design tests for the “quality” of an RC: for instance, a good RC $q$ is expected to display a distribution of values $p_{B}(x)$ sharply peaked at 0.5 when considering a set of different configurations $x$ corresponding to the same location $q(x)=q^{*}$ in RC space, i.e., precisely identifying the transition state configurations. [4, 12, 13]

Some remarks are in order. Given a generic CV $q$, the corresponding free energy landscape is always well-defined mathematically, via the equilibrium density $F(q)=-k_{B}T\log\rho_{\mathrm{eq}}(q)$: if the CV is able to resolve (at least partially) reactants and products, such $F(q)$ will display a barrier. However, different CVs can contain different amounts of information about the transition mechanism. Moreover, the height of the barrier will depend on the particular choice of the CV [14, 15] (see Figure 1).

The barrier along the optimal RC is supposed to contain more useful information from the viewpoint of the calculation of the rate, keeping in mind that any invertible transformation of the optimal RC results again in an optimal RC, since such transformation cannot produce or destroy microscopic information: in the case of a non-linear transformation, the free-energy barrier and the diffusion coefficient are different for the two RCs [16, 14], the differences compensating each other in such a way that the projected dynamics keeps the same kinetic properties.

It is therefore important to distinguish the problem of converging the calculation of the free energy barrier along any given CV from a statistical viewpoint (a sampling problem) from the problem of identifying among all CVs the optimal RC and computing the corresponding barrier, for the purpose of estimating the rate.

Several methods, based on different theoretical principles, have been put forward to estimate optimal RCs from MD data. Given the customary identification of the optimal RC with the committor function, a possibility consists in optimizing a parametrization of the committor function based on $p_{B}(x)$ values estimated from transition path sampling (TPS), exploiting, e.g., the histogram test [13], maximum-likelihood approaches [17, 9], or artificial intelligence [18, 19].

We remark, however, that precisely estimating committor values very close to zero or to one by shooting requires an unfeasibly large number of trajectories: in the presence of large free-energy barriers, this hampers the accurate reconstruction of the optimal RC away from the vicinity of the barrier top. This problem is addressed in the reweighted path ensemble approach [20].

Several approaches have been developed to generate flexible RC representations able to capture slow degrees of freedom connected to rare events [21, 22, 23, 24, 25, 26, 27]. Starting from the seminal work of Ma and Dinner [18], where a neural network was trained with committor data, recent years saw a flourishing of machine-learning approaches to RC optimization. [28, 29, 30, 31, 32]

In this work we address with a single methodology several prominent tasks of molecular simulations: identifying an optimal RC, estimating the free energy landscape and the diffusion along such RC, and estimating the rate of the process.

Results in the literature indicate that i) a model of the projected dynamics along the optimal RC gives more direct access to accurate kinetic rates [13, 8, 9], and ii) the accurate rate is the minimal one – attained for the optimal RC – with respect to the rates computed from the projected dynamics along each of all possible CVs [10]. The method we present exploits these principles: putative CVs are generated in a Monte Carlo scheme, and the kinetic rates of associated effective low-dimensional models are estimated and systematically minimized, leading to optimal RCs and accurate kinetic predictions. An important ingredient of our approach is the automatic construction of stochastic (Langevin) models of CV evolution: for this purpose, we adopt the maximum likelihood algorithm proposed in Ref. 33, based on short TPS-like MD trajectories. We remark that such a task is non-trivial from a numerical viewpoint [34], and that alternative approaches exist [35, 36, 37, 38].

2 Theoretical methods

2.1 Variational principle: the optimal coordinate provides the minimal rate

We start by considering ergodic diffusive dynamics (as described by overdamped Langevin equations) in a high-dimensional space, for a system with two metastable states $A$ and $B$. The transition rate from state $A$ to $B$ is denoted by $k_{A\rightarrow B}$. Next, we consider the effective dynamics resulting from the projection of the full high-dimensional dynamics onto a low-dimensional manifold of collective variables (CVs): such effective dynamics is an approximation of the original one, formulated again in terms of diffusive equations [39, 10]. The reaction rate of the effective dynamics between states $\tilde{A}$ and $\tilde{B}$, defined in the low-dimensional space, is $\tilde{k}_{\tilde{A}\rightarrow\tilde{B}}$. Zhang et al. [10] proved that the transition rate of the full dynamics is always smaller than or equal to the one computed using the effective dynamics. In other words, the optimal reaction coordinate yields a minimal rate,

$$k_{A\rightarrow B}\leq\tilde{k}_{\tilde{A}\rightarrow\tilde{B}} \tag{1}$$

The optimal projection, preserving the value of the rate constant $k_{A\rightarrow B}=\tilde{k}_{\tilde{A}\rightarrow\tilde{B}}$ , is achieved when the CV corresponds to the committor function or any invertible function of the latter.

As a complementary result, Ref. [40] formally proved that it is possible to obtain accurate rates from an overdamped Langevin model when projecting high-dimensional underdamped Langevin dynamics on a single CV, the committor, irrespective of the existence of timescale separation and metastability in the system. Considering that typical MD simulations, where Hamilton’s equations are coupled with a thermostat, are akin to high-dimensional underdamped Langevin equations, the previous result should hold also in conventional MD.

Our approach consists in optimizing a RC defined as a function of several trial CVs by minimizing the kinetic rate of the projected dynamics with respect to all possible definitions of the RC. Here we consider the projection on a one-dimensional CV, assuming that the variational principle connecting the optimal CV to the minimal rate holds also when the input data is represented by MD simulations with a thermostat. The projected dynamics is approximated by an overdamped Langevin equation, consistently with Refs. [10, 40]:

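$$\dot{q}=-\beta D(q)\frac{\partial F(q)}{\partial q}+\frac{\partial D(q)}{\partial q}+\sqrt{2D(q)}\,\eta(t), \tag{2}$$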

with $D(q)$ the diffusion coefficient, $F(q)$ the free energy profile and $\eta(t)$ a Gaussian white noise of zero mean and unit variance. The use of this stochastic model has several appealing features: the only parameters of the equation consist in the functions $F(q)$ and $D(q)$ , the mass and velocities are not explicitly needed, the Markovian character allows for simple likelihood expressions, and the mean first passage time (MFPT) of the $A\rightarrow B$ transition, inverse of the kinetic rate $k$ , can be directly obtained by numerical integration [41, 14]:

$$k^{-1}=\int_{q_{0}}^{b}dx\,\frac{e^{\beta F(x)}}{D(x)}\int_{a}^{x}dy\,e^{-\beta F(y)} \tag{3}$$

where $q_{0}$ is the starting point in $A$ for the MFPT computation, while $a$ and $b$ are respectively the reflecting and absorbing boundaries.
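The double integral for the MFPT can be evaluated on a grid with nested trapezoidal sums. The following is a minimal sketch, not the authors' implementation: the function name `mfpt_overdamped` and the convention that $F$ is tabulated in units of $k_{B}T$ (so that $\beta=1$) are assumptions made for illustration, and the grid is assumed to contain the points $a$, $q_{0}$ and $b$.

```python
import numpy as np

def mfpt_overdamped(q, F, D, q0, a, b):
    """MFPT from q0 to the absorbing boundary b, with a reflecting boundary at a.

    q, F, D are 1D arrays on the same increasing grid; F is in units of k_B T
    (assumption: beta = 1). Returns the MFPT in the time units of 1/D.
    """
    q, F, D = (np.asarray(v, dtype=float) for v in (q, F, D))
    mask = (q >= a) & (q <= b)
    x, Fx, Dx = q[mask], F[mask], D[mask]
    dx = np.diff(x)
    # Inner integral I(x) = int_a^x exp(-F(y)) dy, as a cumulative trapezoid sum.
    w = np.exp(-Fx)
    inner = np.concatenate(([0.0], np.cumsum(0.5 * (w[1:] + w[:-1]) * dx)))
    # Outer integrand exp(F(x)) / D(x) * I(x), integrated from q0 to b.
    outer = np.exp(Fx) / Dx * inner
    sel = x >= q0 - 1e-12  # small tolerance so a grid point at q0 is kept
    xs, ys = x[sel], outer[sel]
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))

# Flat landscape with constant D: the MFPT is [(b-a)^2 - (q0-a)^2] / (2 D).
grid = np.linspace(0.0, 1.0, 1001)
t = mfpt_overdamped(grid, np.zeros_like(grid), np.full_like(grid, 0.5),
                    q0=0.2, a=0.0, b=1.0)
rate = 1.0 / t
```

The flat-landscape case is a convenient sanity check because the trapezoidal rule is exact for the resulting linear integrand.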

2.2 Algorithm for reaction coordinate optimization

The phase-space information used for RC optimization is represented by MD trajectories spanning transitions between the reactants and products regions of configuration space. We adopt TPS as a source of input data for several reasons: [12, 13, 9] i) ergodic trajectories (spontaneously spanning the transitions) are unfeasible in the presence of free-energy barriers $\gg k_{B}T$, while TPS has an affordable cost for many systems; ii) TPS trajectories are free from biasing forces and, at statistical convergence, faithfully reproduce ergodic trajectories; iii) recent numerical evidence indicates that $\sim 100$ TPS-like MD trajectories projected on a suitable CV are sufficient to reconstruct accurate free energies and rates by means of overdamped Langevin models [33].

The proposed RC optimization algorithm starts from the projection of TPS trajectories on a basis set of $N$ potentially relevant CVs ${\bf Q}(t)=(Q_{1}(t),Q_{2}(t),\ldots,Q_{N}(t))$. All these coordinates are put on the same footing by shifting and scaling so that for each of them the range of variation on the actual MD trajectories is between 0 and 1 (oriented so that $A\rightarrow B$ for growing values). Note that such a linear transformation of the coordinate does not deform the corresponding free energy landscape besides an irrelevant additive constant: passing from $x$ to $y=ax+b$ the probability densities transform according to $e^{-\beta[F(x)-\tilde{F}(y)]}=\rho(x)/\tilde{\rho}(y)=dy/dx=a$. Moreover, the new diffusion coefficient is scaled, $\tilde{D}_{y}/D_{x}=(dy/dx)^{2}=a^{2}$ [42], as can be easily seen in the simple case of a driftless constant-$D$ diffusion process: $\langle[x(t)-x(0)]^{2}\rangle=2D_{x}t$ implies that $\langle[y(t)-y(0)]^{2}\rangle=2(a^{2}D_{x})t=2\tilde{D}_{y}t$.

A trial RC $q=\sum_{i=1}^{N}w_{i}\,Q_{i}\equiv{\bf w}\cdot{\bf Q}$, a linear combination of the basis-set CVs, is generated from random weights normalized as ${\bf w}^{2}=1$. A Monte Carlo loop is then started, at each step proposing a new RC $q_{\text{new}}$ obtained by modifying the weights of the last accepted step through small random variations ${\bf w}\rightarrow{\bf w}+\delta{\bf w}$ (each $\delta w_{i}$ being drawn from a uniform distribution on $[-0.05,0.05]$).
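The rescaling of the basis CVs and the generation of trial weights can be sketched as follows. This is an illustrative sketch only: the function names, the `flip` argument used to orient each CV so that it grows from $A$ to $B$, and the choice of drawing the initial weights uniformly in $[-1,1]$ before normalization are assumptions not specified in the text.

```python
import numpy as np

def rescale_cvs(Q, flip=None):
    """Map each column of Q (n_frames x N) linearly onto [0, 1].

    flip: optional boolean sequence; True reverses a CV so that it increases
    from A to B (hypothetical argument, orientation is decided by the user).
    """
    Q = np.asarray(Q, dtype=float)
    lo, hi = Q.min(axis=0), Q.max(axis=0)
    S = (Q - lo) / (hi - lo)
    if flip is not None:
        flip = np.asarray(flip, dtype=bool)
        S[:, flip] = 1.0 - S[:, flip]
    return S

def random_unit_weights(n, rng):
    """Random weight vector normalized so that w . w = 1 (assumed uniform draw)."""
    w = rng.uniform(-1.0, 1.0, size=n)
    return w / np.linalg.norm(w)

def propose_weights(w, rng, step=0.05):
    """Trial move w -> w + dw, each dw_i uniform on [-step, step]."""
    return w + rng.uniform(-step, step, size=w.shape)

rng = np.random.default_rng(1)
Q = rng.normal(size=(50, 3))          # placeholder data standing in for projected CVs
S = rescale_cvs(Q, flip=[False, True, False])
w = random_unit_weights(3, rng)
q_trial = S @ w                        # trial RC along the trajectory
w_new = propose_weights(w, rng)
```

Whether the weights are renormalized after each move is not stated in the text, so the sketch leaves the perturbed vector as is.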

A newly proposed RC is accepted or rejected based on a Metropolis criterion aimed at minimizing the kinetic rate estimated from a maximum-likelihood Langevin model. The latter is optimized following the method in Ref. [33]: for a sufficiently small time interval $\tau$ , the transition probability density $p$ (propagator) between points $q$ and $q^{\prime}$ in CV-space can be approximated as [43]

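$$p(q^{\prime},t+\tau\mid q,t)\approx\frac{1}{\sqrt{2\pi\mu}}\,e^{-(q^{\prime}-q-\phi)^{2}/2\mu} \tag{4}$$

$$\phi=a\tau+\frac{1}{2}(aa^{\prime}+Da^{\prime\prime})\tau^{2},\qquad\mu=2D\tau+(aD^{\prime}+2a^{\prime}D+DD^{\prime\prime})\tau^{2} \tag{5}$$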

where the prime indicates $d/dq$ , $a=-\beta DF^{\prime}+D^{\prime}$ is the drift in Eq. 2, and the approximation includes terms up to order $\tau^{2}$ . We verified that resorting to the less accurate first-order propagator $\phi=a\tau=(-\beta DF^{\prime}+D^{\prime})\tau$ , $\mu=2D\tau$ gives unsatisfactory results for the systems considered in this work.

Based on Eq. 5, the likelihood of the MD trajectory, sampled with a time resolution $\tau$ and projected on a CV $q$ is given by

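$$-\log\mathcal{L}(\theta)=\sum_{k=1}^{M-1}\left\{\frac{1}{2}\log[2\pi\mu_{k}(\tau)]+\frac{[q_{k+1}-q_{k}-\phi_{k}(\tau)]^{2}}{2\mu_{k}(\tau)}\right\} \tag{6}$$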

where $\theta$ represents the set of all parameters of the Langevin equation (i.e., the profiles $F(q)$ and $D(q)$ ), $M$ is the number of trajectory configurations, and $\phi_{k}(\tau)\equiv\phi(q_{k},\tau)$ , $\mu_{k}(\tau)\equiv\mu(q_{k},\tau)$ . The optimal Langevin model for the given $q$ and $\tau$ , obtained by minimizing $-\log\mathcal{L}(\theta)$ as a function of the parameters with an iterative stochastic algorithm (see Ref. [33] for details), directly yields the free energy and diffusion profiles.
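As an illustration of how the second-order propagator enters the likelihood, the sketch below evaluates the mean displacement $\phi$, the variance $\mu$, and the negative log-likelihood for a single projected trajectory. The function names are hypothetical, the drift $a$ and diffusion $D$ (with their first and second derivatives) are assumed to be already evaluated at the points $q_{k}$, and the stochastic optimization over $F(q)$ and $D(q)$ described in Ref. [33] is not reproduced here.

```python
import numpy as np

def propagator_moments(a, a1, a2, D, D1, D2, tau):
    """Second-order (in tau) mean displacement phi and variance mu.

    a, a1, a2: drift and its first/second derivatives with respect to q;
    D, D1, D2: diffusion coefficient and its first/second derivatives.
    Scalars or arrays evaluated at the trajectory points q_k.
    """
    phi = a * tau + 0.5 * (a * a1 + D * a2) * tau**2
    mu = 2.0 * D * tau + (a * D1 + 2.0 * a1 * D + D * D2) * tau**2
    return phi, mu

def neg_log_likelihood(q, phi, mu):
    """-log L for one trajectory q_1..q_M; phi, mu have length M-1 (or are scalars)."""
    dq = np.diff(np.asarray(q, dtype=float))
    return float(np.sum(0.5 * np.log(2.0 * np.pi * mu) + (dq - phi) ** 2 / (2.0 * mu)))

# Driftless, constant-D example: phi = 0 and mu = 2 D tau.
phi0, mu0 = propagator_moments(a=0.0, a1=0.0, a2=0.0, D=1.0, D1=0.0, D2=0.0, tau=0.5)
nll = neg_log_likelihood([0.0, 1.0, 1.0], phi0, mu0)
```

With several input trajectories, one natural choice (an assumption here) is to sum the per-trajectory values, so that no displacement is computed across the junction between two trajectories.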

The kinetic rate is computed from the Langevin model using Eq. 3, with integral boundaries defined in the following way: $a$ is the smallest observed value of $q$ (for each CV the transition $A\rightarrow B$ is in the direction of increasing $q$ values), while $q_{0}$ and $b$ are the average final values of $q$ for shooting trajectories ending in $A$ and $B$ , respectively, i.e., the bottom of the corresponding free-energy minima.

Having obtained the rate corresponding to a given proposed CV, a Metropolis-like test is employed to accept or reject the CV based on the following expression for the probability:

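$$P=\min\left[1\,,\,\left(\frac{k_{\text{old}}}{k_{\text{new}}}\right)^{\alpha}\times\frac{\tau^{\text{noise}}_{\text{old}}}{\tau^{\text{noise}}_{\text{new}}}\right] \tag{7}$$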

This expression tends to minimize the rate (with $\alpha$ playing the role of an adjustable inverse Monte Carlo temperature, as discussed in the Results) and simultaneously – coherently with the hypothesis of a Markovian Langevin-like behavior – the memory of the stochastic process. The latter is quantified by $\tau^{\text{noise}}$, the auto-correlation time of the “observed” noise trajectory $G_{k}$ estimated from the $q_{k}$ trajectory via the optimal Langevin model:

$$G_{k}=\frac{q_{k+1}-q_{k}+[\beta D(q_{k})F^{\prime}(q_{k})-D^{\prime}(q_{k})]\tau}{\sqrt{2D(q_{k})\tau}} \tag{8}$$

(see Ref. [33] for details).
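The acceptance test and the effective noise can be written in a few lines. The sketch below is illustrative: the function names are hypothetical, the noise is assumed to be normalized by $\sqrt{2D\tau}$ so that it has unit variance for an ideal overdamped process, and the estimator of the noise auto-correlation time (described in Ref. [33]) is not included.

```python
import numpy as np

def acceptance_probability(k_old, k_new, tau_noise_old, tau_noise_new, alpha):
    """Metropolis-like probability favoring lower rates and shorter noise memory."""
    return min(1.0, (k_old / k_new) ** alpha * tau_noise_old / tau_noise_new)

def effective_noise(q, beta_D_Fprime, Dprime, D, tau):
    """Noise G_k: displacement minus drift, divided by sqrt(2 D tau).

    beta_D_Fprime, Dprime, D: values of beta*D*F', D' and D at q_1..q_{M-1}
    (arrays of length M-1 or scalars), taken from the fitted Langevin model.
    """
    dq = np.diff(np.asarray(q, dtype=float))
    return (dq + (beta_D_Fprime - Dprime) * tau) / np.sqrt(2.0 * D * tau)

# A proposal doubling the rate at equal noise memory is accepted with P = 2^-alpha.
p_worse = acceptance_probability(1.0, 2.0, 1.0, 1.0, alpha=3.0)
# A proposal lowering the rate is always accepted.
p_better = acceptance_probability(2.0, 1.0, 1.0, 1.0, alpha=3.0)
G = effective_noise([0.0, 1.0, 3.0], beta_D_Fprime=0.0, Dprime=0.0, D=0.5, tau=1.0)
```

Larger values of `alpha` make the test stricter against rate increases, which matches the role of an inverse Monte Carlo temperature described in the text.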

To summarize, the algorithm performs the following steps:

1. Project TPS trajectories relaxing from a barrier top onto a basis set of CVs $\{Q_{i}\}_{i=1,\ldots,N}$

2. Construct a first trial RC $q$ as a sum of the basis CVs with random weights

3. Generate a new CV $q_{\text{new}}$ by randomly modifying the weights

4. Estimate the free energy and diffusion profiles as well as the $A\rightarrow B$ rate from an optimal Langevin model of the observed trajectory $q_{\text{new}}(t)$ using likelihood maximization

5. Accept or reject the new CV with the probability in Eq. 7, aimed at minimizing the rate and enforcing Markovianity

6. Go to step 3 (iterate until convergence of the rate)
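The steps above can be sketched as a short Monte Carlo loop. This is a schematic outline under stated assumptions, not the authors' code: `evaluate` is a hypothetical user-supplied callable standing in for step 4 (the maximum-likelihood Langevin fit followed by the MFPT integral) and must return the pair (rate, noise auto-correlation time) for a given projected trajectory; the initial weights are assumed to be drawn uniformly before normalization.

```python
import numpy as np

def optimize_rc(Q, evaluate, n_steps, alpha, rng, step=0.05):
    """Monte Carlo minimization of the projected rate over linear combinations of CVs.

    Q: array (n_frames, N) of rescaled basis CVs.
    evaluate: hypothetical callable q -> (rate, tau_noise).
    Returns the list of (weights, rate) for the current state at every step.
    """
    n = Q.shape[1]
    w = rng.uniform(-1.0, 1.0, size=n)
    w /= np.linalg.norm(w)                      # step 2: random normalized weights
    k, t_noise = evaluate(Q @ w)
    history = [(w.copy(), k)]
    for _ in range(n_steps):
        w_new = w + rng.uniform(-step, step, size=n)         # step 3
        k_new, t_new = evaluate(Q @ w_new)                   # step 4
        p = min(1.0, (k / k_new) ** alpha * t_noise / t_new) # step 5
        if rng.random() < p:
            w, k, t_noise = w_new, k_new, t_new
        history.append((w.copy(), k))                        # step 6: iterate
    return history

# Toy stand-in for the Langevin fit: the "rate" is smallest when q equals Q[:, 0].
rng = np.random.default_rng(0)
Q = rng.uniform(size=(200, 2))

def toy_evaluate(q):
    return 1e-3 + float(np.mean((q - Q[:, 0]) ** 2)), 1.0

history = optimize_rc(Q, toy_evaluate, n_steps=200, alpha=3.0, rng=rng)
```

The toy evaluator only exercises the control flow; in a real application each call to `evaluate` is by far the most expensive part of the loop.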

At convergence, the algorithm provides an optimal RC, as well as the corresponding free-energy and diffusion profiles and kinetic rate, all starting solely from a set of $\sim 100$ TPS-like short MD trajectories. If the basis set contains all the relevant degrees of freedom for the transition process, the optimal rate should approach the exact rate.

2.3 Analytic double-well potential

A two-dimensional double-well free-energy surface is defined, for benchmark purposes, as the sum of two Gaussian-shaped wells:

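$$F(x,y)=-C\left[e^{-\frac{(x-x_{0})^{2}}{2\sigma_{x}^{2}}}e^{-\frac{(y-y_{0})^{2}}{2\sigma_{y}^{2}}}+e^{-\frac{(x-x_{1})^{2}}{2\sigma_{x}^{2}}}e^{-\frac{(y-y_{1})^{2}}{2\sigma_{y}^{2}}}\right] \tag{9}$$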

with $C=20\,k_{B}T$ , Gaussian centers $(x_{0},y_{0})$ = $(0.1723,0.5058)$ , $(x_{1},y_{1})$ = $(0.8060,0.5058)$ , and widths $\sigma_{x}^{2}=0.03921$ , $\sigma_{y}^{2}=0.2519$ .

We consider the region $x\in[0,1]$ and $y\in[0,1]$, see Fig. 1a. The diffusion matrix was set to $\mathbf{D}=0.015\cdot\big(\begin{smallmatrix}1&0\\ 0&1\end{smallmatrix}\big)$ ps$^{-1}$.
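For concreteness, the benchmark surface and one integration step of the overdamped dynamics can be written as below. This is a minimal sketch with energies in units of $k_{B}T$ (so $\beta=1$); the Euler–Maruyama integrator and the function names are assumptions, since the integrator and the treatment of the boundaries of the unit square are not specified in the text.

```python
import numpy as np

C = 20.0                       # well depth in k_B T
X0, Y0 = 0.1723, 0.5058        # center of the left well (state A)
X1, Y1 = 0.8060, 0.5058        # center of the right well (state B)
SX2, SY2 = 0.03921, 0.2519     # squared Gaussian widths
D0 = 0.015                     # isotropic diffusion coefficient, ps^-1

def _gaussians(x, y):
    g0 = np.exp(-(x - X0) ** 2 / (2 * SX2) - (y - Y0) ** 2 / (2 * SY2))
    g1 = np.exp(-(x - X1) ** 2 / (2 * SX2) - (y - Y1) ** 2 / (2 * SY2))
    return g0, g1

def free_energy(x, y):
    """Double-well surface F(x, y) in k_B T."""
    g0, g1 = _gaussians(x, y)
    return -C * (g0 + g1)

def gradient(x, y):
    """Analytic gradient (dF/dx, dF/dy)."""
    g0, g1 = _gaussians(x, y)
    dfdx = C * (g0 * (x - X0) + g1 * (x - X1)) / SX2
    dfdy = C * (g0 * (y - Y0) + g1 * (y - Y1)) / SY2
    return dfdx, dfdy

def euler_maruyama_step(x, y, dt, rng):
    """One overdamped Langevin step with constant D (no boundary handling)."""
    dfdx, dfdy = gradient(x, y)
    noise = np.sqrt(2.0 * D0 * dt) * rng.standard_normal(2)
    return x - D0 * dfdx * dt + noise[0], y - D0 * dfdy * dt + noise[1]

rng = np.random.default_rng(0)
x_new, y_new = euler_maruyama_step(0.5 * (X0 + X1), Y0, dt=1e-4, rng=rng)
```

Because the diffusion matrix is constant, the $\partial D/\partial q$ term of the Langevin equation vanishes and only the force and noise terms remain.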

The reference MFPT was estimated directly from brute-force trajectories. For this purpose, 10 long overdamped Langevin simulations with a time step of 0.1 fs were performed. For the aggregate simulation time (10 $\mathrm{\mu}$s), 16 039 jumps from state $A$ to $B$ were observed, yielding a MFPT of $5600\pm 40$ ps and a transition rate $k_{A\rightarrow B}=(1.79\pm 0.01)\cdot 10^{-4}$ ps$^{-1}$.

Input trajectories for RC optimization were obtained by shooting 100 overdamped Langevin trajectories from the barrier top. From these trajectories, 51 relaxed to state $A$ (left well) and 49 relaxed to state $B$ (right well). The basis set CVs, in this case, are simply $x$ and $y$ , yielding two one-dimensional projections $x(t)$ and $y(t)$ of the 100 relaxing trajectories.

2.4 Fullerene dimers in water

The association and dissociation of C60 and C240 fullerene dimers (OPLS-AA force-field [44]) in water solution (SPC force-field [45]) have been simulated by MD using GROMACS v2019.4 [46, 47] patched with PLUMED 2.5.3 [48]. The simulations are similar to those in Ref. 33: for more computational details than what is summarized here, we refer to the latter article.

In the first system, two C60 molecules were solvated by 2398 water molecules in a simulation box of $3.607^{3}$ nm$^{3}$ with periodic boundary conditions. In the second system, two C240 molecules were solvated by 5375 water molecules in a simulation box of $5.22^{3}$ nm$^{3}$. MD simulations were performed with a time step of 1 fs in the $NPT$ ensemble at 298 K [49] (thermostat time constant = 1 ps) and 1 atm [50] (barostat time constant = 4 ps).

The reference free-energy profiles of the association/dissociation of the C60 fullerenes in water as a function of the 8 basis CVs were computed from unbiased simulations of 1 $\mathrm{\mu}$ s total aggregated time, from the population histogram: $F(Q_{i})=-k_{B}T\log\rho_{\mathrm{eq}}(Q_{i})$ . Error bars were estimated as the standard deviation of the mean over 5 independent replicas.
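The histogram estimate $F(Q_{i})=-k_{B}T\log\rho_{\mathrm{eq}}(Q_{i})$ can be sketched as follows. The function name, the number of bins, and the shift that sets the minimum of the profile to zero are illustrative choices, and the synthetic Gaussian samples merely stand in for an unbiased CV time series.

```python
import numpy as np

def free_energy_from_samples(samples, bins=50, value_range=None):
    """Free-energy profile in k_B T from the population histogram of a CV."""
    hist, edges = np.histogram(samples, bins=bins, range=value_range, density=True)
    centers = 0.5 * (edges[1:] + edges[:-1])
    with np.errstate(divide="ignore"):
        F = -np.log(hist)              # empty bins give +inf
    F -= F[np.isfinite(F)].min()       # shift so that the minimum is zero
    return centers, F

# Synthetic stand-in for an equilibrium CV trajectory (harmonic well).
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
centers, F = free_energy_from_samples(samples, bins=50, value_range=(-3.0, 3.0))
```

Repeating the estimate on independent replicas and taking the standard deviation of the mean per bin would give error bars of the kind quoted in the text.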

We used as input data for RC optimization a set of 100 aimless shooting [51, 17] trajectories of 20 ps, generated using the script publicly available at https://github.com/physix-repo/aimless-shooting, employing a separation of 0.1 ps between successive shooting points.

For the C60 dimer we define the dissociated state based on the distance between centers of mass as $d\geq 1.34$ nm and the associated state as $d\leq 1.17$ nm; for the C240 dimer we define the dissociated and associated states as $d\geq 2.01$ nm and $d\leq 1.9$ nm, respectively. The reference rate constants for the dissociation of the C60 and C240 dimers from the associated state were estimated using the reactive flux formalism over 1000 aimless shooting trajectories (see Ref. 33 for details), obtaining for C60 a MFPT of $6.1\pm 1.2$ ns and for C240 a MFPT of $9.4\pm 1.1$ $\mathrm{\mu}$ s.

The basis set CVs employed for projecting the fullerene dimers trajectories are the following:

1. $d$: the distance between the centers of mass of the two fullerene molecules.

2. $cc$: the number of carbon-carbon contacts, estimated by summing continuous coordination functions between any atom of the first fullerene (set $S_{1}$) and any atom of the second fullerene (set $S_{2}$)

$$cc=\sum_{i\in S_{1}}\sum_{j\in S_{2}}C_{ij}\,,\qquad C_{ij}=\frac{1-\left(\frac{r_{ij}}{r_{0}}\right)^{n}}{1-\left(\frac{r_{ij}}{r_{0}}\right)^{m}} \tag{10}$$

where $r_{ij}$ is the distance between atoms $i$ and $j$, with parameters $r_{0}=0.35$ nm, $n=6$ and $m=10$.

3. $c2w$: the number of carbon-water contacts, defined similarly to $cc$ in Eq. 10, in this case with $r_{0}=0.6$ nm, set $S_{1}$ including all carbon atoms of the two fullerene molecules, and set $S_{2}$ including all oxygen atoms of the water molecules.

4. $c1w$: the water-carbon contacts for a single fullerene molecule, defined as $c2w$ except for the inclusion of a single fullerene in set $S_{1}$.

5. $sc$: the approximate carbon pair entropy, estimated using the expression [52]

$$sc=-2\pi\rho k_{B}\int_{0}^{r_{\mathrm{max}}}[g(r)\ln g(r)-g(r)+1]\,r^{2}\,dr, \tag{11}$$

where $\rho$ is the density, $r_{\mathrm{max}}$ is a cutoff set to 0.65 nm, and $g(r)$ is the pair distribution function of carbon atoms, estimated via Gaussian kernels as

$$g(r)=\frac{1}{4\pi N\rho r^{2}}\sum_{i\neq j}\frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-(r-r_{ij})^{2}/(2\sigma^{2})}, \tag{12}$$

where $N$ is the number of carbon atoms and $\sigma=0.01$ nm.

6. $sw$: the approximate water pair entropy, estimated with the same equations and parameters as $sc$ applied to oxygen atoms.

7. $ucc$: the van der Waals carbon-carbon potential energy, computed over all carbon pairs.

8. $ucw$: the van der Waals carbon-water potential energy, computed between all the carbon atoms and all the solvent atoms.
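As an example of how one of these basis CVs is evaluated, the contact number $cc$ can be computed with a rational switching function as sketched below. The function name is hypothetical, periodic boundary conditions are ignored for brevity (an assumption that would need to be lifted for the solvated systems), and the removable singularity at $r_{ij}=r_{0}$ is replaced by its limit $n/m$.

```python
import numpy as np

def contact_number(pos1, pos2, r0=0.35, n=6, m=10):
    """Sum of switching functions C_ij over all pairs (i in pos1, j in pos2).

    pos1, pos2: arrays of shape (n_atoms, 3) in nm. No periodic images.
    """
    pos1 = np.asarray(pos1, dtype=float)
    pos2 = np.asarray(pos2, dtype=float)
    d = np.linalg.norm(pos1[:, None, :] - pos2[None, :, :], axis=-1)
    x = d / r0
    near_one = np.isclose(x, 1.0)
    safe = np.where(near_one, 0.5, x)            # avoid 0/0 at x = 1
    c = (1.0 - safe**n) / (1.0 - safe**m)
    c = np.where(near_one, n / m, c)             # limit of the ratio at x = 1
    return float(c.sum())

# A single pair at half the reference distance contributes almost a full contact.
cc_close = contact_number([[0.0, 0.0, 0.0]], [[0.175, 0.0, 0.0]])
# A pair at ten times the reference distance contributes almost nothing.
cc_far = contact_number([[0.0, 0.0, 0.0]], [[3.5, 0.0, 0.0]])
```

The carbon-water variants $c2w$ and $c1w$ follow by changing the atom selections and setting `r0=0.6`.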

In the Results section, for each putative RC in the optimization algorithm, the likelihood of a Langevin model was maximized employing $3\cdot 10^{6}$ and $2\cdot 10^{6}$ iteration steps for the $C_{60}$ and $C_{240}$ fullerenes, respectively, adopting a time resolution $\tau=1$ ps for the projected MD trajectory. This time resolution was verified to be sufficient to yield time-decorrelated noise when using the dimer center-of-mass distance $d$ as CV.

3 Results and discussion

3.1 Two-dimensional double well

To test the validity of the algorithm developed, we first benchmark it on an analytic model: the double-well system detailed in the Methods section. By construction the dynamics is of overdamped-Langevin nature, the trajectories being obtained by integration of such stochastic differential equation over the two-dimensional $F(x,y)$ landscape in Figure 1 with a constant isotropic diffusion matrix.

After projecting the trajectories on a one-dimensional CV, several non-trivial effects can be anticipated from theory. Figure 1 illustrates for instance three simple choices of the CV, corresponding to the $x$ axis, the $y$ axis, or the 45° diagonal.

Langevin trajectories shown in Figure 1(b) clearly resolve two distinct end-states when adopting $q\equiv x$ , while the separation becomes less clear upon deviation of $q$ from the $x$ axis until vanishing for $q\equiv y$ .

The exact one-dimensional $F(q)$ profile obtained by projecting on a generic CV $q(x,y)$ can be computed from probability marginalization as

$$e^{-\beta F(q)}=\int dx\int dy\,e^{-\beta F(x,y)}\,\delta[q-q(x,y)] \tag{13}$$

Figure 1(c) shows that CV changes result in significant differences in the one-dimensional free-energy profiles. In particular, the barrier is maximal for $q\equiv x$, it is strongly reduced for the diagonal, and it vanishes for $q\equiv y$; analogously, the distance between the two minima is largest for $q\equiv x$ and it vanishes for $q\equiv y$. These observations support the intuitive idea that $x$ is the optimal RC for this problem, fully capturing the progress of the transition mechanism. Moreover, the reduced barriers for alternative CVs – simply the effect of overlapping contributions of the two wells – suggest faster rates for the corresponding effective dynamics. The effective one-dimensional diffusion coefficient $D(q)$ (Figure 1(c)) is also affected by projection, increasing when $q$ deviates from $x$; however, its effect on the rate is linear and therefore less important than that of barrier variations, which have an exponential effect.
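The marginalization onto a linear CV $q=x\cos\theta+y\sin\theta$ can be approximated numerically by histogramming Boltzmann weights on a grid. The sketch below is illustrative: the parametrization by the angle `theta`, the grid and bin sizes, and the function names are assumptions, and for oblique directions the binning is only approximate.

```python
import numpy as np

C, SX2, SY2 = 20.0, 0.03921, 0.2519
X0, X1, Y0 = 0.1723, 0.8060, 0.5058

def double_well(x, y):
    """Benchmark surface in k_B T (same parameters as in the Methods)."""
    gy = np.exp(-(y - Y0) ** 2 / (2 * SY2))
    gx = np.exp(-(x - X0) ** 2 / (2 * SX2)) + np.exp(-(x - X1) ** 2 / (2 * SX2))
    return -C * gx * gy

def projected_free_energy(theta, n_grid=200, n_bins=50):
    """F(q) in k_B T for q = x cos(theta) + y sin(theta) on the unit square."""
    pts = (np.arange(n_grid) + 0.5) / n_grid          # cell-centered grid
    X, Y = np.meshgrid(pts, pts, indexing="ij")
    weights = np.exp(-double_well(X, Y))              # Boltzmann weights, beta = 1
    q = np.cos(theta) * X + np.sin(theta) * Y
    hist, edges = np.histogram(q.ravel(), bins=n_bins,
                               weights=weights.ravel(), density=True)
    centers = 0.5 * (edges[1:] + edges[:-1])
    with np.errstate(divide="ignore"):
        F = -np.log(hist)
    return centers, F - F[np.isfinite(F)].min()

qx, Fx = projected_free_energy(0.0)            # projection on x
qy, Fy = projected_free_energy(np.pi / 2)      # projection on y
i_well = np.argmin(np.abs(qx - X0))
i_top = np.argmin(np.abs(qx - 0.5 * (X0 + X1)))
barrier_x = Fx[i_top] - Fx[i_well]
```

Along $x$ the profile keeps a barrier of several $k_{B}T$ between the wells, whereas along $y$ the two wells overlap into a single minimum, in line with the discussion above.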

Starting from $x$ and $y$ as basis CVs, 10 independent runs of the RC optimization algorithm are performed, aimed at minimizing the rate of the effective projected model (at time resolution $\tau=0.01$ ps), each starting with a different random linear combination as initial CV (see Figure 2).

The parameter $\alpha$ (effective Monte Carlo inverse temperature in Eq. 7) is varied from 2 to 4: in all tested cases the initial rate decreases almost monotonically, often by several orders of magnitude, until reaching an equilibrium behavior with small fluctuations in a small interval. Low $\alpha$ values increase the acceptance of putative CVs that increase the rate, whereas high $\alpha$ values decrease such acceptance, causing the stochastic optimization to converge more rapidly towards a minimal rate, with smaller equilibrium fluctuations. $\alpha$ therefore plays the role of an adjustable parameter providing some control over the convergence of the rate minimization algorithm.

Irrespective of the initial putative CV, the algorithm successfully and consistently retrieves an optimal combination including at least $95\%$ of $x$. Given that after the initial relaxation a stationary distribution of equilibrated $k$ values is reached, with a finite width that depends on the parameter $\alpha$, introducing a degree of fuzziness, different criteria could be envisaged to identify the optimal rate values and the corresponding optimal RCs. In the case of the double well system, identifying the predicted rate with the average value of the stationary distribution and its uncertainty with the standard deviation yields $(2.6\pm 1.0)\cdot 10^{-4}$ ps$^{-1}$ for $\alpha=3$, in good agreement with the reference brute-force rate of $1.8\cdot 10^{-4}$ ps$^{-1}$.

However, the results for the solvated fullerene dimers (see next section), a realistic complex system, indicate the possibility of a few spurious outliers at the lowest $k$ values, resulting from imperfect optimization of the Langevin model for a few putative CVs. This suggests an alternative criterion to identify the optimal rate predicted by the algorithm, as the 5th percentile in the stationary distribution, i.e., the rate below which 5% of the rates are found in the distribution. This criterion predicts (for $\alpha=3$) a 5th percentile value of $1.6\cdot 10^{-4}$ ps$^{-1}$: taking as optimal prediction the average over the 5 rates closest to such value for each of the 10 independent optimization runs, we get a predicted rate of $(1.60\pm 0.04)\cdot 10^{-4}$ ps$^{-1}$, again in good agreement with the reference.
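Both criteria are simple statistics of the equilibrated rate series. The sketch below shows them for a single run; the function name and the returned dictionary are illustrative, the input is assumed to contain only the stationary part of the series, and the synthetic integers merely make the arithmetic easy to follow.

```python
import numpy as np

def summarize_rates(rates, percentile=5.0, n_closest=5):
    """Mean/std criterion and low-percentile criterion for one optimization run."""
    rates = np.asarray(rates, dtype=float)
    p = np.percentile(rates, percentile)
    # Rates closest to the percentile value, averaged to give the prediction.
    closest = rates[np.argsort(np.abs(rates - p))[:n_closest]]
    return {
        "mean": float(rates.mean()),
        "std": float(rates.std(ddof=1)),
        "percentile": float(p),
        "closest_mean": float(closest.mean()),
    }

# Synthetic stand-in for an equilibrated series of rates.
summary = summarize_rates(np.arange(1, 101))
```

With several independent runs, the per-run `closest_mean` values can then be averaged, and their spread used as an uncertainty estimate.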

We remark that, even though the original dynamics in $(x,y)$-space is Markovian, non-Markovian effects could appear in the effective dynamics when projecting on CVs different from the committor, requiring in principle an underdamped (or generalized) Langevin description [40]. Therefore, the term penalizing deviations from Markovianity in Eq. 7 plays, in general, an important role, helping to reject proposed CVs that would introduce significant memory effects and render the overdamped model inappropriate. Moreover, as shown in Ref. 33, from the viewpoint of the inference of Langevin models, non-Markovian behavior appears to be correlated with barrier overestimation and thus rate underestimation, a spurious effect that must be avoided to correctly exploit the variational principle connecting the minimal rate to the optimal RC.

Our simulations show that for all values of $\alpha$ the vast majority of the CVs accepted during the optimization process have an effective noise free from time-correlation (Figure 2), with occasional exceptions, confirming the ability of the algorithm to enforce Markovianity.
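
One simple way to quantify time-correlation in a noise series is to count how many consecutive lags show a statistically significant autocorrelation. The sketch below is an assumption-laden illustration: the function name `correlated_steps`, the white-noise significance band $z/\sqrt{N}$, and the test signals are all hypothetical, and the extraction of the effective noise from the Langevin model is taken as given.

```python
import numpy as np

def correlated_steps(noise, max_lag=50, z=3.0):
    """Count consecutive lags (starting at lag 1) whose normalized
    autocorrelation exceeds z / sqrt(N) in absolute value.

    The z / sqrt(N) band is an assumed white-noise significance threshold.
    """
    x = np.asarray(noise, dtype=float)
    x = x - x.mean()
    n = x.size
    var = np.dot(x, x) / n
    threshold = z / np.sqrt(n)
    count = 0
    for lag in range(1, max_lag + 1):
        acf = np.dot(x[:-lag], x[lag:]) / (n * var)
        if abs(acf) <= threshold:
            break
        count += 1
    return count

# Illustration: uncorrelated noise versus a noise series with memory.
rng = np.random.default_rng(1)
white = rng.standard_normal(20000)
ar1 = np.empty_like(white)
ar1[0] = white[0]
for t in range(1, white.size):
    ar1[t] = 0.9 * ar1[t - 1] + white[t]  # AR(1) process, correlated in time
print(correlated_steps(white), correlated_steps(ar1))
```

A count near zero indicates noise compatible with a Markovian overdamped description, whereas a large count signals memory effects of the kind the acceptance rule is meant to penalize.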

3.2 Interaction between fullerene nanoparticles in water

We applied the RC optimization algorithm to the dissociation of fullerene dimers in water solution, for both C60 and C240. These systems are rather complex, including thousands of atoms. They feature an associated state, corresponding to a free-energy minimum, in which the carbon nanoparticles are in close contact without the mediation of water molecules, and a dissociated state in which each fullerene is fully solvated. The transition state region features a water molecule bridging the fullerenes (Figure 3), corresponding to the top of a sizable free-energy barrier when a CV able to resolve the metastable states, like the distance $d$ between centers of mass, is employed [33].

We start by proposing a pool of eight basis CVs to be used as building blocks for RC optimization: the set ranges from geometry-inspired CVs like the distance $d$ and the total number of carbon-carbon or carbon-water contacts ($cc$, $c2w$, $c1w$) to physics-inspired CVs like the approximate two-body carbon or water entropy ($sc$, $sw$) and the carbon-carbon or carbon-water interaction energy ($ucc$, $ucw$). It is not obvious a priori which CVs or combinations thereof could be the best approximations of an optimal RC.

The TPS trajectories relaxing from the barrier top, together with the free-energy and diffusion profiles inferred from likelihood maximization of Langevin models for each of these CVs are reported for the C60 dimer in Figure 4, and display, as expected, strong variations depending on the CV choice (see the Supporting Information for C240). Clearly, CVs able to resolve well the associated and dissociated states feature a barrier, contrary to CVs lacking such resolving power. Overall, the $F(q)$ landscapes compare well with the reference brute-force ones. A time resolution $\tau=1$ ps is adopted for Langevin modelling, as it yielded Markovian behavior in combination with the CV $d$ for both fullerene sizes in Ref. 33.
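
For intuition on what such a Langevin model describes, the sketch below propagates an overdamped Langevin equation with position-dependent diffusion using the Euler-Maruyama scheme, starting from a barrier top. It illustrates forward simulation only, not the likelihood-maximization inference; the function name `simulate_overdamped`, the double-well profile, and all parameter values (reduced units, $\beta=1$) are hypothetical.

```python
import numpy as np

def simulate_overdamped(q0, dF, D, dD, beta, dt, n_steps, rng):
    """Euler-Maruyama integration of the overdamped Langevin equation
    dq = [-beta * D(q) * F'(q) + D'(q)] dt + sqrt(2 * D(q) * dt) * xi,
    with xi a standard normal random number drawn at each step.
    """
    q = np.empty(n_steps + 1)
    q[0] = q0
    for i in range(n_steps):
        drift = -beta * D(q[i]) * dF(q[i]) + dD(q[i])
        kick = np.sqrt(2.0 * D(q[i]) * dt) * rng.standard_normal()
        q[i + 1] = q[i] + drift * dt + kick
    return q

# Hypothetical model: F(q) = 5 (q^2 - 1)^2 in units of k_B T, constant D.
dF = lambda q: 20.0 * q * (q * q - 1.0)   # derivative of F
D = lambda q: 0.02                        # diffusion coefficient
dD = lambda q: 0.0                        # derivative of D
rng = np.random.default_rng(0)
traj = simulate_overdamped(0.0, dF, D, dD, beta=1.0, dt=0.1, n_steps=1000, rng=rng)
print(traj[0], traj[-1])
```

Trajectories generated this way relax from the barrier top ($q=0$) towards one of the two minima, mimicking the role of the projected transition paths used as input for the inference.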

The basis CVs that result in the lowest kinetic rates, of the order of $10^{-3}$ ps$^{-1}$, are $d$, $cc$ and $ucc$; the CVs providing the highest (hence, least accurate) rates are instead $sw$ and $c1w$, of the order of $10^{-1}$ ps$^{-1}$. This allows a first ranking of the quality of the basis CVs. The position-dependent diffusion coefficient $D(q)$ is roughly between $0.01$ and $0.03$ ps$^{-1}$ for all CVs except $sw$, which reaches almost $0.08$ ps$^{-1}$ (Figure 4).

We applied the RC optimization algorithm (500 accepted steps, with $10^{6}$ Langevin optimization iterations for each putative CV) starting from different initial random combinations of the basis CVs and testing the parameter $\alpha=2,3,4$. As shown in Figure 5, for both fullerene sizes the rate relaxes to a stationary distribution after about 200 steps. Rate fluctuations are wider in these systems than for the double well; however, the lower part of the distribution has the same order of magnitude as the reference rates.

Employing the 5th percentile of the stationary $k$ distribution as criterion (see previous section) at $\alpha=3$, we obtain as optimal rates the values $(1.26\pm 0.15)\cdot 10^{-4}$ ps$^{-1}$ and $(1.8\pm 1.1)\cdot 10^{-8}$ ps$^{-1}$ for the dissociation of the C60 and C240 dimers, respectively, to be compared with the reference reactive-flux rates $(1.6\pm 0.3)\cdot 10^{-4}$ ps$^{-1}$ and $(1.1\pm 0.1)\cdot 10^{-7}$ ps$^{-1}$, respectively (see Methods section).
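
As a rough sense of scale, and assuming first-order kinetics so that the mean waiting time is the inverse of the rate, the predicted rates quoted above correspond to

$$\tau_{\mathrm{C60}} = \frac{1}{k} \approx \frac{1}{1.26\cdot 10^{-4}\ \mathrm{ps}^{-1}} \approx 7.9\cdot 10^{3}\ \mathrm{ps} \approx 7.9\ \mathrm{ns}, \qquad \tau_{\mathrm{C240}} \approx \frac{1}{1.8\cdot 10^{-8}\ \mathrm{ps}^{-1}} \approx 5.6\cdot 10^{7}\ \mathrm{ps} \approx 56\ \mu\mathrm{s}.$$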

Markovianity appears to be enforced effectively, with the vast majority of the accepted CVs having uncorrelated effective noise and a few displaying $<10$ correlated steps (see Figure 5(c, g)).

Finally, the composition of the optimal RCs for the solvated C60 and C240 dimers is reported in Figure 5(d, h): since the basis CVs are mutually correlated, the optimal RC is not uniquely defined, and different weights in the combination $q=\sum_{i=1}^{N}w_{i}\,Q_{i}$ can lead to RCs with similar quality and similar rates.

In fact, performing a principal component analysis on the members of the set of optimal RCs (i.e., the 50 RCs with rate closest to the 5th percentile in Figure 5(b,f)) provides explained variance ratios for the first four principal components $[0.40,0.24,0.14,0.11]$ for C60 and $[0.56,0.17,0.16,0.04]$ for C240 (see Supporting Information). This shows that it is not possible to easily reduce the dimensionality in the optimal space of RCs by means of one or two dominating components.
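
Explained variance ratios of this kind can be obtained from the singular values of the centered weight matrix. The sketch below is a generic PCA illustration, not the authors' analysis: the function name `explained_variance_ratio` is hypothetical, the weight vectors are random synthetic data, and normalizing each weight vector to unit length is an assumption.

```python
import numpy as np

def explained_variance_ratio(weights):
    """PCA explained-variance ratios for the rows of `weights`
    (shape: n_rcs x n_basis_cvs), in descending order.
    """
    w = np.asarray(weights, dtype=float)
    centered = w - w.mean(axis=0)
    # Singular values of the centered data; their squares are
    # proportional to the variance along each principal component.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Synthetic example: 50 weight vectors over 8 basis CVs (illustration only).
rng = np.random.default_rng(0)
weights = rng.normal(size=(50, 8))
weights /= np.linalg.norm(weights, axis=1, keepdims=True)  # assumed normalization
ratios = explained_variance_ratio(weights)
print(np.round(ratios[:4], 2))
```

If one or two entries of `ratios` dominated, the set of optimal RCs would be well described by a low-dimensional subspace of weights; ratios spread over several components, as reported above, indicate that no such reduction is available.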

Nevertheless, some basis CVs contribute more than others, on average, to the optimal RCs: in particular, Figure 5(d,h) suggests that $d$, $cc$, $sc$, and $ucc$ play a prominent role for both fullerene sizes. These CVs do not contain water degrees of freedom, pointing, somewhat counter-intuitively, to a limited role of the solvent in the optimal RC for these systems.

4 Concluding remarks

We presented an algorithm for the automatic optimization of the RC for activated processes starting from $\approx 100$ short TPS MD trajectories and a basis set of putative CVs. The approach is based on the variational principle in Ref. 10: the rate predicted by a Langevin model of the effective projected dynamics is minimized by the optimal RC, coinciding in this limit with the true MD rate.

As an essential ingredient, we adopt the method in Ref. 33 to infer an optimal overdamped Langevin model for each putative RC, thus reliably recovering the corresponding free-energy and diffusion profiles as well as the kinetic rate.

Since we assume that an overdamped stochastic equation can faithfully model the projected dynamics, we explicitly disfavoured the appearance of memory effects in the acceptance probability applied to newly proposed CVs, thus keeping Markovianity under control.

Numerical tests on an analytic double-well system as well as on MD simulations of carbon nanoparticles in explicit water solution indicate that optimal RCs are systematically identified in a robust and efficient way: this allowed us to predict kinetic rates in the microsecond time scale from $\sim 10$ ps-long trajectories.

A foremost advantage of the present algorithm is that RC optimization is done as post-processing of an MD data set: a large number of putative CVs can be screened with limited computer resources without the need to re-run expensive MD calculations. Moreover, compared to alternative approaches based on machine learning the committor, there is no need to estimate committor values over a large sample of atomic configurations, a computationally expensive endeavour that, by construction, is limited to the region close to the barrier top. On the contrary, the proposed approach only requires a small number of reactive trajectories and, by construction, it learns the optimal RC across the full span of the transition paths joining metastable states, even where the committor assumes values very close to zero or to one.

Here the basis set of CVs is built from heuristic considerations, while in future works it could be advantageous to devise an unsupervised building procedure. Likewise, from the simplistic formulation of optimal RCs as linear combinations of basis CVs, it would be beneficial for complex processes to adopt more flexible, nonlinear formulations, possibly including neural networks. As a final word of caution, the effective description adopted here in terms of overdamped Langevin equations is appropriate for a subset of all possible rare-events systems and processes: underdamped or generalized (non-Markovian) stochastic models can be necessary in other cases.

We expect that the approach introduced here could help discover optimal RCs and predict accurate rates for challenging open problems like crystal nucleation and protein-protein or protein-ligand dissociation, whose complexity has so far eluded systematic and predictive computational studies.

5 Acknowledgement

We gratefully acknowledge very insightful discussions with Christoph Dellago, Gerhard Hummer, Hadrien Vroylandt, Arthur France-Lanord, Alessandro Barducci, Bettina Keller, Tony Lelièvre and Edina Rosta. This work was performed with the support of the Institut des Sciences du Calcul et des Données (ISCD) of Sorbonne University (IDEX SUPER 11-IDEX-0004). Calculations were performed on the GENCI-IDRIS French national supercomputing facility, under grant numbers A0110811069, A0120901387, A0130811069.

6 Supporting information

Bibliography

[1] Fabio Pietrucci. Strategies for the exploration of free energy landscapes: Unity in diversity and challenges ahead. Rev. Phys., 2:32–45, 2017.
[2] Aldo Glielmo, Brooke E Husic, Alex Rodriguez, Cecilia Clementi, Frank Noé, and Alessandro Laio. Unsupervised learning methods for molecular simulation data. Chem. Rev., 121(16):9722–9758, 2021.
[3] Paraskevi Gkeka, Gabriel Stoltz, Amir Barati Farimani, Zineb Belkacemi, Michele Ceriotti, John D Chodera, Aaron R Dinner, Andrew L Ferguson, Jean-Bernard Maillet, Hervé Minoux, et al. Machine learning force fields and coarse-grained variables in molecular dynamics: application to materials and biological systems. J. Chem. Theory Comput., 16(8):4757–4775, 2020.
[4] Peter G Bolhuis, David Chandler, Christoph Dellago, and Phillip L Geissler. Transition path sampling. Annu. Rev. Phys. Chem., 53:291–318, 2002.
[5] Young Min Rhee and Vijay S Pande. One-dimensional reaction coordinate and the corresponding potential of mean force from commitment probability distribution. J. Phys. Chem. B, 109(14):6780–6786, 2005.
[6] Philipp Metzner, Christof Schütte, and Eric Vanden-Eijnden. Illustration of transition path theory on a collection of simple examples. J. Chem. Phys., 125(8):084110, 2006. doi: 10.1063/1.2335447.
[7] Weinan E and Eric Vanden-Eijnden. Transition-path theory and path-finding algorithms for the study of rare events. Annu. Rev. Phys. Chem., 61:391–420, 2010.
[8] Polina V Banushkina and Sergei V Krivov. Optimal reaction coordinates. Wiley Interdiscip. Rev. Comput. Mol. Sci., 6(6):748–763, 2016.