Committors without Descriptors

Peilin Kang; Jintu Zhang; Enrico Trizio; TingJun Hou; Michele Parrinello

PMC · DOI:10.1021/acs.jctc.5c01848·February 11, 2026

Committors without Descriptors

Peilin Kang, Jintu Zhang, Enrico Trizio, TingJun Hou, Michele Parrinello

PDF

Open Access

TL;DR

This paper introduces a method using graph neural networks to study rare events in molecular simulations without relying on predefined descriptors.

Contribution

The novel approach uses graph neural networks to automate and improve the modeling of committors in rare event simulations.

Findings

01

Graph neural networks can directly process atomic coordinates, improving the modeling of rare events.

02

The method enables efficient sampling of transition states in systems like ion pair dissociation and ligand binding.

03

The approach is self-consistent and semiautomatic, using a variational criterion to optimize the committor.

Abstract

The study of rare events is one of the major challenges in atomistic simulations, and several enhanced sampling methods toward its solution have been proposed. Recently, it has been suggested that the use of the committor, which provides a precise formal description of rare events, could be of use in this context. We have recently followed up on this suggestion and proposed a committor-based method that promotes frequent transitions between the metastable states of the system and allows extensive sampling of the process transition state ensemble. One of the strengths of our approach is being self-consistent and semiautomatic, exploiting a variational criterion to iteratively optimize a neural-network-based parametrization of the committor, which uses a set of physical descriptors as input. Here, we further automate this procedure by combining our previous method with the expressive…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes2

l(2)34Fc anon-A1

Proteins2

Chemicals10

carbonate NaCl CaCO3 Alanine dipeptide O H2O Calixarene 4-cyanobenzoic acid dipeptide Ca

Diseases1

TS

Figures5

Click any figure to enlarge with its caption.

Truncated graph. Schematic representation, using the dissociation of NaCl as an example, of the truncated graph construction to reduce the computational cost of using GNN-based methods when a few reacting atoms (NaCl) interact with a large number of environment atoms (H2O). Only the atoms belonging to the neighborhood defined by the cutoff radius R t from reacting atoms are associated with graph nodes, whereas the other are neglected (A and C). Such nodes are then connected with edges based on the cutoff R c < R t (B and C). To avoid having reacting atoms in disconnected graphs, edges between reacting atoms can be enforced regardless of the distance (D). The whole truncated graph is processed through the GNN model, but only the reacting atoms are considered in the readout function to obtain the final CV output.

Alanine dipeptide. (A) Relevant torsional angles of alanine dipeptide. (B) Relative relevance, given by the color scale, to the GNN-based committor model of the graph nodes associated with the heavy atoms. (C) Free energy surface (FES) projected along the ϕ torsional angle obtained with the GNN-based setup proposed in this work (blue solid curve) and a reference OPES simulation using ϕ and ψ as CVs (red dashed curve).

Calixarene. Snapshots of representative configurations along the dominant reaction pathway of B w to B d in the binding process of the G2 ligand (orange) to the OAMe host molecule (gray) in water, grouped according to the z value. The water molecules are represented as blue sticks, depicted in transparency when outside the binding cavity and in solid color when inside.

NaCl dissociation. (A) Computed free energy surface (FES), indicated by the color map, in the space defined by the number of water molecules bridging the ions n b and by the interionic distance d NaCl. The two reactive pathways from the associated states are indicated by dashed lines. (B) Distribution of the Kolmogorov probability pK , indicated by the color map, in the n b and d NaCl plane. The isolines of the free energy of A are superimposed in white as a reference. The insets show representative configurations of the transition states that characterize the two reaction pathways.

CaCO3 dissociation. (A) Free energy surface (FES), indicated by the color map, in the plane defined by the interionic distance d Ca–C and the number of solvating water molecules around the Ca2+ ion n w. The metastable states are indicated on the FES, and representative snapshots are provided in the insets. The position of the transition states is indicated by a star. (B) Medoid configurations for the two transition state ensemble clusters TS1 and TS2. (C) Average attention scores Att(r) of messages from water oxygen nodes to the Ca2+ node as a function of the interatomic distance d Ca–Ow (red line). The radial distribution function of water oxygens with respect to Ca2+ is reported as a reference (blue line). The two curves are both normalized to be in the range (0,1).

Funding1

—National Natural Science Foundation of China10.13039/501100001809

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning in Materials Science · Computational Drug Discovery Methods · Crystallography and molecular interactions

Full text

Introduction

I

Atomistic simulations are an indispensable tool in the study of complex physicochemical processes. However, such simulations find one of their limits in the gap between the affordable simulation time and the typically much longer time scale over which many important phenomena like chemical reactions, protein folding, and crystallization take place. Such processes are indeed characterized by rare transitions between metastable states, which are separated by large free energy barriers that act as kinetic bottlenecks hindering sampling. This has been called the rare event problem, and, since the introduction of umbrella sampling some 50 years ago, a multitude of enhanced sampling approaches have been suggested to solve it.?

Recently, we have introduced a new enhanced sampling method based on the committor function that greatly alleviates the rare event problem? by promoting extensive sampling of both transition and metastable states, and have started using this approach to solve a number of real-life problems. ?−? ? ? We recall here that the committor q(** x **) is a function of the atomic coordinates ** x ** which, given two metastable states A and B, gives the probability that a trajectory started in ** x ** reaches B without having first passed by A.? The committor is arguably the most precise way of describing rare events, since it is a quantity that remains well-defined even if the transition from A to B follows different competing pathways or passes through an intermediate metastable state. The committor is also believed to be the optimal one-dimensional reaction coordinate.?

Unfortunately, if one follows the committor formal definition, a rather expensive trial-and-error strategy is needed for its determination.? Recently, alternative ways have been proposed to learn the committor function from simulation data using machine learning tools in combination with different learning criteria and enhanced sampling schemes. ?,?,?−? ? ? ? ? ? ? ? Such methods have been recently reviewed in ref ?. Our contribution to this area ?,? is based on the Kolmogorov variational principle, which is obeyed by the committor? which has been solved in a self-consistent procedure that eventually leads not only to the calculation of q(** x **) but to an extensive and balanced evaluation of the free energy equation.

However, like in all variational calculations, the quality of the results depends on the expressivity of the trial function used.? In our initial approach, we have represented q(** x ) as a feed-forward neural network q( x ) = q θ( d ( x **)) whose input is a set of descriptors ** d ( x **) chosen to be invariant with respect to the symmetries of the system and whose weights θ are optimized to minimize the functional associated with Kolmogorov variational problem. Despite having proven to be effective in several challenging systems, ?−? ? ? this approach still relies, at least partially, on the user’s insight for the choice of appropriate descriptors, which, in complex cases, might not be easy to select.

The purpose of this paper is to remove as much as possible this potential obstacle and make the calculation of q(** x **) as automatic as possible. To achieve this goal, we parametrize the committor with a geometric Graph Neural Network (GNN),? which can directly use as input the atomic Cartesian coordinates ** x ** while respecting the invariance laws of the system. This architecture has already proven to be useful in building machine-learning potentials ?−? ? and in designing collective variables (CVs). ?−? ? ? ? When using a GNN, an atomic system is naturally represented by a graph whose nodes are the atoms and whose connecting edges describe their relationship. Optionally, nodes and/or edges can be assigned attributes that encode information on the system, greatly facilitating the analysis of the results,? especially if an attention mechanism is added to the GNN structure.? Furthermore, GNN architectures can be made invariant or equivariant with respect to the symmetry operations of the system. However, due to their more complex structure, GNN models are computationally more expensive when compared to standard models, with the overall cost increasing with the complexity and size of the graph. To alleviate this limitation, we have also introduced new ways of reducing the computational cost of using a GNN when dealing with reactions in condensed phases.

In the field of enhanced sampling, it is customary to use as a test the conformational equilibrium of alanine dipeptide. We will stick to this tradition and start with the study of this molecule. Having passed the dipeptide test, we will demonstrate the usefulness of this new approach in a number of more physically relevant examples in which the solvent plays an important role. The examples studied are the dissociation of NaCl and of CaCO_3_ in water, and the binding of an organic molecule to calixarene, a simplified but still representative model of drugprotein interaction. All these systems have been studied with other means and thus they provide a good testing ground also for the versatility of the committor since their reactive processes can exhibit multiple reaction pathways and/or metastable intermediate states. ?−? ? ? ?

Methods

II

Background

II. A

To learn the committor q(** x **), we use the Kolmogorov variational principle, ?,? which implies minimizing the functional

[eqn]

under the boundary conditions q(** x ** _ A ) = 0 and q(** x ** _ B ) = 1, where ** x ** _ A _ and ** x ** _ B _ denote an initial and final configurations taken from states A and B, while ** u ** indicates mass scaled coordinates. The average ⟨·⟩ U(** x **) in eq is taken over the Boltzmann ensemble of the studied atomistic system, which we assume to interact via the potential U(** x **).

However, the evaluation of $[eqn]$ is in practice far from trivial. In fact, the committor has a step-like structure that raises rapidly from q(** x ) ≈ 0 when ** x ** ∈ A to q( x ) ≈ 1 when ** x ** ∈ B, as it goes through the transition state (TS) region. This makes the term |∇_ ** u ** _ q( x **)|^2^ sharply peaked on the TS region, which unfortunately is hard to sample, since it is seldom visited when dealing with rare events due to the presence of large energetic barriers.

In order to get around this sampling issue, ?,? we thus resort to enhanced sampling. In ref. ?, to enhance sampling of the otherwise elusive TS region, we introduced the bias potential $[eqn]$

[eqn]

where β is the inverse temperature. Exploiting the above-discussed localization of the committor gradients, such a bias is able to stabilize the TS region and turn it into a minimum that can be sampled as extensively as a standard metastable state. In addition, to promote also transitions between the states and favor ergodic sampling, in ref. ?, we complemented this approach with a metadynamics-like bias using the on-the-fly probability enhanced sampling ?,? (OPES) based on a committor-derived collective variable (CV). Even if the committor is believed to be formally the best possible reaction coordinate, its direct use as a CV presents numerical problems that make such an approach ineffective. For this reason, we make what basically amounts to the change of variable q(** x ) = σ(z( x )), where σ(z) = 1/(1 + e ^–pz ^) and use z( x ) as CV. Since σ(z( x )) a monotonous function, z( x ) encodes the same information as q( x ), but avoids the numerical issues that are a consequence of the sharp behavior of q( x ) and of the fact that, being q( x **) a probability, it can assume values that are smaller than the limit of machine precision.

The solution to the Kolmogorov variational problem is found in an iterative process that starts from an initial guess for the committor. In ref. ?, the initial guess was constructed using data collected in two unbiased simulations performed in the A and B basins. Such an initial guess does not have any information on the TS states, and as such, this initial guess is not optimal. A better starting guess, which will be used here, is to start the self-consistent procedure, using also data coming from the TS. Such data can be obtained, for example, from metadynamics simulations, even if driven by suboptimal CVs.

One important feature of the committor is that it provides a powerful analysis tool for understanding the reactive processes. One crucial element in this regard is the identification of those configurations defining the transition state ensemble (TSE). Following the spirit of our approach, we have proposed to identify the TSE using the Kolmogorov distribution $[eqn]$ , defined as

[eqn]

This distribution indeed measures the contributions that trajectories passing by a configuration ** x ** bring to the transition rate ν_ R _, since

[eqn]

The advantages of using this definition of TSE are discussed in ref. ? and in Section with examples.

Machine Learning the Committor with GNNs

II. B

As anticipated in the introduction, we change the model used to parametrize the committor and, rather than using a feed-forward neural network with a set of descriptors as previously done, ?,? we use a graph neural network (GNN) that directly processes the atomic Cartesian coordinates. In addition, we add to the original loss a few regularization terms aimed at stabilizing and simplifying the training. Apart from these modifications, the procedure is the same as described in our previous work. ?,?

Loss Function

II.B.I

The loss function employed here is composed of three terms

[eqn]

where the hyper parameters α_1_ and α_2_ regulate the relative strength of the different terms. In the first term, $[eqn]$ is the estimate based on a training set of N v configurations ** x ** _ i _ of the functional $[eqn]$

[eqn]

Since most of the time the data will come from biased simulations, a statistical weight w _ i _ is associated with each ** x ** _ i _ so as to give each data its statistically correct contribution. As in ref. ?, we do not insert in the loss function directly $[eqn]$ but its logarithm to improve training stability since $[eqn]$ can vary by many orders of magnitude. This change does not alter the minimum of the functional, since $[eqn]$ is positive definite and the logarithm is a monotonously growing function, but avoids numerical problems since $[eqn]$ can vary by very many orders of magnitude.

The second term, L _ b _, enforces the correct boundary conditions q(** x ** _ A _) = 0 and q(** x ** _ B _) = 1

[eqn]

evaluated on a labeled data set of N _ A _ and N _ B _ configurations belonging to A and B respectively. We note that there is no need to introduce additional collective variables to distinguish the metastable states when generating the initial labeled data set, since short unbiased trajectories initiated from either A or B will remain confined to the initial basins in a rare event scenario.

Finally, dealing with the greater expressivity of GNN-based models, we realized that it was better to control the range of accessible z values, which in principle could have been (−∞, +∞), and avoid overfitting the metastable states. We thus introduced the regularization term $[eqn]$ to control the range of the z value accessible and to make optimization more balanced.

[eqn]

where z _ r _ is a user set threshold value and z _ i _ the configurations’ predicted z value.

GNN Model

II.B.II

As GNN architecture, we adopt that of SchNet? proposed by Schütt et al., which provides a good balance between computational cost and expressivity. In addition, to better understand the relationship between neighboring nodes, in examples of the CaCO_3_ dissociation and the ligand binding examples, we have applied a node-level attention mechanism to filter out the less relevant messages coming from neighboring nodes.

SchNet is a message-passing graph neural network designed for atomistic systems, in which the nodes of the graph represent the atoms in the system. To each node is also associated a set of features that are initialized based on the corresponding atom type. Such features are then updated through the network via interaction layers modeled as continuous convolution filters.

In each interaction block, the message passed to the atom i from the atom j is given by

[eqn]

where ** h ** _ j _ is the feature array of atom j, ** W ** is a trainable linear transformation, and f θ ^ F ^(·) is a filter network based on the pairwise interatomic distance expanded by a set of radial basis functions (RBF). The sum of messages from all neighbors is then used to update the representation of atom i

[eqn]

where the j index runs over the N _ n _ nodes in the neighborhood $[eqn]$ , and f θ ^ M ^(·) is a network that processes the total message received by each node.

Instead, if the attention mechanism is employed, the update function is modified as

[eqn]

where α_ ij _ is the trainable attention score? computed as

[eqn]

in which g θ is a gate network that returns a scalar value for each message incoming to node i. To summarize, the incorporation of such an attention mechanism enables the network to assign varying importance to neighboring nodes based on their features and spatial relationships, rather than treating all neighbor messages equally.

As anticipated in the previous section, following ref. ?, we express the committor as q(** x ) = σ(z( x **)), with σ being the activation function described above. More in detail, the value of z is computed by first applying a pooling operation (i.e., average) to the node features ** h ** _ i _ of the last GNN layer and then feeding this to a readout network f θ ^ R ^ according to

[eqn]

where N _ p _ is the number of nodes involved in the pooling operation. This choice proved to be more stable than what was done in our previous work,? in which the readout function was applied before the pooling operation. In the most general case, the pooling operation runs on all the nodes in the graph, whereas, in the case of the truncated graph discussed in the next section, the pooling operation is applied to a subset of the most relevant atoms only.

Improving Efficiency

II.B.III.

An important contribution to the overall efficiency and eventual applicability of our approach comes from the way the model input graph is constructed. In fact, one of the main limitations of GNN models is their greater computational cost, which scales rapidly with the number of atoms included in the graph. Often, it is not strictly necessary to include in the graph all the atoms in the system. For example, when dealing with chemical reactions in a solvent, heterogeneous catalysis, or protein ligand binding, only the molecules close to the reactants take an active part in the reaction and should thus be taken into account, whereas the rest of the solvent molecules can be safely ignored.

To take advantage of these considerations, we use the dual neighbor list schematically depicted in Figure for the case of a molecule dissociation in water, which is structured as follows. First, the atoms in the system are divided by the user into two nonoverlapping categories. The reacting atoms, which are always included in the graph and the environment atoms, which are included in the graph only if close to the reacting atoms. In practice, a first neighbor list NL_out_ is built based on a cutoff R _ t _ with respect to the reacting atoms (panel A), then the graph is constructed by associating nodes to all the atoms included in NL_out_ and connecting them with edges according to a second neighbor list NL_graph_ based on a cutoff R _ c _ (panel B). Importantly, the cutoffs are chosen R _ t _ = R _ c _ + Δ_ b _ to always have a buffer region (Δ_ b _ > 0) of environment atoms not directly interacting with the system, thus stabilizing the calculation between the neighbor lists updates. This is similar to what is done in molecular dynamics when updating neighbor lists.

*Truncated graph. Schematic representation, using the dissociation of NaCl as an example, of the truncated graph construction to reduce the computational cost of using GNN-based methods when a few reacting atoms (NaCl) interact with a large number of environment atoms (H2O). Only the atoms belonging to the neighborhood defined by the cutoff radius R

t from reacting atoms are associated with graph nodes, whereas the other are neglected (A and C). Such nodes are then connected with edges based on the cutoff R

c < R

t (B and C). To avoid having reacting atoms in disconnected graphs, edges between reacting atoms can be enforced regardless of the distance (D). The whole truncated graph is processed through the GNN model, but only the reacting atoms are considered in the readout function to obtain the final CV output.*

To avoid atoms belonging to disconnected graphs based solely on R _ c _, as it could easily happen when two ions or a ligand and a guest are far apart in a solvent (panel C), connectivity between reacting atoms can be guaranteed by defining their interaction based on fixed edges between them (panel D).

The use of this truncated graph approach, already for systems of relatively small sizes, such as the NaCl studied in this paper (i.e., 216 water molecules), we find great computational savings both in the training and the simulation stage.

GNN Interpretability

II.C

The use of a geometric GNN to model atomistic systems brings several advantages in terms of the physical insight that can be extracted once the model is optimized. For example, as atoms are directly mapped to graph nodes, performing a node-level sensitivity analysis can clearly identify which atoms are most relevant for the process in a simple way. Here, we perform such an analysis in two ways. In one, we followed the node sensitivity analysis proposed in ref. ?, in which the modulus of the derivatives of the model output with respect to the Cartesian coordinates of a node is taken as a measure of its relevance. In practice, considering a data set of N _ g _ graphs, the sensitivity s _ i _ of the ith node is computed as

[eqn]

Where, $[eqn]$ is the jth entry of the data set, and ** x ** _ i _ ^ j ^ is the position of the ith node in $[eqn]$ .

Alternatively, when an attention layer is employed, one can use the attention scores α_ ij _ as a measure of node relevance of the i, j interaction (see eq).

Another important advantage of GNN models is that, starting from simple atomic coordinates, they inherently learn a representation of the system that encodes structural information in the hidden node features of the model. In fact, a distance metric, for instance, the Euclidean distance, can be defined in the latent space identified by such features that can be used to perform unsupervised analysis, for example, based on clustering algorithms or other dimensionality reduction algorithms.

Results

III

The technical details of the simulations, GNN hyperparameters, training procedures and computational cost for the examples presented are provided in the Supporting Information (SI).

Alanine Dipeptide

III. A

In most studies devoted to rare event sampling, the transition of alanine dipeptide in vacuum between the C7_eq_ (A) and C7_ ax _ (B) conformers is commonly used to demonstrate the effectiveness of new methods. This system is one of the most extensively studied models for rare events and is often characterized by two dihedral angles, ϕ and ψ. Following ref. ?, we begin with unbiased simulation data sampled in the two metastable basins. To build the input for our GNN model, we consider the 10 heavy atoms of the molecule, which serve as the nodes in the input graph.

Using the optimized GNN model, which reaches convergence within 4 iterations (see SI for details), to drive our enhanced sampling simulations, we obtain an estimate of the free energy surface (FES) in close agreement with reference results obtained by biasing the conventional ϕ, ψ. In addition, we can automatically identify the key atoms involved in the transition from the GNN architecture. In particular, a node sensitivity analysis reveals that atoms 3, 4, 5, and 6 contribute most significantly to the learned committor (see FigureB). As expected, these are the atoms closely related to the dihedral angles ϕ and θ, in full agreement with previous studies. ?,? The two-dimensional committor map projected on two backbone dihedral angles ϕ and ψ and other detailed discussion can be found in Figure S5.

Alanine dipeptide. (A) Relevant torsional angles of alanine dipeptide. (B) Relative relevance, given by the color scale, to the GNN-based committor model of the graph nodes associated with the heavy atoms. (C) Free energy surface (FES) projected along the ϕ torsional angle obtained with the GNN-based setup proposed in this work (blue solid curve) and a reference OPES simulation using ϕ and ψ as CVs (red dashed curve).

Calixarene

III. B

Having paid homage to tradition, we now study the interaction of the G_2_ ligand(4-cyanobenzoic acid) with an octa-acid calixarene host (OAMe) from the SAMPL5 challenge,? which provides a good example of some features of protein–ligand systems.

Previous studies have shown that the system primarily follows two transition pathways: a “wet” path, in which the process proceeds through the semibound state B w before reaching the fully bound state B d, and a “dry” path, where guest molecules enter and exit the pocket directly. In addition, the wet path has been shown to be dominant even if it involves more intricate dynamics. Here, we shall not repeat the calculation of ref,? but focus only on one of the rate-limiting step, i.e., the transition from the B w to B d with the purpose of illuminating the role of water in this transition. The results obtained with the setup presented in this work for the whole process are available in the SI, including the overall free energy surface and the estimate for the binding energy.

For the B w to B d rate-limiting step, our algorithm finds two pathways, one in which the ligand is solvated while at the same time the calixarene is emptied of water and eventually binds to the now dry calixarene (see SI Figure S10). However, as the contribution of this path to the rate is negligibly small (<5%, based on analysis of $[eqn]$ contributions), we consider here only the dominant path (see Figure). In this dissociation mode, the water that is trapped in the calixarene in the B w state accompanies the ligand to the entrance of the calixarene until it is released into the bulk solvent, leaving room for the ligand to bind to its lowest free energy state.

Calixarene. Snapshots of representative configurations along the dominant reaction pathway of B w to B d in the binding process of the G2 ligand (orange) to the OAMe host molecule (gray) in water, grouped according to the z value. The water molecules are represented as blue sticks, depicted in transparency when outside the binding cavity and in solid color when inside.

NaCl

III. C

We now discuss the case of NaCl dissociation in water, in which the solvent clearly plays a central role. The contact ion pair (CIP) is our state A, while B is the dissociated state. To focus on the dissociation process itself, we set an artificial repulsive wall that limits the interionic distance at 6 Å. This implies that in the calculation of the free energy of the B state, part of the entropic contribution is missing.

In our previous work ?,? and also in the case of alanine dipeptide, we started the committor learning self-consistent procedure with two unbiased simulations started in A and B, i.e., without any data whatsoever coming from the transition state region. Despite being a procedure of general applicability, this is often an overkill since one can often obtain some information on the TS by running first a metadynamics-like simulation, even if driven by a suboptimal CV. To showcase this strategy, we start with data coming from an OPES simulation in which the interionic distance was used as CV. Even for a simple example like this, this approach allowed reducing the number of iterations needed to reach convergence (see SI, Section S4).

In FigureA, we project the free energy surface onto two physically transparent quantities, the ion–ion distance d NaCl and the number of bridging water molecules that coordinate both ions at the same time n _ B _.

NaCl dissociation. (A) Computed free energy surface (FES), indicated by the color map, in the space defined by the number of water molecules bridging the ions n b and by the interionic distance d NaCl. The two reactive pathways from the associated states are indicated by dashed lines. (B) Distribution of the Kolmogorov probability pK , indicated by the color map, in the n b and d NaCl plane. The isolines of the free energy of A are superimposed in white as a reference. The insets show representative configurations of the transition states that characterize the two reaction pathways.

Usually, the transition state ensemble is defined as the set of ** x ** for which q(** x **) ≈ 0.5, however, as discussed in refs ?,?,? , it is better to base our analysis on the transition state ensemble as defined by $[eqn]$ , which gives the probability that a state ** x ** is actually visited along the reactive trajectory. In this case, the $[eqn]$ distribution has a multimodal structure in which each mode can be classified by the number of water molecules n b that are shared by the two ions (see FigureB). Using eq and measuring the integrals over the two main peaks of $[eqn]$ , we can estimate that about 50% of the reactive trajectories pass via the n b ≈ 1 transition state and 40% via n b ≈ 2. The remaining 10% contribution comes from the minority paths through n b ≈ 0 and 3.

For this relatively simple system, the above analysis, based on our physical understanding of the system, we could interpret the results rather straightforwardly. However, in more complex systems, this may not have been so straightforward. For this reason, we show that using the information encoded in the GNN, one could have arrived at the same results in an unsupervised way. Thus, we repeated the $[eqn]$ analysis, performing a k-medoid clustering? using as metric the Euclidean distance between the node features of the last GNN layer (see Sec. II C). We find again that two modes dominate the $[eqn]$ distribution, which correspond to the two possible screening configurations described above, where the ions are screened by one or two water molecules.

CaCO3

III. D

The final test of our method is the study of the dissociation of CaCO_3_ in water, which is another nontrivial example of a chemical process in solution that further showcases the power of the GNN-based approach. For this system, we take as state A the contact ion pair in which Ca^2+^ and CO_3_ ^2–^ form a nearly planar structure close to the gas-phase equilibrium geometry, with the Ca^2+^ ion symmetrically positioned in front of two carbonate oxygen atoms. In such a state, the screening cloud of Ca^2+^ is composed by five water molecules and two carboxylic oxygens. The model potential used at the simulated system size predicts that two major solvation structures (B_6_ and B_7_) are possible. Such structures are almost degenerate: in B_7_ the cation is surrounded by 7 water molecules arranged in a pentagonal bipiramid, while an octahedron of 6 waters forms the solvation shell of B_6_. The fact that the final state is not unique might at first seem like a major problem for a committor based approach, in which the final state has to be specified beforehand, but in fact using just B_7_ as final state we were still able to discover B_6_ and to obtain a free energy surface (see FigureA) in agreement with previous metadynamics-based investigations.?

CaCO3 dissociation. (A) Free energy surface (FES), indicated by the color map, in the plane defined by the interionic distance d Ca–C and the number of solvating water molecules around the Ca2+ ion n w. The metastable states are indicated on the FES, and representative snapshots are provided in the insets. The position of the transition states is indicated by a star. (B) Medoid configurations for the two transition state ensemble clusters TS1 and TS2. (C) Average attention scores Att(r) of messages from water oxygen nodes to the Ca2+ node as a function of the interatomic distance d Ca–Ow (red line). The radial distribution function of water oxygens with respect to Ca2+ is reported as a reference (blue line). The two curves are both normalized to be in the range (0,1).

The resulting FES (see FigureA) shows that, starting from state A, the system passes via an intermediate state I 1 in which the number of solvating water n w = 6. From this state, two reactive paths branch out leading to the two different final states described above. The intermediate state I 1 is characterized by the loss of the planarity characteristic of A and the replacement of one of the coordinating carbonate oxygen atoms by one water molecule. From I 1, one can go either directly to B_6_ or to B_7_ via the intermediate I 2 (n w = 7), which corresponds to a near-dissociated configuration in which Ca^2+^ is fully coordinated by seven water molecules and the now distant carbonate anion no longer directly contributes to its screening. However, in this state, the solvation shells of the two ions still share two water molecules.

The rate-limiting step of the whole process is the transition from A to I 1, which can take place in two ways, as revealed by the bimodality of the Kolmogorov distribution $[eqn]$ associated with the transition from A to I 1 (see Figure S13). The dominant ensemble of paths passes via TS_1_ (see FigureB), with an interchange mechanism in which the distance between Ca^2+^ and CO_3_ ^2–^ increases, facilitating the arrival of a new water molecule. Instead, another less probable path goes via TS_2_ (see FigureB) and can be described as an associative substitution in which both carbonate oxygens remain coordinated, while the calcium-solvating waters temporarily adopt an eight-coordinated geometry.

It is interesting to note that this picture is strengthened by an analysis of the representation learned by the trained GNN. In fact, the attention mechanism identifies the water molecules that are most relevant to the Ca^2+^ coordination (see Section for details). In FigureC, we plot together, as a function of the Ca–O distance, the pair correlation function and the attention weight distribution. Unsurprisingly, we see that the primary hydration shell at 2.3 Å is identified as highly important, but that a secondary peak at 2.9 Å is also relevant. This peak is related to the water molecules that take part in the TS_1_ ligand-exchange processes, discussed earlier. We also observe that the attention distribution starts to be different from zero and rather large as soon as the pair correlation is different from zero, reflecting the fact that the solvation waters at the smaller Ca–O distances are more effective at screening the cation and that such an effect is correctly encoded in the attention distribution.

Discussion

IV

The combination of our recent committor-based approach with the expressivity and generality of the GNN architectures offers a new, powerful tool for the semiautomatic study of rare events. This approach is descriptor-free, and the graph-based architecture provides new and powerful analysis tools to dissect in detail the reactive process under study. Such possibilities range from a simpler identification of relevant atoms, thanks to the one-to-one correspondence between reacting atoms and graph nodes, to the use of the model learned hidden representation as the basis for further analysis or the direct study of the information encoded into the attention layers.

Furthermore, we have also shown that the committor iterative optimization procedure ?,? can be improved and accelerated if one uses data from preliminary imperfect enhanced sampling simulations. In addition, we believe that our proposed suggestion to reduce the computational cost associated with the use of GNNs for enhanced sampling will help make its routine use more accessible, as it has happened in other fields, such as the construction of interatomic potentials.

Supplementary Material

Bibliography41

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Hénin J.Lelièvre T.Shirts M. R.Valsson O.Delemotte L.Enhanced Sampling Methods for Molecular Dynamics Simulations Living J. Comput. Mol. Sci.20224158310.33011/livecoms.4.1.1583 · doi ↗
2Trizio E.Kang P.Parrinello M.Everything everywhere all at once: a probability-based enhanced sampling approach to rare events Nat. Comput. Sci.2025511010.1038/s 43588-025-00799-540325221 · doi ↗ · pubmed ↗
3Kang P.Trizio E.Parrinello M.Computing the committor with the committor to study the transition state ensemble Nat. Comput. Sci.2024411010.1038/s 43588-024-00645-038839932 · doi ↗ · pubmed ↗
4Deng, Y. ; Kang, P. ; Xu, X. ; Li, H. ; Parrinello, M. The role of fluctuations in the nucleation process ar Xiv, preprint ar Xiv:2503.20649, 2025.10.1073/pnas.2526954123 PMC 1286774041615744 · doi ↗ · pubmed ↗
5Das S.Raucci U.Trizio E.Kang P.Neves R. P.Ramos M. J.Parrinello M.A Machine Learning-Driven, Probability-Based Approach to Enzyme Catalysis ACS Catal.2025159785979210.1021/acscatal.5c 02431 · doi ↗
6EW.Vanden-Eijnden E.Transition-path theory and path-finding algorithms for the study of rare events Annu. Rev. Phys. Chem.20106139142010.1146/annurev.physchem.040808.09041218999998 · doi ↗ · pubmed ↗
7Ma A.Dinner A. R.Automatic method for identifying reaction coordinates in complex systems J. Phys. Chem. B 20051096769677910.1021/jp 045546 c 16851762 · doi ↗ · pubmed ↗
8Bolhuis P. G.Chandler D.Dellago C.Geissler P. L.Transition path sampling: Throwing ropes over rough mountain passes, in the dark Annu. Rev. Phys. Chem.20025329131810.1146/annurev.physchem.53.082301.11314611972010 · doi ↗ · pubmed ↗