TL;DR
This paper develops a machine learning-based method to accurately model small molecules' potential energy surfaces and forces, enabling detailed molecular dynamics simulations with near-quantum accuracy.
Contribution
It introduces the symmetrized gradient-domain machine learning (sGDML) approach for constructing high-accuracy force fields from limited data, applicable to small molecules.
Findings
sGDML accurately reproduces high-level ab initio forces
The method captures complex electronic interactions without restrictions
Molecular dynamics reveal new insights into small molecule behavior
Abstract
We present the construction of molecular force fields for small molecules (less than 25 atoms) using the recently developed symmetrized gradient-domain machine learning (sGDML) approach [Chmiela et al., Nat. Commun. 9, 3887 (2018); Sci. Adv. 3, e1603015 (2017)]. This approach is able to accurately reconstruct complex high-dimensional potential-energy surfaces from just a few 100s of molecular conformations extracted from ab initio molecular dynamics trajectories. The data efficiency of the sGDML approach implies that atomic forces for these conformations can be computed with high-level wavefunction-based approaches, such as the "gold standard" CCSD(T) method. We demonstrate that the flexible nature of the sGDML model recovers local and non-local electronic interactions (e.g. H-bonding, proton transfer, lone pairs, changes in hybridization states, steric repulsion and …
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6| Dataset | Energy Prediction | |||
|---|---|---|---|---|
| DFT | CCSD(T) | |||
| Molecule | MAE | RMSE | MAE | RMSE |
| keto–MDA | ||||
| Ethanol | ||||
| enol–MDA | ||||
| Aspirin | ||||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Molecular Force Fields with Gradient-Domain Machine Learning: Construction and Application to Dynamics of Small Molecules with Coupled Cluster Forces
Huziel E. Sauceda
Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin, Germany
Stefan Chmiela
Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
Igor Poltavsky
Physics and Materials Science Research Unit, University of Luxembourg, L-1511 Luxembourg, Luxembourg
Klaus-Robert Müller
Machine Learning Group, Technische Universität Berlin, 10587 Berlin, Germany
Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea
Max Planck Institute for Informatics, Stuhlsatzenhausweg, 66123 Saarbrücken, Germany
Alexandre Tkatchenko
Physics and Materials Science Research Unit, University of Luxembourg, L-1511 Luxembourg, Luxembourg
Abstract
We present the construction of molecular force fields for small molecules (less than 25 atoms) using the recently developed symmetrized gradient-domain machine learning (sGDML) approach [Chmiela et al., Nat. Commun. 9, 3887 (2018); Sci. Adv. 3, e1603015 (2017)]. This approach is able to accurately reconstruct complex high-dimensional potential-energy surfaces from just a few 100s of molecular conformations extracted from ab initio molecular dynamics trajectories. The data efficiency of the sGDML approach implies that atomic forces for these conformations can be computed with high-level wavefunction-based approaches, such as the “gold standard” CCSD(T) method. We demonstrate that the flexible nature of the sGDML model recovers local and non-local electronic interactions (e.g. H-bonding, proton transfer, lone pairs, changes in hybridization states, steric repulsion and interactions) without imposing any restriction on the nature of interatomic potentials. The analysis of sGDML molecular dynamics trajectories yields new qualitative insights into dynamics and spectroscopy of small molecules close to spectroscopic accuracy.
pacs:
Valid PACS appear here
††preprint: APS/123-QED
I Introduction
Molecular force fields (FFs) constitute one of the most important tools in chemistry, biology, and materials modelling due to their remarkable value in understanding systems that range from small molecules, e.g. ethanol with 9 atoms, up to large proteins, aiding in the exploration and the discovery of new materials and drugs. Creating physically inspired and handcrafted interatomic potentials with parameters fitted to experimental data or quantum-mechanical calculations has been a common practice since the early works on molecular dynamics Alder and Wainwright (1959); Rahman (1964); Verlet (1967); Rahman and Stillinger (1971). The complexity of creating reliable interatomic interaction models using prior physical knowledge led to the development of dedicated specialized FFs for different material classes, including tight-binding potentials for semiconductors and metals Daw and Baskes (1984), Tersoff potential for covalent materials Tersoff (1988), polarizable FFs Warshel et al. (2006), the TIPnP FFs for water Jorgensen et al. (1983); Mahoney and Jorgensen (2000), and a wide variety of biomolecular FFs such as AMBER, CHARMM, MMFF, and GROMOS that often lead to reliable results for folded protein structures under ambient conditions Weiner and Kollman (1981); Brooks et al. (1983); Halgren (1996); Soares et al. (2005). The wealth of available interatomic potentials illustrates the vast amount of fundamentally different material classes, exposing that treating different types of interactions (metallic bonding, covalent chemistry, hydrogen bonding, non-covalent interactions, etc.) in a unified and seamless fashion is a complex challenge for handcrafted mechanistic FFs.
To resolve some of these challenges, a number of recently developed machine learned (ML) FFs exploit the redundant information contained in datasets of ab initio calculations or molecular dynamics trajectories to reconstruct the underlying potential energy surface (PES) without imposing any particular handcrafted analytical form for the interatomic potential.
In particular, a vast amount of work has been done in molecular representations Rupp et al. (2012); Bartók et al. (2010); Hansen et al. (2013); Bartók and Csányi (2015); Hansen et al. (2015a); Rupp et al. (2015); De et al. (2016); Artrith et al. (2017); Bartók et al. (2017); Glielmo et al. (2017); Yao et al. (2017); Faber et al. (2017); Eickenberg et al. (2018); Glielmo et al. (2018); Grisafi et al. (2018); Tang et al. (2018); Pronobis et al. (2018); Faber et al. (2018), neural networks architecture development Behler and Parrinello (2007); Jose et al. (2012); Behler (2016); Gastegger et al. (2017); Schütt et al. (2017, 2018, 2017); Ryczko et al. (2018); Zhang et al. (2018), data sampling Li et al. (2015); Podryabinkin and Shapeev (2017); Dral et al. (2017); Mardt et al. (2018); Noé and Wu (2018) and inference methods Bartók et al. (2013); Montavon et al. (2013); Botu and Ramprasad (2015); Brockherde et al. (2017); Huan et al. (2017); Bereau et al. (2018); Lubbers et al. (2018); Kanamori et al. (2018); Hy et al. (2018); Smith et al. (2017); Wang et al. (2018); Winter et al. (2019); Christensen et al. (2018); Chmiela et al. (2017, 2018a), as well as software development Chmiela et al. (2018b); Yao et al. (2018); Schütt et al. (2019) and explanation methods Alber et al. (2018); Meila et al. (2018).
Based on rigorous statistical learning theory Friedman et al. (2001); Vapnik (1995), machine learning provides a powerful and general framework for constructing force fields, since ML approaches can reconstruct complex high-dimensional objects with arbitrary accuracy, provided that sufficient data samples (molecular energies and atomic forces) are used for training. Obviously, the computational cost of evaluating ML FFs lies in between empirical FFs and ab initio reference calculations. In particular, the sGDML approach employed here is 5-10 orders of magnitude faster than ab initio calculations and 2-3 orders of magnitude slower than classical FFs, this depends of course on the molecular system. For example, in the case of a CCSD(T)/cc-pVTZ calculation, the sGDML model can be up to 107 and 109 times faster for malondialdehyde and aspirin, respectivelyChmiela et al. (2018b).
In the broadest sense, the challenge of accurately learning FFs is currently being addressed using two methods: Neural Networks (NN) Behler and Parrinello (2007); Behler et al. (2007); Behler (2011a, b); Jose et al. (2012); Behler (2016); Gastegger et al. (2017); Schütt et al. (2017, 2018) and kernel-based models Bartók et al. (2013); Bartók and Csányi (2015); Botu and Ramprasad (2015); Rupp et al. (2015); Li et al. (2015); Eickenberg et al. (2018); Podryabinkin and Shapeev (2017); Glielmo et al. (2017); Mardt et al. (2018); Christensen et al. (2018); Chmiela et al. (2017, 2018a). Both approaches can be constructed to employ energy and/or force information. Learning forces is advantageous for several reasons: (i) FFs reconstruction in the force domain yield smoother PESs, eliminating artifacts due to somewhat conflicting requirement of simultaneously reproducing accurate energies and forces Schütt et al. (2018); Chmiela et al. (2018b), the inherent uncertainty of the learning process, and using biased models that introduce unphysical approximations, e.g. atomic partitioning of the energy, (ii) obtaining energies from force models tends to diminish the noise in the prediction as a result of the integral operator, contrasting the behaviour of the forces generated by the gradient operator on energy models, and (iii) force models require smaller amounts of reference calculations to reach a desired accuracy Chmiela et al. (2017). Such data efficiency arises not only due to the fact that each force data point carries 3 components (where is the number of atoms) per reference calculation, but also because those components are orthogonal, thus providing complete information about the immediate local environment Solak et al. (2003).
Both NN and kernel-based methods can achieve formally any desired accuracy of predictions whenever a sufficiently large amount of training data is available. In contrast, when only 100s of data points are available, as in the case of high-level ab initio data, the kernel methods usually offer a better reconstruction efficiency (with a unique and well-defined solution) as they make greater use of prior information. Finally, it is important to emphasize the mandatory requirement of generating conservative ML-FFs, i.e. , to guarantee stable simulations.
The symmetric Gradient Domain Machine Learning (sGDML) FF Chmiela et al. (2018a) retains all the advantages of kernel-based ML models which directly learn forces. In fact, training the sGDML model solely using forces, besides the availability of molecular energies, improves the learning process. Given that there is evidence that combining energies and forces in the loss function degrades the quality of the force prediction Chmiela et al. (2018b); Schütt et al. (2018). The robustness of the method is explained by the fact that all atomic interactions are modelled globally, without resorting to an inherently non-unique partitioning into atom-wise, pairwise or many-body contributions. In the sGDML model: (i) Each FF model is explicitly constrained to be energy conserving, and (ii) The model complexity is further reduced through the incorporation of molecular symmetries (i.e. rigid and fluxional) that are automatically extracted from the reference dataset. All these important properties contribute to the ability of sGDML to reconstruct complex PES for molecules of intermediate size from modest amounts of reference data, an unfeasible task for non-dedicated molecular FFs. In particular, the sGDML model enables the reconstruction of CCSD(T)-quality FFs from a limited amount ( 100s) of reference molecular configurations Chmiela et al. (2018a).
In this article, we analyze some of the relevant quantum effects captured by the sGDML model while reconstructing the PES of small molecules. First, in section II, we give a short introduction to the GDML framework and its symmetrized version, the sGDML model. In section III a discussion regarding the advantages of gradient domain-based FFs is presented, as well as an analysis of dedicated vs. transferable FFs. Then, in section IV, we describe some of the quantum interactions described by the sGDML’s reconstructed PES. In particular, we focus on three ubiquitous phenomena of general interest: section IV-A) lone-pairs and electrostatic interactions, section IV-B) intramolecular hydrogen bonds and proton transfer, and section IV-C) changes in atomic hybridization state and interactions Choudhary et al. (2011); Blanco and López (2018). Note that a qualitative description of these complex interactions by regular FF would require highly specialized models, while the sGDML model captures every interaction encoded in with high accuracy. In the last section, we summarize our findings.
II sGDML model
The sGDML model enables the direct efficient construction of dedicated FFs for flexible molecules from high-level ab initio calculations. Unlike traditional FFs, it imposes no hypothesized analytical interaction models and thus in principle can model any physical interaction. Compared to other machine learning approaches, sGDML achieves high data efficiency through the incorporation of spatial and temporal physical symmetries. Global spatial symmetries include rotational and translational invariance of the energy, in addition to rigid and fluxional symmetries, which are recovered and enforced in an automatic data-driven manner. The homogeneity of time implies energy conservation, property that is enforced by learning in the gradient domain using as prior an analytically-integrable covariance function.
The latter is introduced as a linear operator constraint, by modeling the FF as the transformation of an underlying energy model. In particular, we train the gradient of a kernel ridge estimator on force labels , which – by construction – yields energy-conserving FFs that can be integrated to obtain the corresponding Born-Oppenheimer (BO) global potential-energy surface (PES) Chmiela et al. (2017, 2018a). Practically, this is achieved via the use of the Hessian matrix of a kernel as the covariance structure to solve the normal equation of the ridge estimator in the gradient domain Chmiela et al. (2017, 2018a):
[TABLE]
where is the kernel matrix, is the regularization parameter, is the identity matrix, and are the parameter-vectors to train.
The sGDML model imposes additional permutational symmetry constraints on to take advantage of PES redundancies due to rigid space group and fluxional symmetries Chmiela et al. (2018a). Typically, extracting those symmetries requires chemical and physical intuition about the system at hand, e.g. rotational barriers, which is impractical in a ML setting. Through a data-driven multi-partite matching approach (see Fig. 1), we automate the discovery of permutation matrices corresponding to the index permutation from molecular dynamics simulations by realizing the assignment between adjacency matrices of molecular graph pairs and in different energy states,
[TABLE]
The resulting approximate pairwise matchings are subsequently synchronized using transitivity within the training set as the consistency criterion. A particular advantage of our solution is its ability to limit the symmetry recovery to energetically feasible permutational configurations, given that unfeasible permutation, e.g. the permutation of two random atoms, would not contribute any valuable information the symmetrized kernel and should not be considered. This severely reduces computational efforts in evaluating the model. Finally, the FF estimator trained on reference geometries, with partial derivatives and symmetry transformations each, takes the form
[TABLE]
The corresponding energy predictor is obtained by simply integrating with respect to the Cartesian coordinates,
[TABLE]
Due to linearity of integration, the expression for the energy predictor is identical up to second derivative operator on the kernel function (see Fig. 1). Figure 1 gives a general perspective of the sGDML model by summarizing the training process, from sampling the MD trajectory and extracting the permutational symmetries to solving the linear system and reconstructing the embedded PES in the data.
The addition of spatial, temporal and permutational symmetry constraints leads to a gain in data efficiency of more than two orders of magnitude Chmiela et al. (2018a). Recently, we have systematically demonstrated that sGDML models trained on only few 100s of reference structures reconstruct molecular PESs with a mean average error of less that 0.06 kcal for small molecules with up to 15 atoms and less than 0.16 kcal for molecules as complex as aspirin, paracetamol, and azobenzene Chmiela et al. (2018a) (See Table 1, Tables S1 and S2). Hence, the explicit symmetrization incorporated in the GDML framework Chmiela et al. (2017) results in robust learning models with the ability to preserve the complex subtleties encoded in the reference data.
The sGDML models for each molecule studied in this article were initially trained on DFT data at the generalized gradient approximation (GGA) level of theory with Perdew-Burke-Ernzerhof (PBE) Perdew et al. (1996) exchange-correlation functional and the Tkatchenko-Scheffler (TS) method Tkatchenko and Scheffler (2009) to account for van der Waals interactions. The training dataset was created by subsampling MD trajectories at constant temperature (500K) using the FHI-aims package Blum et al. (2009). In the case of keto-MDA, enol-MDA and ethanol we recomputed the training configurations using all-electron CCSD(T), while in the case of Aspirin we used all-electron CCSD Turney et al. (2012); Parrish et al. (2017); Smith et al. (2018)(see Supporting Information for further detains).
III Comparison of sGDML to other ML-FF approaches
III.1 Force vs. energy model
The unique approach used by the sGDML model contrasts other models that first develop an energy function and then get the forces by analytic differentiation Alder and Wainwright (1959); Rahman (1964); Verlet (1967); Rahman and Stillinger (1971); Daw and Baskes (1984); Tersoff (1988); Warshel et al. (2006); Jorgensen et al. (1983); Mahoney and Jorgensen (2000); Weiner and Kollman (1981); Brooks et al. (1983); Halgren (1996); Soares et al. (2005); Behler and Parrinello (2007); Behler et al. (2007); Behler (2011a, b); Jose et al. (2012); Behler (2016); Gastegger et al. (2017); Schütt et al. (2017, 2018); Bartók et al. (2013); Bartók and Csányi (2015); Botu and Ramprasad (2015); Rupp et al. (2015); Glielmo et al. (2017); Faber et al. (2018). This is represented in the next diagram:
[TABLE]
where the trained predictors and their post-training derived energy and forces are presented for the sGDML and an energy ML (E-ML) model, respectively. An interesting advantage of the sGDML over E-ML models is how the training error propagates to the derived quantities. Lets assume that the prediction errors associated with the models and are and , respectively. Then, from the discrete approximation of the integral and the derivative operator, we obtain that the error in the derived energy, , is attenuated and given by while the error in the derived forces, , is amplified and given by (see Supporting Information for further detains). A direct implication of these results is that, as a whole, FFs based on E-ML are potentially less stable than gradient based FFs. Empirical evidence supporting these results as well as a proof from signal processing theory were published in the original GDML paper Chmiela et al. (2017).
Regarding the data efficiency of the sGDML, there is solid evidence from Gaussian processes (GPs) that learning linearizations of a function, e.g. gradients, is more informative than learning single points Solak et al. (2003). Such data efficiency has been systematically shown in the GDML framework Chmiela et al. (2017, 2018a, 2018b). There is empirical evidence that more than training data points would be needed in an E-ML model per each sample used in GDML to reach similar force accuracy Chmiela et al. (2017). Therefore, allowing to train molecular FFs using data from very accurate but computationally expensive reference methods, e.g. CCSD(T).
III.2 Performance of using forces vs. forces+energies for training
In the process of generating ML-FFs, the nature of the model, E-ML or gradient domain model, gives prior information regarding the problem to solve. This in the context of GPs would be equivalent to narrow the space of possible solutions. Then, a loss function is introduced in order to train the model by finding the best set of parameters that minimize such function (here presented without the regularization part):
[TABLE]
Using these loss functions would, in principle, give an optimal fitting respectively. There is the idea that such loss functions can be complemented by adding energy or force constraints, as they are often available in the reference data. In fact, several related works optimize a hybrid squared loss function of the form Schütt et al. (2018):
[TABLE]
where is a linear trade-off hyper-parameter which absorbs the differences in units and weights the force and energy contribution. By training a model using this loss function a somewhat conflicting optimization race between energies and forces is introduced. Clear empirical evidence of this issue has been reported for NN-FFs Schütt et al. (2018) and for the sGDML+E model Chmiela et al. (2018b) where in both cases the quality of the forces degrades by introducing energy constrains.
III.3 Transferable and non-transferable models
Recently, the idea of transferable or across chemical compound space ML-FFs has been under discussion but due to its complexity less progress has been achieved compared to dedicated ML-FFs. Transferable models can generate qualitatively good predictions simultaneously for different molecular systems Hansen et al. (2015b); Ramakrishnan et al. (2015); Smith et al. (2017); Huang and von Lilienfeld (2017), but clearly cannot offer reliable results for PES reconstruction where energy prediction errors are often much larger than 1 kcal mol*-1* Smith et al. (2017). On the other hand, the accuracy achieved by state-of-the-art dedicated ML-FFs can even reach couple of calories per mole in energy predictions using only a few hundreds reference calculations for training Chmiela et al. (2018a). In contrast, the above mentioned transferable model required 13 million data configurations for training. Furthermore, any gradient field generated from these models is, at the moment, not reliable because of such energy errors, which makes prohibitive to accurately capture physical interaction Smith et al. (2017). From this discussion it is apparent that transferable models cannot be used to study dynamical properties of molecules, a task easily accomplished by dedicated FFs.
IV Molecular potential-energy surfaces
In the framework of the BO approximation, contains all the information necessary to describe the dynamics of a molecular system. All electronic quantum interactions are encoded in , but in practice it is not possible to expand in different energetic contributions such as hydrogen bonding, electrostatics, dispersion interactions or other electronic effects. Therefore, the intricate form of resulting from an interplay between different quantum interactions, should be preserved in the reconstruction process. We will now demonstrate how sGDML is able to describe many complex features contained in the quantum-chemical conformational data.
In practice, these important features or interactions (e.g. energy barriers or H-bond interactions) often come in the form of subtle variations in the of less than 0.1 kcal , which is one order of magnitude lower than so-called chemical accuracy Chmiela et al. (2018a). For example, the relative stability of trans and gauche conformers of ethanol is within 0.1 kcal . Any model with an expected error above that threshold runs the risk of misrepresenting or even inverting this subtle energy difference, which will lead to incorrect occupation probabilities and hence qualitatively wrong dynamical properties. The sGDML model has been shown to satisfy the stringent accuracy requirement of 0.1–0.2 kcal for molecules with up to 15 atoms Chmiela et al. (2018a). Moreover, we found that using coupled-cluster reference data not only generates a more accurate description of the quantum system, but also improves the learning errors as shown in Table 1. In this section, we exemplify the accuracy and insights obtained with sGDML with ubiquitous and challenging features of general interest in chemical physics: electron lone pairs, electrostatic interactions, intramolecular hydrogen bonds, proton tunneling effect, and other electronic effects (e.g. steric repulsion, change in the bond nature, and bonding–antibonding orbital interaction). Figure 2 shows an overview of the different types of molecules and their reconstructed PES chosen to highlight the mentioned effects in this study.
IV.1 Electrostatic interactions and electron lone pairs
First, we focus our attention on electrostatic interactions, including atom–atom and lone-pair–atom interaction. Here, the concept of electron lone pairs play a central role; these are ubiquitous molecular features responsible for a wide variety of physical and chemical phenomena. Lone pairs are valence electrons of an atom that are not shared with any other atom in a molecule. Some examples of atoms in molecules that often present lone pairs are nitrogen and oxygen. To illustrate the interactions induced by lone pairs, we will use the keto tautomer of malondialdehyde (keto–MDA) and ethanol molecule shown in the first two rows of Fig. 2. These fluxional molecules have complex PES with a rich variety of physical phenomena (e.g. electrostatics and steric repulsion) for which the reconstruction process is not a trivial task (see Fig. 3).
IV.1.1 Oxygen–oxygen atom repulsion in keto–MDA
To illustrate the interatomic repulsion interaction we use the keto–MDA molecule, whose PES complexity is evident despite its small size (see Fig. 3-A). For example, the PES contains flat regions, which correspond to the global minima of the molecule (depicted in dark blue in Fig. 3-A), but also display intricate pathways to move to local minima. Also, one can notice the sudden energy increase when the two oxygen atoms are in the closest configuration (structure (1) in Fig. 3-A). As mentioned before, the PES is the result of many complex interactions but certainly there are parts of the PES which can be mainly ascribed to a particular phenomenon. This is the case for the yellow region in Fig. 3-A, where the closeness between the two oxygen atoms suggests that the steep increase in the energy could be primarily attributed to the electrostatic repulsion between the lone pairs in each atom. Additionally, we know that electron lone pair clouds have large spatial extent compared to shared electrons, therefore steric effects caused by electron cloud overlap could also be playing an important role in this region due to the close proximity between the two oxygen atoms ( Å). From these two interactions, only the electrostatic contribution is roughly incorporated in regular FFs as constant point charges located on each atom. This greatly constrains their flexibility and reliability to describe complex interactions. Nonetheless, systematic studies of such interactions using the sGDML model could spawn new ideas regarding their integration into regular FFs and ultimately increase the predictive power of empirical FFs.
IV.1.2 Electron lone pairs in ethanol
A particularly interesting case of strong effects of electron lone pairs is the ethanol molecule, where the lone pairs of the oxygen atom interact with the partially positive hydrogen atoms of the methyl group (structure (2) in Fig. 3). This molecule has two rotors – the hydroxyl and methyl groups – as its main degrees of freedom. The PES for ethanol Fig. 3-B exhibits a very subtle quasi-linear dependence between the dihedral angles of the methyl and hydroxyl functional groups in the trans configuration (angle zero of the hydroxyl group in Fig. 3-B) as shown by the red arrows in Fig. 3-B. Such coupling is evident when analyzing the normal modes for this configuration, where the lowest two vibrations correspond to the aforementioned coupled motion (see also Fig. 4-C). The origin of this coupling can be understood by the electrostatic attraction between the lone pairs in the oxygen atom and the partially positively charged hydrogen atoms in the methyl rotor (structure (2) in Fig. 3). The correct description of this phenomenon within the reference data is crucial to obtain accurate physical properties of any molecular system with lone pairs Chmiela et al. (2018a). It is important to stress that an accurate and general description of lone pairs is a characteristic that goes beyond the capabilities of regular FFs Chmiela et al. (2018a). The handmade FFs that attempt to include lone pairs introduce explicit extra point charges Guillot (2002), which results in highly specialized models Oroguchi and Nakasako (2017).
IV.1.3 Dynamics of coupled rotors in ethanol
The coupling between the hydroxyl and methyl rotors in ethanol manifests in the two lowest normal modes (1) and (2) shown in Fig. 4-C. The normal mode (1) corresponds to the direction indicated by the red arrow in the PES (Fig. 3-B), while (2) moves perpendicularly. Here we analyze the dynamical implications of the coupling between lone pairs and the methyl rotor. By performing molecular dynamics simulations at 300K using the NVE and NVT ensembles, we have obtained different probability distributions and therefore different free energy surfaces (FES) as shown in Fig. 4-A. In both cases the FES considerably differs from its underlying PES in the trans region. The FESNVE develops a deep minimum which traps the system into a state that boosts the occupation of the lowest vibrational mode, illustrated by (1) in Fig. 4-C. In contrast, FESNVT reverts the direction of the coupling as depicted in Fig. 4-A-(2), which in general promotes the occupation of both vibrational modes (1) and (2) in Fig. 4-C. In both cases, the FES shows an interesting and contrasting behavior that highlights the importance of the coupling between the two rotors in ethanol. A possible explanation for the difference between the two ensembles is that the NVT distributes the energy between all the molecular degrees of freedom in a more efficient way compared to NVE. The NVE ensemble relies only on the anharmonicities of the PES to redistribute the energy. Furthermore, the strong coupling between the two rotors suppresses the energy redistribution between vibrational normal modes in NVE. It is clear that only FFs with an explicit description of lone pairs can show this coupling.
As shown here for ethanol and keto-MDA, the interactions involving electron lone pairs (e.g. electrostatic interactions, steric repulsion and interactions) encoded in the PES play a fundamental role in the dynamics of the molecule and defining the free energy (Fig. 2). This has direct implications on its thermodynamics and spectroscopic properties given that a different sampling of the PES re-weights and shifts the peaks in the vibrational spectrum. Therefore, since electron lone pairs participate in many other more complex interactions, their appropriate description is the first steps to generate reliable force fields.
IV.2 Intramolecular H–bond and proton transfer
Another complex phenomenon accurately captured with the sGDML method is hydrogen bonding (H–bond). The intramolecular H–bond is a subtle interaction, which dictates the dynamical behavior of many molecules Scheiner (2017). It is responsible for very important molecular features such as the molecular structure and vibrational spectra, which result in macroscopic properties e.g. solubility and permeability Kuhn et al. (2010). Here, we will study two different types of H–bonds: standard donor–acceptor H–bond and the symmetric H–bond present in salicylic acid and the enol form of malondialdehyde (enol-MDA), respectively (Fig. 5). The symmetric H–bond allows proton tunneling to occur due to thermal fluctuations assisted by quantum nuclear effects to overcome the energetic barrier. In the case of a standard donor–acceptor H–bond, the proton is fixed to the proton donor (PD) while its dynamical behavior is strongly affected by the electron lone pair belonging to a neighboring atom (proton acceptor, PA). These two kinds of intramolecular H–bond appear very often in molecules and their presence can drastically change the physical properties of any molecule, as we show in this section for salicylic acid and enol-MDA molecules.
IV.2.1 Intramolecular H–bond
We start by analyzing the salicylic acid molecule (Fig. 5-A). This molecule presents a standard donor–acceptor kind of H–bond between the hydroxyl and carboxylic acid groups. From the schematic representation in Fig. 5-A, we can see that the effect of the H–bond in this molecule consists in allowing the proton to stretch from the PD oxygen towards the PA oxygen. The middle point (transition state) between PD and PA is represented by a red plane in salicylic acid’s PES in Fig. 5-A bottom. The energy necessary to reach this point is 10 kcal , barely accessible at room temperature. We would also like to highlight the narrow structure of the reduced PES in the transition state (red plane in Fig. 5-A bottom), which gives us an idea of how directional the H–bond is. This directionality of the interaction changes the dynamics of the participating functional groups, which results in a characteristic red-shift in the stretching frequency of O–H induced by the H–bond Hobza (2002); Karpfen and Kryachko (2009); Wang et al. (2017). From a vibrational normal mode analysis on the sGDML reconstructed PES (see Fig. 6), we observe the red-shift of the O–H stretching frequency in the participating hydroxyl group. Furthermore, the H–bond also creates a blue shift in normal modes perpendicular to the H–bond (see Fig. 6), which is a direct evidence of a O–HO bond. The proper description of these molecular features will be directly displayed in spectroscopic properties such as IR and Raman spectrum which is often used to characterize molecular structures and their interactions.
IV.2.2 Proton transfer
The second type of H–bond we analyze is the symmetric H–bond in enol–MDA, which exhibits a symmetric double-well reduced PES as schematically represented in Fig. 5-B. The energetic barrier separating the two minima is 4 kcal which occurs when the interatomic distance between oxygen atoms is 2.38 Å, this allows proton transfer between the two oxygen atoms even at room temperature. The transition state in the PES is shown by the red plane in Fig. 5-B bottom.
The energetic barrier for the proton transfer has two possible contributions: Electron density rearrangement and quasi-aromaticity. From the schematic representation of the enol–MDA in Fig. 5-C, we see that going from the global minimum to the transition state entails a redistribution of the electron density. Therefore, the redistribution of electrons in the molecule induces an energetic penalty Goldsby and Chang (2015), which contributes to the generation of a higher energy barrier. There is also evidence that molecules like enol–MDA behave as quasi-aromatic systems in the transition state Martyniak et al. (2012). This phenomenon tends to stabilize the molecule in the transition state, which lowers the energetic barrier. Capturing these intricate and subtle quantum phenomena and their dynamics requires high-level quantum chemistry methods, with CCSD(T) being the only method that converges to the correct energetic barrier. We found that systematically increasing the amount of electron correlation energy in our calculations, the energy barrier decreases as 13 5 4 kcal for HF CCSD CCSD(T), respectively. This result highlights the importance of the correlation energy in such complex phenomena as the H-bond interaction and proton transfer.
The resulting free energy of the system obtained from MD simulations using sGDML@CCSD(T) at room temperature (see Fig. 2) displays a very low proton transition rate between the two oxygen atoms but it is still accessible at room temperature. This suggests that nuclear quantum effects would considerably increase the transition rate due to tunneling effects reshaping the FES and consequently its vibrational spectrum and thermodynamics. In general, the local electron density delocalization induced by intramolecular H-bonds influence macroscopic properties such as solubility and permeability Kuhn et al. (2010), but a fundamentally different macroscopic implication of the symmetric H-bond is proton transport in extended systems like water. Therefore, the need of creating FFs capable of handling H-bonds in all of their flavors to accurately describe complex biological systems becomes obvious, and data-driven models enable a robust solution to this problem.
IV.3 Hybridization change and interactions
The two type of interactions mentioned in the previous subsections, electrostatic and H–bonds, are often approximately implemented in empirical FFs. The advantages of flexible fully data-driven models are featured in describing any quantum interaction coming from , without relying on prior knowledge of the phenomena or its connection to any classical electrodynamic or mechanical concepts. To exemplify this, we consider the paracetamol and aspirin molecules. Their dynamics are strongly influenced by delocalized electrons and the interaction Newberry and Raines (2017), respectively. The quantitative description of these phenomena occurs naturally in our data-driven FF even when only a restricted amount of reference data is available.
Hence, it is important to highlight that only now with the development of accurate ML-FFs trained on ab-initio data it is feasible to study in full extent the dynamical implication of such electronic effects at finite temperatures.
IV.3.1 Hybridization state change in paracetamol
Paracetamol is a molecule with a shallow global minimum consisting in a planar configuration (Fig. 2) stabilized by being a conjugated system. From Fig. 2 for paracetamol, a steep energy increase is evident as illustrated by yellow regions in the PES. This represents the breaking of the conjugated state, given that the nitrogen atom changes its hybridization state from producing an energetic penalty (see electronic effects column in Fig. 2). Such electronic effects raise the energy, leading to an effectively inaccessible region in this direction of the PES. In fact, this region is hardly visited by MD simulations at 300 K, therefore it is not represented in the FES in Fig. 4-B.
Another important contribution to the planar structure of paracetamol is the electrostatic interaction between the lone pair on the carbonyl oxygen and the positively charged nearest hydrogen atom (see Fig. 4-D). We find a linear coupling between the acetamide main dihedral angle and the carbonyl dihedral angle, depicted in orange and black in Fig. 4-D, respectively. The projected FES in these two variables shown in Fig. 4-B-top, reveals that near the global minimum the system moves without altering the distance since the internal dihedral angle CNCO flexes to follow the minimum free energy path. Certainly, paracetamol is a highly fluxional molecule containing four correlated rotors moving in a complex PES due to its electronic structure.
IV.3.2 interaction in Aspirin
Another important electronic effect is the overlap between occupied (lone pair ) and antibonding () orbitals, this electronic effect is depicted in Fig. 2. The aspirin molecule is a particularly interesting case in this regard given the dominant role of this interaction in its molecular behaviour. This crucial attraction interaction is responsible for the binding between the ester and carbonyl groups, which dictates the structure of the global minimum. This effect is amplified at finite temperature given that thermal fluctuations enhance the overlap between the lone pair, , in the carbonyl group and the antibonding orbital in the ester group, Chmiela et al. (2018a). We have recently shown that the energy functional form of conventional FFs put into close contact four negatively charged oxygen atoms in aspirin; such strong charge repulsion leads to a misrepresentation of its PES Chmiela et al. (2018a). The sole incorporation of the missing lone pairs in all four closely interacting oxygen atoms and their directionality could greatly improve the results in regular FFs. In general, there are many other electronic effects (e.g. interactions Deepak and Sankararamakrishnan (2016), hyperconjugation, configuration dependent charge densities and Jahn–Teller effect Sarkar et al. (2017)) that are not explicitly incorporated in conventional FFs, nor captured by less robust ML-FF frameworks, which limits the reliability and predictive power of the dynamics. This rigorous requirement is justified by the increasing demand of computationally inexpensive and highly accurate PESs to interpret and obtain further insights into state-of-the-art spectroscopic experimental results Gruene et al. (2008); Romanescu et al. (2012); Balabin (2010); Ruiz-Santoyo et al. (2016); Davies et al. (2017); Gmerek et al. (2017); Sarkar et al. (2017).
In summary, we have analyzed a wide variety of energy landscapes reconstructed with high fidelity (see Fig. 2). Being trained directly on molecular forces from ab initio calculations, the generated PES encodes the broad range of fundamental interactions coming from the solution of the quantum-mechanical problem. This indicates that our model is a general ML approach capable of describing arbitrary interatomic interactions contained in the reference data.
V Conclusions
We have presented the construction of molecular force fields using the symmetrized Gradient Domain Machine Learning model. This framework reconstructs high-dimensional manifold embedded in the training data from a few 100s of samples, allowing the use of highly-accurate ab initio reference data such as the “gold standard” CCSD(T) method. The flexibility of the sGDML model comes from its intrinsic nature of being a fully data-driven universal approximator, which grants the adaptability to describe any kind of quantum interaction.
This was demonstrated here by describing H-bonds, proton transfer, lone pairs, changes in hybridization states, steric repulsion, and interactions and by obtaining insightful results from molecular dynamics simulations. From a careful analysis of the PES and MD simulations, we highlighted the importance of electron lone pairs in generating the strong coupling between the two rotors in ethanol and in the dynamics of keto-MDA. On the other hand, the proper description of H-bonds revealed the proton dynamics in salicylic acid and enol-MDA molecules, yielding further understanding about its implications on the vibrational spectrum. Regarding electronic effects, the main contribution is that ML-FFs can be used as trustworthy tools able to describe non-trivial interactions. For example, from MD simulations we observed the hybridization change of the nitrogen atom in paracetamol which help us to understand better the strong consequences of breaking the conjugated molecular system, and that in aspirin the interaction is enhanced at higher temperatures giving extra stability to the molecular global minima.
The main advantages of the sGDML model over other machine learning methods are copious: (i) it is highly data efficient, due to being trained in the gradient domain, (ii) it is robust due to modeling all atomic interactions globally, without any kind of inherent non-unique partitioning of the energy or force contributions, (iii) it uses energy conservation as a prior, therefore encoding this fundamental physical law in the core of every gradient-domain FF model, and (iv) it correctly represents spatial symmetries using explicit constraints that are automatically extracted from the data.
A number of challenges remain to be solved in order to extend the applicability of sGDML to larger systems. In spite of its many advantages, a global model imposes limits on the maximum molecule size, as well as the training set size. Overcoming this fundamental limitation without compromising its robustness calls for the introduction of a well-reasoned fragmentation scheme that divides the reconstruction problem into smaller independent subproblems without oversimplifying the nature of the interactions. A data-driven approach could achieve this task in a way that is tailored to preserving the intricate phenomena studied in this article, as opposed to applying general coarse-graining techniques. Such an approach will benefit from the explicit knowledge about fluxional symmetries within the system, which our algorithm is already able to extract. In its current formulation, the sGDML model captures different interaction scales, with no need to separating them. Nevertheless, an explicit decoupling of long-range interactions could be a new avenue to further increase data efficiency on the way to increasingly larger and complex molecules.
VI Supplementary Material
See supplementary material for additional tables with prediction accuracy for total energies and forces from DFT and CCSD(T) data. Also, some supplementary notes regarding reference data generation, molecular dynamics details, and error propagation.
The sGDML code and documentation is available at http://quantum-machine.org/gdml/.
VII ACKNOWLEDGEMENTS
We thank Dr. M. Gastegger for helpful discussions. S.C., A.T., and K.-R.M. thank the Deutsche Forschungsgemeinschaft (project MU 987/20-1) for funding this work. A.T. is funded by the European Research Council with ERC-CoG grant BeStMo. KRM acknowledges partial support by BMBF (BZML and BBDC) as well as by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451). Part of this research was performed while the authors were visiting the Institute for Pure and Applied Mathematics, which is supported by the NSF.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Alder and Wainwright (1959) B. J. Alder and T. E. Wainwright, J. Chem. Phys. 31 , 459 (1959) . · doi ↗
- 2Rahman (1964) A. Rahman, Phys. Rev. 136 , A 405 (1964) . · doi ↗
- 3Verlet (1967) L. Verlet, Phys. Rev. 159 , 98 (1967) . · doi ↗
- 4Rahman and Stillinger (1971) A. Rahman and F. H. Stillinger, J. Chem. Phys. 55 , 3336 (1971) . · doi ↗
- 5Daw and Baskes (1984) M. S. Daw and M. I. Baskes, Phys. Rev. B 29 , 6443 (1984) . · doi ↗
- 6Tersoff (1988) J. Tersoff, Phys. Rev. B 37 , 6991 (1988) . · doi ↗
- 7Warshel et al. (2006) A. Warshel, P. K. Sharma, M. Kato, and W. W. Parson, Biochim. Biophys. Acta, Proteins Proteomics 1764 , 1647 (2006) . · doi ↗
- 8Jorgensen et al. (1983) W. L. Jorgensen, J. Chandrasekhar, J. D. Madura, R. W. Impey, and M. L. Klein, J. Chem. Phys. 79 , 926 (1983) . · doi ↗
