Tensor Hypercontraction Error Correction Using Regression
Ishna Satyarth, Eric C. Larson, Devin A. Matthews

TL;DR
This paper uses machine learning to correct errors in a computationally efficient quantum chemistry method, improving accuracy for molecular energy predictions.
Contribution
The novel contribution is applying regression models to correct errors in tensor hypercontraction approximations of quantum methods.
Findings
Nonlinear regression models reduced THC-MP3 energy errors by up to 9× compared to canonical MP3.
Error corrections improved both molecule and reaction energy predictions significantly.
Absolute and relative correction approaches were evaluated for their effectiveness.
Abstract
Wavefunction‐based quantum methods are some of the most accurate tools for predicting and analyzing the electronic structure of molecules, in particular for accounting for dynamical electron correlation. However, most methods of including dynamical correlation beyond the simple second‐order Møller–Plesset perturbation theory (MP2) level are too computationally expensive to apply to large molecules. Approximations which reduce scaling with system size are a potential remedy, such as the tensor hyper‐contraction (THC) technique of Hohenstein et al., but also result in additional sources of error. In this work, we correct errors in THC‐approximated methods using machine learning. Specifically, we apply THC to third‐order Møller–Plesset theory (MP3) as a simplified model for coupled cluster with single and double excitations (CCSD), and train several regression models on observed THC errors…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
FIGURE 1
FIGURE 2
FIGURE 3| Model | Molecule | ΔMolecule | |||
|---|---|---|---|---|---|
| %IMP | RMSE | %IMP | RMSE | ||
| 1 | MP3b | — | 0.1337 | — | 0.1337 |
| SCS | 61% | 0.0517 | 61% | 0.0517 | |
| MLR | 78% | 0.0292 | 78% | 0.0292 | |
| KRR | 85% | 0.0206 |
|
| |
| 1.25 | MP3b | — | 0.0956 | — | 0.0956 |
| SCS | 66% | 0.0323 | 66% | 0.0323 | |
| MLR | 83% | 0.0159 | 83% | 0.0159 | |
| KRR |
|
| 87% | 0.0128 | |
| 1.5 | MP3b | — | 0.0573 | — | 0.0573 |
| SCS | 68% | 0.0183 | 68% | 0.0183 | |
| MLR | 84% | 0.0091 | 84% | 0.0091 | |
| KRR |
|
| 84% | 0.0092 | |
| 1.75 | MP3b | — | 0.0316 | — | 0.0316 |
| SCS | 66% | 0.0106 | 66% | 0.0106 | |
| MLR | 83% | 0.0053 | 83% | 0.0053 | |
| KRR |
|
| 84% | 0.0051 | |
| 2 | MP3b | — | 0.0190 | — | 0.0190 |
| SCS | 58% | 0.0079 | 58% | 0.0079 | |
| MLR | 83% | 0.0032 | 83% | 0.0032 | |
| KRR |
|
|
|
| |
| Model | %IMP | RMSE | %IMP | MAE | %IMP | MAPE | |
|---|---|---|---|---|---|---|---|
| 1 | MP3b | — | 1.5711 | — | 0.9374 | — | 27.51% |
| R | 38% | 0.9835 | 29% | 0.6625 | 34% | 18.14% | |
| ΔR |
|
|
|
|
|
| |
| 1.25 | MP3b | — | 0.8869 | — | 0.4853 | — | 11.33% |
| R | 55% | 0.4050 | 45% | 0.2675 | 27% | 8.31% | |
| ΔR |
|
|
|
|
|
| |
| 1.5 | MP3b | — | 0.5975 | — | 0.2998 | — | 10.77% |
| R | 51% | 0.2960 | 37% | 0.1894 | 48% | 5.63% | |
| ΔR | 53% | 0.2828 | 36% | 0.1915 | 36% | 6.93% | |
| 1.75 | MP3b | — | 0.4036 | — | 0.1943 | — | 4.35% |
| R | 51% | 0.1969 | 40% | 0.1174 | 13% | 3.77% | |
| ΔR | 51% | 0.1981 | 39% | 0.1182 | 12% | 3.81% | |
| 2 | MP3b | — | 0.2814 | — | 0.1327 | — | 2.81% |
| R |
|
|
|
|
|
| |
| ΔR | 53% | 0.1322 | 45% | 0.0736 | 41% | 1.66% |
- —National Science Foundation10.13039/100000001
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Materials Science · Tensor decomposition and applications · Quantum many-body systems
Introduction
1
The calculation of accurate electronic energies for atoms, molecules, and extended systems is a cornerstone activity in chemical physics and physical chemistry. A quantitative description of chemical processes such as bond breaking and formation, interaction with light due to transitions to excited states, and interaction of electron/nuclear spin and external magnetic fields all require a quantum treatment of the electronic degrees of freedom and consequently, complex and time‐consuming calculations of the electronic structure and energy. Among the most accurate electronic structure techniques are wavefunction‐based methods, such as perturbation theory (Møller–Plesset or many‐body), configuration interaction (CI), and coupled cluster (CC), among many related techniques. These methods all have the common feature of steep computational scaling with system size: apart from second‐order Møller–Plesset perturbation theory (MP2) and similar approximations to CI or CC, such methods scale as at least 𝒪(N6) where N is a measure of system size [1, 2, 3].
A number of techniques to reduce the scaling of such methods through controlled approximations have been proposed over the past several decades, primarily focusing on either a localization of the molecular orbital space and exploitation of the resulting sparse structure of the Hamiltonian and wavefunction, for example the domain‐localized pair natural orbital family methods [4, 5, 6], and/or via a divide‐and‐conquer technique utilizing decomposition of the system either in real space (e.g., FMO and XSAPT) [7, 8, 9] or in Hilbert space (LNO‐CIM, MBBE, DEC) [10, 11, 12, 13]. Alternatively, several methods have been proposed to reduce scaling via tensor factorization, whereby a global “low‐rank” decomposition of the Hamiltonian and/or wavefunction results in lowered computational scaling [9, 14, 15, 16, 17, 18]. Of course, each of these techniques necessarily introduces errors into the electronic energy and other properties. In practice, the magnitudes of the reduction in compute time and the energy error must be carefully balanced in order to obtain meaningful results on larger systems of interest.
In this work, we focus on the least squares tensor hypercontraction (LS‐THC) factorization, introduced by Hohenstein et al. [14, 15, 19, 20, 21, 22, 23, 24, 25], for which a number of specific methods have been developed. Song and Martinez [26] combined scaled opposite‐spin second‐order Møller–Plesset perturbation theory (SOS‐MP2) with THC and graphical processing units (GPUs) for efficient quantum chemistry on large molecules like proteins. These methods reduced computational scaling from quartic (fourth degree) to near‐cubic (third degree) or even linear for large systems, enabling accurate, fast calculations for complex biological systems. Matthews [25] discusses the least‐squares tensor hypercontraction method for third‐order Møller–Plesset perturbation theory (MP3), demonstrating promising accuracy and efficiency compared to standard density fitting. Lee et al. [24] also discuss LS‐THC‐MP3 in the context of grids automatically generated using a weighted centroidal Voronoi partitioning. More recently, Hohenstein et al. [27] report a full nonlinear THC method (i.e., where collocation matrices are determined via direct nonlinear optimization) for coupled cluster with singles and doubles (CCSD) with excellent accuracy and efficiency via multi‐GPU acceleration. This latter work demonstrates that the larger errors encountered in LS‐THC‐MP3 [24, 25] are not a fundamental feature of THC but an artifact of the least squares scheme, suggesting potential improvement via external correction.
Machine learning is one such powerful technique which can be used to reduce errors in computational chemistry methods. McGibbon et al. [28] introduced spin‐network‐scaled MP2 (SNS‐MP2) for dimer interaction energies, achieving accuracy comparable to benchmark methods using neural networks. Notably, Behler [29] employed feed‐forward neural networks (NNs) to analyze the potential energy solution, highlighting the importance of careful training for effective results. In another paper [30], Behler present a high‐dimensional NN based approach to evaluate potential energy surfaces by converting Cartesian coordinates onto a set of symmetry functions. Subsequent studies, including Schütt [31] and Gilmer et al. [32], explored deep tensor neural networks and message passing neural networks, respectively, to model molecular properties and enhance computational efficiency, however these were studied for smaller molecules only.
Here, we leverage regression techniques in order to machine learn the errors which result from the least‐squares tensor hypercontraction approximation to third‐order Møller–Plesset perturbation theory (LS‐THC‐MP3). We focus on MP3 for several reasons: first, the MP3 energy may be conveniently divided into multiple physically‐motivated components which can serve as independent features for machine learning (discussed further in the next section). Second, MP3 captures the majority of the necessary physics from more complex models such as coupled cluster with single and double excitations (CCSD), while being computationally simpler and more easily analyzable [15, 19, 25, 33]. Third, we have already studied the theoretical properties of LS‐THC‐MP3 in depth [25], in particular the relatively more inaccurate approximation of the first‐ or second‐order wavefunction via LS‐THC. The larger magnitude of these errors compared to methods which only require a decomposition of the two electron integrals [25, 34] makes LS‐THC‐MP3 a prime target for machine learning correction and a gateway to more general corrections of LS‐THC energies.
The main contributions of this work are:
- We apply LS‐THC approximation methods to MP3 using the extensive MGCDB84 database [35] as a training set.
- We correct the error due to LS‐THC in MP3 using linear and nonlinear regression techniques.
- We compare the performance of multiple linear regression (MLR) to nonlinear kernel ridge regression (KRR) models across a range of application conditions and model scenarios.
- We analyze the role of absolute and relative error correction in regression fitting of molecular and reaction energies.
Theory
2
We start by reviewing the basic theory of third‐order Møller–Plesset theory and least squares tensor hypercontraction, then explain how the errors are induced due to approximations in LS‐THC over MP3. We will then introduce our concept and approach towards application of machine learning correction of LS‐THC‐MP3.
Third‐Order Møller–Plesset Theory
2.1
The total MP3 electronic energy for closed‐shell systems, which includes the reference self‐consistent field (SCF) and MP2 energies is given by [1],
Here, gqspr and t[1]ijab are elements of the two‐electron integral tensor (the two‐particle part of the normal‐ordered Hamiltonian) and the first‐order wavefunction amplitudes, respectively,
Specifically, the closed‐shell formulation uses only the “opposite‐spin” two‐electron integrals and amplitudes. For a canonical reference determinant, the first‐order amplitudes are,
where ϵp are the orbital energies. The components EC and EX of the MP2 correlation energy are related to the opposite‐spin and same‐spin contributions used in spin‐component‐scaled MP2 (SCS‐MP2) [36] by EOS=12EC and ESS=12EC+EX. In the MP3 correlation energy, the 10 energy components En play a similar role. We can identify these components with distinct Goldstone diagrams [25] as in Figure 1.
Goldstone diagrammatic depiction of the 10 components of the MP3 energy. Solid horizontal lines denote first‐order amplitudes while dashed lines denote two‐electron integrals. Note that E9 and E10 are each the sum of two Hermitian conjugate diagrams.
Least‐Squares Tensor Hypercontraction
2.2
The LS‐THC approximation of MP2 and MP3 proceeds by the factorization of grspq and t[1]ijab into (hyper)products of five matrices [14, 15],
The collocation matrix XpP is obtained by evaluating molecular orbitals ψp at predetermined grid points xP, and is shared between both the integral and amplitude factorizations. These grid points are either predefined in analogy to molecular basis sets [37], or determined by pruning a larger parent grid [23, 24]. Only the least squares variant of THC is explored in this work, and so “THC” is taken to mean “LS‐THC” in the remainder of the manuscript. In our implementation, distinct sets of grid points are used for pairs of molecular orbitals with different occupancy, that is, virtual‐virtual, virtual‐occupied, and occupied‐occupied. A least‐squares fit is used to determined the orbital pair‐specific core matrices VPQ(pq)(rs) from density‐fitted (DF) integrals, [23]
and similarly for core matrix TPQ[1] through matrix‐free fitting of the denominator‐weighted THC‐approximated integrals [25]. Because the amplitudes and integrals are no longer related exactly via (6) after THC approximation, the form of (1) as written suggests the “THC‐MP2b” and “THC‐MP3b” variants, which we further shorten to simply “MP2b” and “MP3b” as these automatically imply THC [25]. In practice, we compute both the MP2a (approximation of only gqspr) and MP2b (approximation of both integrals and amplitudes) energies, and both MP3b and MP3d energies, where the latter involves an LS‐THC fit of the second‐order wavefunction amplitudes. We will denote these specific energies as En,b, En,d, etc.
In order to apply machine learning techniques, we introduce a modified MP3 energy,
where all cn=1 for canonical MP3. Likewise, the THC‐approximated energy may be expanded in terms of individually approximated energy components, for example,
which is potentially close to EMP3 for some choice of c. This construction is quite similar to SCS‐MP2 [36], in that each energy component is assigned an independent coefficient. The use of such a modified energy formula is critical for use in nonlinear machine learning in that the coefficients cn are pseudo‐bounded (typically between 0 and 1, and specific constraints can be enforced), while the MP3 energy itself is scale‐dependent and potentially of very different magnitude even in similarly‐sized systems. Additionally, in the context of the THC approximation (which necessarily introduces some error in the MP3 energy), we can express the parameterized THC‐MP3 energy as an energy difference,
where cn′=cn−1.
Machine Learning
2.3
We applied two regression techniques to determine optimal coefficients cn in either a global or a system‐dependent way: multiple linear regression (MLR) and kernel ridge regression (KRR).
Linear and Nonlinear Regression
2.3.1
In the regression approach, the free parameters C of a model function ϕ(x;C) are determined in order to minimize an objective (or loss) function, typically the mean squared variation between the model function and the fitting data (labels) Y={y1,y2...yN}
for N observable sets of K features, M prediction features, and K′ fitting parameters per prediction feature. The simplest form is multiple linear regression [38], which takes the functional form,
For example, the SCS‐MP2 method [36] may be considered a multiple linear regression of K=K′=2 features with zero y‐intercept.
One version of nonlinear regression uses kernels to calculate vector similarity in a high dimensional space. For example the radial basis function (RBF) or Gaussian kernel represents an exponential transform of the dot product ⟨χi, χk⟩ for distributions χi and χk in an infinite dimensional space,
This means we can consider such a model to have the same number of parameters as observable kernel dimensions, even though K′=N. Thus, the model is considered to have a continuous (infinite‐dimensional) fitting space parameterized by the shape of the kernel 𝒦(xi,xj). The dimensionality of the RBF model may lead to over‐fitting, as well as to an ill‐conditioned solution, but provides a powerful capability to model nonlinear relationships. The model can also be amended with ridge (also called L2 or Tikhonov) regularization by adding a penalty α∑iN∑jMcjiŷij to the objective function. The combination of these approaches leads to the well‐known kernel ridge regression (KRR) technique [38]. The parameters γ and α are not easily determined via closed‐form solution as are the parameters C; instead, these are treated as hyper‐parameters and estimated via an additional optimization procedure such as grid search and Nelder‐Mead downhill simplex optimization, vide infra.
There are other nonlinear, kernel‐based regression techniques such as support vector regression (SVR) and Gaussian process regression (GPR), but they differ in their underlying principles and approaches. We focus on KRR in this work as it is highly efficient and typically as performant as SVR and GPR [39, 40, 41, 42].
Feature Scaling
2.3.2
All models discussed above benefit from transformation of the observable features X and labels y into a well‐defined distribution, typically unit normal [43, 44]. This process of feature scaling serves to equalize the influence of different features on the objective function and to better reflect the assumptions of the KRR model. While training is done in the scaled feature (and label) space, all reported fit values are in the unscaled space, for example, with the original units of energy. Importantly, any new data (beyond the original training set) must be scaled to the same distribution before applying the regression model, making the scaling parameters also part of the model in practice.
Computational Details
3
Data Set
3.1
As a training/test set, we use a subset of the Main Group Chemistry Database (MGCDB84) [35] consisting of all closed‐shell systems composed of elements hydrogen through fluorine. We focus only on this subset of MGCDB84 due to the lack of extant THC benchmarks on open‐shell and third period elements, and in particular studies of necessary grid size and quality for heavier elements. We recently developed an extension of THC‐MP3 to open‐shell systems [45], but have only performed limited benchmarks thus far. In total, this data set contains 4370 species and 2680 reactions. For each species, we computed canonical Hartree–Fock (HF), DF‐MP2 and ‐MP3 and approximate MP2a, MP2b, MP3b, and MP3d energies (with all energies broken down into distinct diagrammatic contributions). THC calculations were performed starting with an SG‐1 parent grid [46], pruned via pivoted Cholesky factorization according to a tolerance parameter 10−δ [23]. Smaller δ values signify more inaccuracies and hence larger THC errors and vice versa, whereas smaller δ values also indicate a lower computational cost of the calculation. The cc‐pVDZ orbital and cc‐pVDZ‐RI auxiliary basis sets as well as the frozen core approximation were used throughout [23, 47, 48, 49, 50]. We computed THC data for δ=1, 1.25, 1.5, 1.75, and 2. All calculations used a development version of the CFOUR program package [51].
We also computed additional system‐specific molecular features that affect the canonical MP3 energy and THC‐MP3 energy calculations: HOMO‐LUMO gap and HUMO‐LOMO separation (total orbital eigenvalue span); THC goodness‐of‐fit measures fpq=1−‖BqJp−BqJ,THCq‖F/‖BqJp‖F for pq=ab, ai, and ij; and norms ‖gijab‖X and ‖t[1]ijab‖X for X=F,∞ (note that the ∞‐norm is computed as a vector norm, that is, the maximum absolute element). This brings the total input features to 34 (ten components each of two linearly independent variants of THC: MP3b and MP3d, two components each of two variants of MP2: MP2a and MP2b, nine system‐specific molecular features and one Hartree Fock energy). These additional features capture variation in the magnitude and type of interactions present in the Hamiltonian, as well as some measure of how well THC approximates the Hamiltonian.
Energy values and g/t[1] Frobenius norms were also normalized by dividing by the number of valence electrons to account for molecules of varying size. As the Frobenius norm is invariant to basis set rotation, we can see that by rotation to a localized basis the norm of the two‐electron integrals must scale linearly with molecular size in the asymptotic limit, and a similar argument can be made for the first‐order amplitudes. Deviation from asymptotic behavior for small molecules provides a nonlinear feature dimension which should aid in fitting. Finally, we rescale the THC fit parameters as f˜pq=log(1−fpq) and apply standard unit normal scaling to all feature dimensions (both training and inference) as well as the reference (canonical MP3) values for training only.
From this data we form four distinct label sets for training and/or evaluation. The molecule set simply contains the canonical MP3 energies for each of the 4370 species in the data set. The Δmolecule set instead tabulates the THC error directly as ΔEMP3=EMP3−EMP3b. For the reaction set, we combine MP3 reference energies according to the 2680 tabulated reaction schemes in our subset of MGCDB84. Here, we do not normalize the individual energies by the number of valence electrons, but rather divide the computed reaction energy ΔErxn,MP3 by the total number of reactants and products, that is, the sum of the absolute values of the stoichiometric numbers ∑i=1nspecies|νi|, for training only. Finally, the Δ reaction set combines the Δ molecule and reaction approaches by computing reference values which are THC reaction energy errors ΔΔErxn,MP3b=ΔErxn,MP3−ΔErxn,MP3b.
To summarize:
- molecule: Directly predict MP3 energies for each molecule based on 34 features.
- Δ molecule: Predict ΔEMP3 from 34 features per molecule.
- reaction: Predict ΔErxn,MP3 using the molecular energies.
- Δ reaction: Predict ΔΔErxn,MP3 using the errors in molecular energies.
We performed 10‐fold cross‐validation by splitting the data into 10 roughly equal‐sized sets or folds and training 10 independent models, holding one fold out as testing data in each case. The variation of the testing losses across folds then gives a measure of consistency within the data set and the mean of the losses gives a robust performance estimate. This serves two purposes: to verify that the model has not significantly overfit the training data (i.e., that it is generalizable) and that the distribution of the data does not contain significant outliers.
Regression Models
3.2
We applied both MLR (molecule data only) and KRR [both (Δ)molecule and (Δ)reaction] regression, the latter with an RBF kernel. Models for the reaction and Δ reaction sets were first trained on the corresponding molecule and Δ molecule sets and then evaluated by sequentially predicting total (EMP3) or relative (ΔEMP3) energies for each species and then combining with the stoichiometric numbers to form a reaction energy prediction. Thus, the quality of the reaction energy predictions will depend on the extent of error cancellation retained or introduced in the molecular energies. The KRR hyper‐parameters were tuned by initial grid search followed by Nelder‐Mead downhill simplex optimization [52]. We employed ten‐fold cross‐validation for both hyper‐parameter search and final evaluation, where the average of the ten losses is reported.
In addition to training models with the feature set and standard scaling described above, we also trained MLR models with a reduced feature set consisting only of the 10 MP3b energy components, without feature scaling, and with a zero y‐intercept. This model mirrors the standard SCS‐MP2 [36] in that the original energy is simply modified component‐wise in order to improve accuracy, and will be known as scs‐molecule. Although, in this work we only target regression of the THC energy correction and not experimental or high‐level computational benchmarks. Note that the scs‐Δmolecule and mlr‐Δmolecule models are equivalent to the corresponding molecule models via (14).
Results and Discussion
4
In this section, we will discuss the results of linear and nonlinear regression techniques. Errors for molecule and Δmolecule data are reported in kcal·mol−1·e−1, and in kcal·mol−1 for reaction and Δreaction. As mentioned earlier, lower δ values (∼1) produce intrinsically less accurate THC factorizations compared to larger values of δ (∼2), which also affects the overall accuracy and accuracy improvement of the machine learned models.
Evaluation Criteria
4.1
Statistical errors over the test set are reported as root mean squared errors (RMSE) for all data sets, and additionally as mean absolute errors (MAE) and mean absolute percent errors (MAPE) for reaction and Δ reaction,
In order to compare machine learned models to the uncorrected THC‐MP3b approximation, we also compute fractional improvement scores as,
for X=RMSE, MAE, or MAPE. As the test set contains a wide variety of reactions/interactions, ranging from low‐energy conformational rearrangements to high‐energy bond dissociation and cluster binding energies, consideration of multiple error measures is critical to a full understanding of model performance.
molecule and Δmolecule
4.2
In Table 1, RMSE and fractional improvements (%IMP) are reported for the SCS, MLR, and KRR models on the molecule and Δmolecule dat sets. For δ=1, we see that the original MP3b RMSE of 0.1337 kcal·mol−1·e−1 was reduced to 0.0517 kcal·mol−1·e−1 using scs‐molecule which is a 62% improvement over MP3b. This error was further reduced to 0.0292 kcal·mol−1·e−1 using mlr‐molecule which is a 78% improvement over MP3b. krr‐molecule reduced this error to 0.0206 kcal·mol−1·e−1 which is a 85% improvement and krr‐Δmolecule reduced the error further to 0.0153 kcal·mol−1·e−1 which is a 89% improvement over RMSE of MP3b. For other values of δ (=1.25, 1.5, 1.75, and 2), we see similar trend for scs‐molecule, mlr‐molecule, and krr‐molecule, while krr‐Δmolecule becomes slightly less effective for larger δ values.
Considering the different machine learning models employed, the scs‐molecule model reduces the overall MP3b errors by approximately 60%–70%, with consistent improvement even for larger δ values. While medium values of δ achieve a larger improvement in these tests this is likely not a significant phenomenon. As with SCS‐MP2, a “straightforward” multiple linear regression model can directly correct the errors in THC‐MP3b with respect to the canonical MP3 ground state energies. From Table 1, we can see that employing a SCS correction with δ=1 reduces errors to a similar level as for uncorrected THC‐MP3b with δ=1.5. Based on the performance scaling results of Reference [45], a reduction of δ from 1.5 to 1 corresponds to approximately a factor of 3 reduction in computational effort.
Next we can compare the RMSE results of mlr‐molecule, which uses 34 scaled input features, to scs‐molecule, which uses only 10 unscaled THC components as input features. It is evident that the RMSE of mlr‐molecule is almost half that of scs‐molecule, and almost one‐sixth of MP3b errors across all values of δ, signifying that mlr‐molecule outperforms scs‐molecule when correcting the THC errors with respect to canonical MP3 values. Inclusion of additional physical features is thus an effective way to extend SCS‐like approaches with low or no computational overhead. This approach could be used for novel applications of SCS, such as the present application to MP3 or other methods with distinct diagrammatic contributions, or as a modification of existing SCS‐MP2 approaches. The fractional improvement for MLR is slightly lower at δ=1 than for higher δ values (78% vs. ∼83%). While the sample size is not large enough to show statistical significance, this might indicate a more significant nonlinearity of the feature space at this approximation level, which we discuss further below in the context of KRR.
As a first step to implement KRR, we have used grid search to identify the KRR hyper‐parameters α and γ for which the KRR performs best using an initial grid search followed by simplex optimization. An example of the grid search results are given in Figure 2. For δ=1, we can see that the maximum fractional improvement of 84% at three [α, γ] combinations–[1e‐09, 1e‐07], [1e‐07, 1e‐06] and [1e‐05, 1e‐05]. Nelder‐Mead downhill simplex optimization is performed starting at point [1e‐07, 1e‐06] with α range [1e‐09, 1e‐05] and γ range [1e‐07, 1e‐05]. The major trend in the grid search results, depicted in full in Figure S1, shows that the best fit for krr‐molecule is possible with very small α and γ values, which is the upper left corner of the heatmap, indicating a tight, quasi‐linear fit in the molecule dataset (the limit α,γ→0 corresponds to the trivial prediction ŷ=0). However, for krr‐Δmolecule with δ=1 (Figures 2 and S2), a separate maximum is observed near γ=0.01 and α=0.001, indicating significant nonlinearity in the Δmolecule dataset which can be exploited by the RBF basis. See Table S1 in the Supporting Information for the final optimized hyper‐parameters.
RMSE %IMP for krr‐molecule and krr‐Δmolecule model with δ=1 as a function of hyper‐parameters α (regularization strength) and γ (inverse length scale).
From Table 1, we observe that for all the δ values, the nonlinear models (krr‐molecule and krr‐Δmolecule) have outperformed the linear models (mlr‐molecule and mlr‐Δmolecule), indicating the presence of at least some kernel nonlinearity in feature space. Across all values of δ, krr‐molecule and krr‐Δmolecule achieve between 6‐ and 9‐fold reduction in RMSE over MP3b. krr‐molecule and krr‐Δmolecule perform very similarly with the exception of δ=1. The higher improvement for krr‐Δmolecule here highlights a larger degree of nonlinearity in that particular dataset. As the transformation from absolute to relative errors should remove large, consistent contributions which map linearly onto input features, larger nonlinearity in the error residuals is to be expected. Instead, the lower degree of nonlinearity in the data for larger δ values more likely stems from an increase in incoherent noise as the errors approach zero. Compared to SCS, KRR reduces the error at δ=1 to a similar level as uncorrected MP3b at δ=2. The enhanced corrective power of KRR could then lead to an approximately order of magnitude reduction in required computational time for a similar level of accuracy in absolute molecular energies. While the features used as inputs to the regression model do incur an additional computation cost, an optimized implementation of feature computation (especially the fit parameters fpq) would render this cost negligible. In our unoptimized implementation the extra cost is less than 30% of the total time.
Figure 3 reproduces the RMSE data from Table 1 in graphical form, along with “whiskers” representing the standard deviation across different validation sets. we can compare the RMSE of MP3b, scs‐molecule, mlr‐molecule, mlr‐Δmolecule, krr‐molecule, and krr‐Δmolecule for different δ values. Longer whiskers indicate that the RMSE from 10 fold validation have larger variation and smaller whiskers indicate that the 10 RMSE results are tightly clustered around the mean, suggesting a more reliable and stable model performance across data points. Larger standard deviations more likely represent variability in the dataset rather than overfitting, as overfitting (especially in the KRR model) would manifest as poor test set predictions for all folds, lowering the average improvement as well.
Comparative analysis of absolute energy RMSE for all models. Whiskers denote one standard deviation across the ten cross‐validation sets.
We also observe that although scs‐molecule has reduced the MP3b errors by almost half, the variation of prediction errors are very large; this variation in the RMSE has been significantly reduced using higher dimensional and scaled dataset using mlr‐molecule and mlr‐Δmolecule. The variation in krr‐molecule is smaller as compared to krr‐Δmolecule, presumably due to a nonuniform distribution of the nonlinearity in the Δmolecule dataset. However, for δ=1.25, we see an abrupt change in behavior in the standard deviations: the variation of the nonlinear models (KRR) is higher than that of the linear model (MLR), indicating the presence of outliers or uneven fit quality. The 1‐σ ranges of different models for the molecule and Δmolecule datasets overlap, indicating similar overall statistical performance. For example in case of mlr‐molecule and mlr‐Δmolecule for all δ values, and krr‐molecule and krr‐Δmolecule for δ=2. It is evident from Figure 3, that the nonlinear regression models have reduced the THC‐MP3b errors significantly for all δ values.
reaction and Δreaction
4.3
Table 2 presents RMSE, MAE, and MAPE results for the krr‐reaction and krr‐Δreaction models. For δ=1, the RMSE of MP3b is 1.5711 kcal·mol−1, which was reduced by krr‐reaction to 0.09835 kcal·mol−1 which is a 38% improvement and by krr‐Δreaction to 0.05732 kcal·mol−1 which is a 64% improvement. It is clear that using krr‐reaction and krr‐Δreaction, the reduced THC errors from molecular energies were able to yield reduced THC errors in reaction energies from corrections to isolated molecular energies alone, although the extent of improvement is somewhat less.
We also see that in terms of RMSE, krr‐Δreaction outperforms krr‐reaction for δ=1 and 1.25, whereas both models perform similarly for δ=1.5 and 1.75, and the relationship reverses for δ=2. We see that krr‐reaction has performance improvement ranging between 38% for δ=1 and 57% for δ=2, and krr‐Δreaction has performance improvement ranging between 51% for δ=1.75 and 65% for δ=1.25, with all other δ between these ranges. We also see that the performance of both the models “peak” at δ=1.25. Also, for δ=1.5, 1.75, and 2, the RMSE of both the models has marginal difference. Hence we will also look for how both the models perform on an absolute scale. The better performance of krr‐Δreaction for small δ values highlights that for loose THC approximations, the larger scale of the energy corrections offers more scope of improvement through nonlinear fitting, along with a potential increase in nonlinearity due to cancellation between molecular energies. As with the molecule data set, KRR provides the most relative improvement for large initial errors. However, the larger improvement of the KRR model for molecular energies compared to reaction energies points to inadequate error cancellation in the krr‐molecule models. Even though it uses physical features, the KRR model is not physics‐based and so cannot “anticipate” the necessary cancellation required to most accurately predict residual reaction or interaction energies. Instead, the KRR model introduces more incoherent random errors which do not reliably cancel.
We see that for all δ values, the RMSE are consistent with MAE and MAPE, except for δ=1.75. In this case, the MP3b MAPE reduces suddenly from 10.77% to 4.35%, while the KRR models show a smoother decrease in MAPE with increasing δ. However, with the exception of this point, the MAE and MAPE fractional improvement of both the models are roughly comparable across all values of δ and obtain a 2–3× reduction in error for predicted reaction energies.
Conclusions
5
From the above results, we can conclude that all tested machine learning models provide significant reduction in LS‐THC‐MP3b errors at little to no additional computational cost. Multiple linear regression based on a spin‐component‐scaling approach is effective, yielding an improvement factor of 58%–68% over MP3b, but introducing additional physical input features further improved to 78%–84%. The significant effect of introducing additional energy‐based and nonenergy‐based input features suggests that “traditional” SCS approaches could also be improved, and that other methods admitting a diagrammatic decomposition could be corrected using an SCS‐like scheme.
However, introducing a nonlinear regression model with KRR reduced the THC errors further, indicating nonlinearity present in the molecule and Δmolecule data sets, that was exploitable using a radial basis function basis. The performance of krr‐molecule is between 85% and 87%, and for krr‐Δmolecule is in the range 84%–89%. The performance of both the models varied slightly depending on δ values, with krr‐Δmolecule achieving the best performance at δ=1, which corresponds to the loosest THC approximation. These results highlight the ability of KRR to more accurately correct large, nonlinear errors in comparison to MLR.
We also saw that by training regression models to correct THC errors in molecules (the molecule data set), we can also reduce THC errors in derived reaction and interaction energies (the reaction data set) utilizing the concept of error cancellation between reactants and products in a reaction. The performance of krr‐reaction ranges from 38% to 57% improvement over MP3b and krr‐Δreaction ranges from 51% to 65% improvement over MP3b. The performance of both the models varied depending on δ values. The lower improvement of reaction energies compared to molecular energies highlights the random distribution of errors produced by KRR which cannot exploit physical similarity of reactants and products.
While this study focused only on the closed‐shell, second period subset of MGCDB84, and models trained on this subset are not likely to generalize to the wider data set, our observations support the conclusion that retraining of regression models on more diverse systems is likely to also result in substantial reduction of THC errors. Importantly, the relative quality of grids for light and heavy elements must be balanced in order for the grid size parameter δ to be meaningful.
In summary, we show that simple regression models are highly effective in correct errors inherent in the LS‐THC approximation, with nonlinear kernel ridge regression providing the most improvement, more than 84% for total molecular energies, but for the reaction data set, the improvement was less than 65%. This situation highlights both the promise of regression for overcoming the accuracy limitations of LS‐THC, but also the limitations of the regression method itself for studying subtle changes in reaction energies which depend on error cancellation.
Conflicts of Interest
The authors declare no conflicts of interest.
Supporting information
Data S1. Supporting Information.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1I. Shavitt and R. Bartlett , Many‐Body Methods in Chemistry and Physics: MBPT and Coupled‐Cluster Theory, Cambridge Molecular Science (Cambridge University Press, 2009).
- 2T. Helgaker , P. Jorgensen , and J. Olsen , Molecular Electronic‐Structure Theory (Wiley, 2013).
- 3B. Jansík , S. Host , P. Jorgensen , J. Olsen , and T. Helgaker , Linear‐Scaling Symmetric Square‐Root Decomposition of the Overlap Matrix (John Wiley & Sons, 2013).10.1063/1.270988117411105 · doi ↗ · pubmed ↗
- 4C. Riplinger and F. Neese , “An Efficient and Near Linear Scaling Pair Natural Orbital Based Local Coupled Cluster Method,” Journal of Chemical Physics 138 (2013): 034106.23343267 10.1063/1.4773581 · doi ↗ · pubmed ↗
- 5Y. Guo , C. Riplinger , U. Becker , et al., “Communication: An Improved Linear Scaling Perturbative Triples Correction for the Domain Based Local Pair‐Natural Orbital Based Singles and Doubles Coupled Cluster Method [DLPNO‐CCSD(T)],” Journal of Chemical Physics 148 (2018): 011101.29306283 10.1063/1.5011798 · doi ↗ · pubmed ↗
- 6P. Pinski and F. Neese , “Communication: Exact Analytical Derivatives for the Domain‐Based Local Pair Natural Orbital MP 2 Method (DLPNO‐MP 2),” Journal of Chemical Physics 148 (2018): 031101.29352787 10.1063/1.5011204 · doi ↗ · pubmed ↗
- 7D. G. Fedorov , J. C. Kromann , and J. H. Jensen , “Empirical Corrections and Pair Interaction Energies in the Fragment Molecular Orbital Method,” Chemical Physics Letters 706 (2018): 328–333.
- 8K. U. Lao and J. M. Herbert , “An Improved Treatment of Empirical Dispersion and a Many‐Body Energy Decomposition Scheme for the Explicit Polarization Plus Symmetry‐Adapted Perturbation Theory (Xsapt) Method,” Journal of Chemical Physics 139 (2013): 034107, 10.1063/1.4813523.23883010 · doi ↗ · pubmed ↗
