Extrapolating Local Coupled Cluster Calculations toward CCSD(T)/CBS Binding Energies of Atmospheric Molecular Clusters
Yosef Knattrup, Jonas Elm

TL;DR
This paper improves the accuracy of simulating atmospheric particle formation by benchmarking and optimizing quantum chemical methods for calculating molecular cluster binding energies.
Contribution
The study introduces a new benchmark set and demonstrates that LNO–CCSD(T) methods provide better accuracy-to-cost ratios for atmospheric clusters than DLPNO–CCSD(T0).
Findings
LNO–CCSD(T) methods outperform DLPNO–CCSD(T0) in accuracy and cost for atmospheric molecular clusters.
CBS extrapolation with aug-cc-pVTZ and aug-cc-pVQZ basis sets is feasible for LNO–CCSD(T) on clusters of 4–8 monomers.
LAF extrapolation reduces basis set errors and improves simulated cluster formation rates.
Abstract
Aerosols are the largest source of uncertainty in modern global radiative forcing modeling. Atmospheric molecular clusters are important intermediates in atmospheric new particle formation (NPF). The evaporation rate of clusters can be calculated using quantum chemical methods, with an exponential dependence on the free energy. Hence, for simulating accurate NPF rates, high-accuracy calculations are needed. We have constructed a versatile benchmark set of 218 conformers of atmospheric molecular dimer clusters consisting of sulfuric acid (SA), formic acid (FA), nitric acid (NA), methanesulfonic acid (MSA), water (W), ammonia (AM), methylamine (MA), dimethylamine (DMA), trimethylamine (TMA), and ethylenediamine (EDA) molecules. Using this test set, we benchmark the local coupled cluster methods, DLPNO–CCSD(T0) and LNO–CCSD(T), using different basis sets and locality settings, and test…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
1
2
3
4
5
6
7
8
9| Method | Basis Sets | Cardinal number |
|---|---|---|
| CCSD(T) | aug-cc-pV |
|
| CCSD(F12*)(T+) | cc-pV |
|
| LNO–CCSD(T) | aug-cc-pV | |
| aug′-cc-pV |
| |
| cc-pV | ||
| DLPNO–CCSD(T0) | aug-cc-pV | |
| aug′-cc-pV |
| |
| cc-pV | AutoAux for D |
| Basis | α(2, 3) | β(2, 3) | α(3, 4) | β(3, 4) |
|---|---|---|---|---|
| cc-pV | 4.42 | 2.46 | 5.46 | 3.05 |
| aug′-cc-pV | 4.30 | 2.51 | 5.79 | 3.05 |
| aug-cc-pV | 4.30 | 2.51 | 5.79 | 3.05 |
| Dimer | cc-pVDZ-F12 | cc-pVTZ-F12 | cc-pVQZ-F12 | CBS(aug-2,aug-3) | CBS(aug-3,aug-4) | CBS(2,3,4) |
|---|---|---|---|---|---|---|
| SA–AM | –16.314 | –16.344 | –16.290 | –16.704 | –16.350 | –16.158 |
| SA–W | –12.698 | –12.784 | –12.735 | –13.245 | –12.826 | –12.683 |
| FA–AM | –11.622 | –11.588 | –11.534 | –11.814 | –11.558 | –11.317 |
| FA–W | –10.402 | –10.438 | –10.382 | –10.718 | –10.445 | –10.225 |
- —European Commission10.13039/501100000780
- —Danmarks Grundforskningsfond10.13039/501100001732
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAtmospheric Ozone and Climate · Atmospheric chemistry and aerosols · Advanced Chemical Physics Studies
Introduction
1
Aerosols are liquid or solid particles suspended in a gas, measuring over 2 nm. They counteract the warming caused by greenhouse gases by cooling the atmosphere.? This cooling effect occurs because aerosols can directly scatter sunlight or act as cloud condensation nuclei (CCN), tiny particles that serve as the starting points for cloud droplet formation.? When more CCN are present, clouds form with a higher number of smaller water droplets, which makes them appear whiter and more reflective.? As a result, these clouds reflect more sunlight, contributing to a cooling effect on the climate. The sixth IPCC report highlights that aerosols remain the largest source of uncertainty in modeling global radiative forcing.? A significant part of this uncertainty comes from gaps in our understanding of the initial formation mechanisms.?
Aerosol formation occurs through two processes. Primary aerosol formation happens when aerosols are directly emitted into the atmosphere, such as from volcanic eruptions, sea spray, or dust storms. Secondary aerosol formation occurs when atmospheric gases undergo gas-to-particle conversion to form clusters that can grow into aerosol sizes.? This process is called new particle formation (NPF). The dominant formation mechanism depends on the local environment, but it has been shown that secondary aerosol formation accounts for up to approximately 60% of CCN over Western United States and between 30–40% over mainland Europe.?
The NPF process has been strongly linked to the clustering of sulfuric acid (SA) with bases of high abundances such as ammonia (AM) ?,?−? ? ? ? or high basicity alkyl-amines such as methylamine (MA), dimethylamine (DMA), ?,?−? ? ? ? ? ? ? ? ? trimethylamine (TMA) ?,?,?,?,?,? and ethylenediamine (EDA). ?,?−? ? ? Furthermore, species such as nitric acid (NA), ?−? ? ? ? ? ? ? ? ? ? ? formic acid (FA), ?,?,?−? ? ? ? methanesulfonic acid (MSA) ?−? ? ? ? ? ? ? ? and water (W) ?,?−? ? ? ? ? ? ? ? ? ? ? ? ? ? have been shown to enhance cluster formation.
The initial secondary formation process can be explicitly simulated as collisions and evaporation events of the monomers and clusters. Such schemes describe the cluster distribution via solutions to the birth-death equations.?
where c _ i _ is the concentration of cluster i, t the time, γ_ i→j _ the evaporation of i to form j, and s _ i,j _ the sticking coefficient for sticking-collisions between i and j. The sticking coefficient is often approximated by kinetic gas theory as the collision coefficient of hard spheres β.
The evaporation coefficients can be calculated using detailed mass-balance under the assumption that the evaporation rate does not change significantly from equilibrium:?
where c ^ e ^ is the equilibrium concentration, c ref the reference vapor concentration, ΔG the binding free energy, T the temperature, and k B the Boltzmann constant.
The exponential dependence on the free energies requires a high degree of accuracy. The largest error in cluster formation studies has been attributed to the electronic single point energy of the density functional theory (DFT) methods used to calculate the geometries and vibrational frequencies. ?,?,? To overcome this problem, a composite scheme where a higher level calculation is used for the electronic energy with the DFT-based geometries and frequencies is often applied.
To obtain accurate cluster distribution dynamics simulations, three major requirements must be satisfied for the calculated free energies:
- the free energy has to be calculated as accurately as possible to minimize the exponentially growing error in the evaporation rates,
- the method must be applicable to all compositions and sizes of clusters included in the simulation, as they require the relative free energies to be consistent, and 3) the computational cost must be reasonable as the calculations should be applicable in large screening workflows involving numerous conformers.
Commonly, only clusters containing 4–8 monomers are explicitly simulated, as larger clusters are assumed stable in the simulation and will grow spontaneously from this point to aerosol sizes. ?,?,? For these sizes, the NormalPNO DLPNO–CCSD(T)/aug-cc-pVTZ level of theory ?−? ? ? ? ? has been used as the standard single point correction, as it has been shown to perform excellently in benchmarks. ?−? ? ?
The basis sets by Dunning and co-workers ?−? ? ? ? have been developed to smoothly and systematically converge toward the complete basis set (CBS) limit as a function of increasing cardinal numbers. Therefore, by calculating successive cardinal number energies, the energies can be extrapolated toward the CBS limit. Several schemes exist, ?−? ? ? ? ? ? ? ? ? but most treat the self-consistent field (SCF) and correlation energy separately as they do not converge following the same form.
While a three-point CBS extrapolation scheme? has been used by the Shield group with great success, ?,?,?,?−? ? a systematic benchmark of basis set extrapolation schemes for atmospheric molecular clusters has not been carried out. Likewise, there exists an extrapolation scheme for the pair natural orbital (PNO) threshold toward the complete PNO space (CPS) limit for the DLPNO methods.? The PNO threshold (T_CutPNO_) controls the number of PNOs included for each electron pair based on the occupation number.? The coupled-cluster equations are then solved for the compacted virtual space for each electron pair spanned by the PNOs above the threshold. To the best of our knowledge, the CPS extrapolation scheme has not been explored for atmospheric molecular clusters.
Recently, atmospheric cluster studies have expanded to studying much larger clusters (20–30 monomers); ?−? ? ? ? ? for these sizes, while possible, the NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ method’s memory per core requirements result in very inefficient usage of computational resources. Therefore, lower basis sets and PNO settings are needed to perform these computations efficiently. However, employing extrapolation schemes with lower settings calculations may help reduce the loss of accuracy compared to the usually employed NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ method.
Alternatively, other local coupled cluster methods may be used. The MRCC team’s local natural orbital (LNO) methods ?−? ? ? ? ? offer similar performance and accuracy. The LNO method shares the core idea of DLPNO, by solving CCSD(T) equations in localized ”domains” where electron correlation is strongest but one of the key differences is that it does not use PNOs. Instead, it employs a (not-pair-specific) LNO basis, which is constructed from a MP2 density matrix of the extended domain around a central localized orbital. Similar to the DLPNO method’s LoosePNO, NormalPNO, and TightPNO settings, the LNO method comes with predefined approximation settings (VeryLoose, Loose, Normal, Tight, VeryTight, VeryVeryTight) that set the truncation and residual thresholds used throughout the method, ensuring they converge toward the canonical limit. These predefined approximation settings are extrapolatable to the local free approximation limit (LAF).? To the best of our knowledge, the LNO methods have not been explored for atmospheric molecular clusters.
In this study, we benchmark local coupled cluster methods using different basis sets and test extrapolation schemes to the CBS, LAF, and CPS limits. The extrapolation procedures are tested against the binding energies of high-level CCSD(F12*)(T+)/cc-pVTZ-F12 reference calculations. We applied an extended test set compared to previous work, comprising RI-MP2/aug-cc-pVTZ equilibrium structures for all possible monomers and dimers (except double base structures) of SA, FA, NA, MSA, W, AM, MA, DMA, TMA, and EDA. In addition, we also include all unique conformers above a root-mean-squared deviation (RMSD) threshold of 0.38 Å, leading to a test set of a total of 218 cluster systems.
Methodology
2
Computational Details
2.1
Configurational sampling was performed using CREST 2.12 ?,? in noncovalent interaction (NCI) mode, employing GFN1-xTB.? Optimization and numeric frequencies at the RI-MP2/aug-cc-pVTZ level of theory were performed using ORCA 5.0.4. ?−? ? The DLPNO–CCSD(T_0_) ?−? ? ? ? ? calculations were performed using ORCA 6.0.0. ?−? ?
The CCSD(F12*)(T+), ?−? ? CCSD(T), ?−? ? and LNO–CCSD(T) ?−? ? ? ? ? ? calculations were performed using MRCC (august 28, 2023 version). ?,?
The basis employed for each method is given in Table. AutoAux? was used for the double-ζ basis set calculations in ORCA. aug′ is the augmented basis sets without the extra augmentation of hydrogen (i.e, cc-pVXZ for hydrogen, aug-cc-pVXZ for the other atoms). Same-sized, matching auxiliary RI-JK? and RI ?,? basis sets were employed for the MRCC calculations. Similarly, the matching auxiliary/C and/JK auxiliary basis sets were employed for the ORCA calculations, except for the double-ζ calculations, where AutoAux was used, as ORCA does not have a matching double-sized auxiliary basis set.
1: Basis Sets Used for the Different Methods
Constructing
the Test Set
2.2
We studied all dimer cluster combinations consisting of the SA, FA, NA, MSA, W, AM, MA, DMA, TMA, and EDA molecules. We excluded the (base)2 clusters as these are very unstable under atmospheric conditions. The initial dimer structures were taken from the study of Schmitz and Elm.? All the structures were subsequently optimized at the GFN1-xTB level of theory and the lowest electronic energy conformer was used as the starting point for an NCI mode CREST simulation. All nonunique conformers from CREST were removed using a modified version of the ArbAlign program,? based on a root-mean-squared deviation threshold of 0.38 Å as suggested by Kildgaard et al. ?,? The remaining unique structures were optimized at the RI-MP2/aug-cc-pVTZ level of theory. All conformers with real vibrational frequencies were kept for the benchmark. This led to a test set of a total of 218 cluster configurations. The number of individual configurations for each cluster system is presented in .
ACDC Simulations
2.3
The birth–death equations were generated via the Atmospheric Cluster Dynamics Code (ACDC) ?,? for the (SA)1–4(AM/DMA)1–4 structures by Kubečka et al.? The (SA)5(base)4 (SA)4(base)5 were set as the outgrowing clusters. All single acid clusters were counted toward the total acid concentration. Size-dependent coagulation sinks were activated using values of cs_exp = −1.6 and cs_ref = 1 × 10^–3^ s^–1^. The coagulation sink parameters were chosen to match typical values for the boundary layer.?
The Jammy Key framework? was used to automate the workflow, data processing, and setup of the ACDC simulations.
Extrapolation Schemes
2.4
For the basis set extrapolations, we use the two-point extrapolation scheme with coefficients suggested by Neese and Valeev.? Here the SCF energy can be extrapolated as?
where X is the cardinal number of the smallest basis set and S is an exponential function given as
where Q is the cardinal number given as input and α is an extrapolation coefficient.
The correlation energy is extrapolated as?
where β is another extrapolation coefficient.
The extrapolation coefficients are given in Table. It should be noted that for the aug′ basis sets, as specific coefficients have not been fitted, we used the corresponding extrapolation coefficient from the fully augmented basis sets. This decision was based on the insensitivity between the extrapolation coefficients of the cc-pVXZ and aug-cc-pVXZ basis sets, and that the aug′-cc-pVXZ basis set is most similar to the aug-cc-pVXZ basis set. Initial testing also showed it did not lead to any obvious outliers.
2: Extrapolation Coefficients Suggested by Neese and Valeev ,
The CPS and LAF extrapolation follow the same form with the same extrapolation coefficient of F = 1.5,
where LOW refers to the lower-level setting and HIGH to the higher-level setting. The LNO methods come with predefined locality settings, where two successive settings can be used to extrapolate to the LAF limit. For the CPS extrapolation, the PNO cutoff was 1 order of magnitude higher for the high-level, while the other parameters (T CutPairs = 10^–4^ and T CutMKN = 10^–3^) were kept fixed according to the NormalPNO settings.? An example of a lower-level and higher-level setting would be NormalPNO at 3.33 × 10^–7^ and NormalPNO with 3.33 × 10^–8^, respectively, for the DLPNO method. For the LNO method, it could be from the “Normal” and “Tight” locality settings, respectively.
We will denote a basis set extrapolation using cardinal numbers X and X + 1 as CBS(X,X + 1). For the augmented basis set, we will use “aug-” or “aug′-” in front of the cardinal numbers. A DLPNO calculation with the PNO threshold set at 3.33 × 10^–T^ will be denoted PNO(T). Extrapolations of PNO settings will use the CPS(T,T + 1) notation. The LNO locality setting will follow the naming convention of the MRCC manual. Extrapolation of the settings will be denoted LAF(LOW,HIGH), where the setting name is shortened until the first capital letter. (i.e., vLoose and Loose is denoted LAF(vL,L)).
Five settings were probed for the DLPNO methods: NormalPNO with the PNO threshold set from 2 orders of magnitude lower to 2 orders of magnitude higher. Six settings for the LNO methods: vvLoose to vvTight. The basis sets tested were cc-pVXZ, aug′-cc-pVXZ (no augmentation on hydrogen), and aug-cc-pVXZ (X = D,T,Q).
Binding
Energies
2.5
The studied property is the electronic binding energy, which is defined as the electronic energy required to bring the monomers together from infinitely far apart to form the cluster,
where E cluster is the electronic energy of the dimer, and E monomer is the electronic energy of the monomer making up the dimer. The monomer and dimer electronic energies are always computed at the same level of theory.
Results
and Discussion
3
Reference Methods
3.1
A suitable reference is needed to benchmark the local coupled cluster methods. We chose the F12 methods as they have been shown to give results close to the CBS limit for binding energies of atmospheric molecular clusters.? We used the (T+) triple treatment, which is a size-consistent version of the (T*) treatment, because it exhibits smaller basis set error for binding energies.? To further validate the convergence of the CCSD(F12*)(T+) methods, we calculated the binding electronic energies from cc-pVDZ-F12 to cc-pVQZ-F12 of the SA–W, SA–AM, FA–W, and FA–W dimer clusters, as these are small enough for the cc-pVQZ-F12 calculation and span the acid–water, acid–base, organic–water, and organic–base interactions, respectively. We also compared these to the canonical CCSD(T)/CBS(aug-2,aug-3), CCSD(T)/CBS(aug-3,aug-4), and CCSD(T)/CBS(2,3,4) binding energies.
The calculated binding energies can be seen in Table. All the F12 calculations give binding electronic energies within 0.09 kcal/mol of each other, with the largest difference being between cc-pVDZ-F12 and cc-pVQZ-F12 for the FA–AM dimer and cc-pVDZ-F12 and cc-pVTZ-F12 for the SA–W dimer. This indicates that binding energies are already well converged at the F12 level of theory when using the cc-pVDZ-F12 basis set. Similarly, if we compare the F12 values to the extrapolated canonical CCSD(T) results, we get within 0.1 kcal/mol between cc-pVQZ-F12 and CBS(aug-3,aug-4). The CBS(aug-2,aug-3) energies diverge a bit more with a deviation of up to 0.5 kcal/mol compared to the cc-pVQZ-F12 results. This suggests that we might need to include quadruple-size basis sets in the extrapolation procedure to get well-converged energies. Finally, the CBS(2,3,4) 4–5 inverse polynomial extrapolation? used by the Shields’ group ?,?,?,?−? ? also performs similarly to the F12 results with the largest deviation, to the cc-pVQZ-F12 result, being 0.2 kcal/mol for the FA–AM dimer.
3: Binding Electronic Energies of the Small Dimers for the CCSD(F12)(T+) and CCSD(T) Calculations as a Function of the Basis Set Size and Extrapolation Scheme*
The CCSD(F12*)(T+)/cc-pVQZ-F12 binding energies would have been ideal as a reference, but these become too computationally expensive for the larger dimers in our test set. Therefore, we chose the CCSD(F12*)(T+)/cc-pVTZ-F12 binding energies as the reference values, as these are feasible for all the dimer clusters and well-converged. However, this also implies that we will have to accept an intrinsic error on the order of 0.1 kcal/mol, which is still a much lower error than current cluster formation studies.
Dimer Clusters
3.2
We calculated the binding energies of the entire test set for all combinations of basis sets, extrapolations of basis sets, locality settings, and extrapolations of locality settings for the DLPNO–CCSD(T_0_) and LNO–CCSD(T) methods. Including all possible ways of extrapolating yields 300 different ways of calculating the electronic binding energy for each system. Due to computational limits, not all systems have been converged for the PNO(9) calculations.
aug-cc-pVXZ
3.2.1
We start by comparing the results from the largest studied basis sets, aug-cc-pVXZ. The results for the DLPNO method are shown in Figure and the results for the LNO method in Figure. From the DLPNO results, it is clear that the aug-cc-pVDZ basis set is too small (first row of Figure). We observe a mean absolute error (MAE) of roughly 0.7 kcal/mol and a max absolute error of roughly 2.4 kcal/mol for all PNO settings and extrapolations. The insensitivity to the PNO threshold confirms that the basis set error is the dominating error. The aug-cc-pVTZ calculations halve (≈ 0.3 kcal/mol) the MAE compared to the aug-cc-pVDZ results. Furthermore, the CPS(6,7) extrapolation is needed before the basis set error dominates, as the max absolute error is the same at this setting/extrapolation and higher. For the usually employed PNO(7) DLPNO–CCSD(T_0_)/aug-cc-pVTZ level of theory, we have a low MAE of 0.28 kcal/mol, however, the max absolute error is still quite substantial at 1.1 kcal/mol. These errors are consistent with the previous results by Schmitz and Elm.? The large max error is mainly caused by the tail of underbinding (to the right) clusters comprising the (MSA)1(EDA)1, (MSA)1(MA)1, and (SA)1(MA)1 clusters. The max error could be reduced to 0.82 kcal/mol by performing the PNO(6) calculation and extrapolation between it and the PNO(7) setting.
Binding energy errors of the DLPNO method using the augmented basis sets compared to the CCSD(F12)(T+)/cc-pVTZ-F12 level of theory. The PNO() labels indicate calculations performed at NormalPNO settings with the PNO thresholds set using the number as the exponent. CPS(,) and CBS(,) denote complete PNO space and complete basis set extrapolations, respectively. Each box spans 10 kcal/mol. Negative values (left) indicate overbinding relative to the reference method.*
Binding energy errors of the LNO method using the augmented basis sets compared to the CCSD(F12)(T+)/cc-pVTZ-F12 level of theory. LAF(,) and CBS(,) denote local approximation free and complete basis set extrapolations, respectively. Each box spans 10 kcal/mol. Negative values (left) indicate overbinding relative to the reference method.*
Extrapolating between the double- and triple-ζ basis sets always reduces the MAE, except for the PNO(5) setting. Following along the row, the max absolute error does not always get lower when extrapolating; this happens for the CPS(7,8) extrapolation and above. This is likely due to the loss of cancellation of errors for the CPS(7,8) extrapolation compared to the PNO(7) setting. This also shows that routine cluster formation studies employing the normal PNO(7) setting with an aug-cc-pVTZ basis set could obtain increased accuracy by performing CBS(aug-2,aug-3) extrapolation.
Using the aug-cc-pVQZ basis set reduces the span of errors. However, there is an increase in the MAE for the PNO(7) setting and below. This is due to a general trend where a larger basis set causes all the errors to shift toward overbinding (to the left), whereas the locality error shifts it toward underbinding (to the right). For the CPS(6,7) extrapolation and higher, the reduced error in the locality approximation reduces the spread and hence the MAE. Extrapolating between the triple and quadruple basis sets again does not necessarily decrease the MAE, as the previous basis set sizes allowed for the basis set error to partly cancel with the locality error, allowing the error distribution to be centered around zero. Following along the row for the CBS(aug-3,aug-4) results, the error distributions start around zero, with the PNO settings controlling the span. The remaining local error is unlikely to be due to the (T_0_) approximation as Schmitz and Elm? found minimal change in the error distribution using the (T_1_) approach compared to the CCSD(T)-F12/CBS reference. A way to lower the local error would be to add the F12 correction. Schmitz and Elm? showed this can significantly lower the error, but it comes with a memory requirement that would make it unsuitable for larger molecular clusters.
In general, we observe that the commonly applied DLPNO–CCSD(T_0_)/aug-cc-pVTZ results rely on the cancellation of errors between the basis set error and the locality error to give an error distribution centered around zero. For the fully augmented basis set, a DLPNO–CCSD(T_0_)/CBS(aug-2,aug-3) energy calculation using the PNO(7) settings would be an ideal compromise between accuracy and computational cost.
For the LNO method, the small aug-cc-pVDZ basis set still yields large basis set errors with MAEs of roughly 0.9 kcal/mol and max absolute errors of up to 3.36 kcal/mol. Tightening the locality setting does not significantly change the MAE, as the basis set error dominates. It only shifts the error distribution toward overbinding (left). Similar to DLPNO, increasing the basis set size to aug-cc-pVTZ roughly halves the MAE and max absolute error in all cases. When tightening the locality setting, we see a shift in the error distribution toward overbinding (left) causing it to deviate from zero. Doing an extrapolation between the basis sets, does not significantly change the MAE, but it reduces the span of errors. For instance, going from Normal LNO–CCSD(T)/aug-cc-pVTZ to Normal LNO–CCSD(T)/CBS(aug-2,aug-3) reduces the error span from 1.86 to 1.12 kcal/mol. Similarly to the DLPNO results, tightening the locality setting does not decrease the MAE as the error distribution is already centered around zero at the vLoose setting. For the tighter locality settings, the max absolute error increases due to a tail of larger errors from the (MSA)1(SA)1, (MSA)2, and (SA)2 dimer clusters. Calculating the energies at the aug-cc-pVQZ level massively reduces the MAE by a factor of 3 compared to the CBS(aug-2, aug-3) results and decreases the max absolute error by more than a factor of 2 for all locality settings except vLoose. This massive decrease in error is due to the distribution exhibiting less spread but also a near-perfect cancellation of errors where the basis set error shifts toward underbinding (right) but the locality error shifts toward overbinding (left). This combination results in a narrow error distribution centered on zero for the Normal LNO–CCSD(T)/aug-cc-pVQZ level of theory. This is also the reason the results do not necessarily improve when performing the CBS(aug-3,aug-4) extrapolation, since we end up in the same scenario as the DLPNO method, where the error distribution roughly starts at zero, with the tightness of the locality setting controlling the width.
In general, we find that, at comparable settings to the usually employed NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ calculation, DLPNO benefits from excellent cancellation of errors. This ideal cancellation only happens for the LNO method with the larger aug-cc-pVQZ basis set. However, the error distribution is much narrower for the LNO methods at the larger basis set sizes compared to the DLPNO method, suggesting the locality error is lower for the LNO methods. Furthermore, we find the LNO methods to be much more computationally efficient (will be discussed in greater detail in sections and ?) compared to similar settings for DLPNO. This means that more accurate LNO calculations are readily achievable on larger systems. Since the goal is to achieve the lowest error possible on all system sizes, we will, from now on, focus on the LNO methods; however, we have added the equivalent DLPNO benchmarks for the cc-pVXZ and aug′-cc-pVXZ basis sets in the .
aug′-cc-pVXZ
3.2.2
Due to the noncovalent bonded nature of atmospheric clusters, diffuse functions should, in principle, be added to the basis functions to better describe the interactions. The problem is that these functions can substantially increase the computational cost. Furthermore, they might cause SCF convergence problems due to linear dependencies in the basis.? A compromise would be to only augment all non-hydrogen atoms with diffuse functions. The Dunning basis sets following this augmentation philosophy are denoted the aug′-cc-pVXZ basis sets, and the results for the LNO method are shown in Figure and the DLPNO method in .
Binding energy errors of the LNO method using the reduced augmented basis sets compared to the CCSD(F12)(T+)/cc-pVTZ-F12 level of theory. LAF(,) and CBS(,) denote local approximation free and complete basis set extrapolations, respectively. Each box spans 10 kcal/mol. Negative values (left) indicate overbinding relative to the reference method.*
The aug′-cc-pVDZ basis set does still not perform well, with a MAE of 0.54–0.71 kcal/mol and max absolute errors of 1.97–2.15 kcal/mol. Interestingly, at the aug′-cc-pVTZ basis set size, the basis set error makes the error distribution centered at zero for the Normal setting. This yields similar errors to the aug-cc-pVTZ DLPNO method but for a smaller basis set. However, tightening the locality setting does not significantly increase the accuracy, as the basis set error still dominates. For the vTight setting and higher, the max absolute error increases because the error distribution shifts away from being centered at zero. Similarly, the CBS(aug′-2,aug′-3) extrapolation does not perform much better due to the same reason, however, the error span does improve, going from roughly 1.5 kcal/mol to roughly 1.1 kcal/mol compared to the aug′-cc-pVTZ results. Increasing the basis set size to aug′-cc-pVQZ does not decrease the MAE for the Tight setting and lower. The max absolute error also slightly decreases at the LAF(vL,L) extrapolation and higher. Similarly, performing the CBS(aug′3-aug′-4) extrapolation does not significantly change the accuracy and, for several cases, actually decreases the accuracy a bit. However, the extrapolation does benefit from tighter locality settings as the three error measures can be halved by going from Normal to vTight. While the error measures are higher than the equivalent locality setting for the augmented basis set, the error distribution is still narrow, and the slight increase in error might be worth the gain in computational efficiency.
cc-pVXZ
3.2.3
While the nonaugmented basis might lack the extra diffuse functions, the extrapolations might make up for it, and being able to use them would result in a massive increase in computational speed compared to the augmented basis sets. The results for the LNO methods are shown in Figure and the DLPNO method in . The smaller cc-pVDZ and cc-pVTZ basis sets have the largest max absolute errors of up to 5.52 and 2.21 kcal/mol, respectively. While the cc-pVTZ and CBS(2,3) cases have reasonable MAEs down to 0.62 and 0.31 kcal/mol, respectively, the large max absolute error still makes them less than ideal, yet still shows an improvement over the original DLPNO method. Increasing the basis set size to cc-pVQZ decreases the MAE and max absolute error because the error distribution gets centered around zero; however, the error span largely stays the same with and without extrapolation at this basis set size. Furthermore, for all the basis sets at or below this size, there is not a substantial change in accuracy when increasing from the Normal locality setting and above, indicating the basis set error is quite substantial for the nonaugmented basis sets. Interestingly, while the previous basis sets sizes and extrapolation did not perform as well as the equivalent (reduced) augmented basis set. The CBS(3,4) extrapolation performs excellently. For instance, the Normal CBS(3,4) extrapolation has a MAE of 0.11 kcal/mol, a max absolute error of 0.37 kcal/mol, and an error span of 0.61 kcal/mol. It performs as well as the equivalent augmented basis set and setting (MAE of 0.16 kcal/mol, max ABS of 0.38 kcal/mol, error span of 0.47 kcal/mol). Furthermore, it outperforms the equivalent reduced augmented basis set and setting with nearly half the MAE and max absolute error and a similar error span.
Binding energy errors of the LNO method using the nonaugmented basis sets compared to the CCSD(F12)(T+)/cc-pVTZ-F12 level of theory. LAF(,) and CBS(,) denote local approximation free and complete basis set extrapolations, respectively. Each box spans 10 kcal/mol. Negative values (left) indicate overbinding relative to the reference method.*
This indicates that the Normal LNO–CCSD(T)/CBS(3,4) level of theory might be a sweet spot between accuracy and computational cost for calculating the binding energies of (smaller) atmospheric molecular clusters. However, as this might not hold for larger clusters (will be discussed in Section), we still recommend applying LNO–CCSD(T)/CBS(aug-3,aug-4) with at least the Normal setting for the usually studied cluster sizes of 2–8 monomers, as these should be readily attainable.
Timing Trends of the Different Methods
3.3
Figure shows the walltime for a LNO–CCSD(T) single-point calculation of (SA)1(DMA)1 using the given settings relative to the usually employed NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ using 10 cores. It should be noted that these relative timings will change for larger system sizes, and this is primarily for a quick estimate.
Walltime for a LNO–CCSD(T) single-point calculation of (SA)1(DMA)1 using the given settings relative to the usually employed NormalPNO DLPNO–CCSD(T0)/aug-cc-pVTZ using 10 cores. The calculations were performed using 10 parallel MPI processes in ORCA or 10 OpenMP threads in MRCC, both on an Intel Xeon Platinum 8358 CPU using a dedicated local disk.
Increasing from the Normal to Tight setting increases the time by a factor of 2–3, but it does not substantially change the accuracy for the dimer test set. A similar trend is observed for augmented versus the nonaugmented basis sets with a factor of 2 difference. Reducing the basis set from the augmented to the reduced augmented aug′ basis does not significantly change the walltime, but comes with a slight increase in error for the dimers. However, we find that for much larger clusters (see Section) the timing difference is much larger and the fully augmented basis set calculations are not possible, but the reduced augmented are.
Comparing to the original DLPNO method, it is possible to perform the Normal CBS(aug-2,aug-3), Normal CBS(aug′-2,aug′-3), or Normal CBS(2,3) calculation for the same amount of time. Likewise, using a bit more time, it is possible to perform the Normal CBS(3,4) calculation, which showed excellent accuracy in the dimer benchmark.
The Effects on Simulated New Particle Formation
Rates
3.4
As concluded in Section, an extrapolation scheme is needed to reduce the basis set error and obtain accurate binding energy results. For the most part, the benchmark indicates the error measures are fairly insensitive to the tightness of the locality setting as long as they are at or above the Normal setting. However, the benchmark is limited to the dimer systems as these are the only ones small enough to obtain a high-level reference for. The goal is to simulate the new particle formation rate with cluster sizes much larger than the dimers. To probe this, we test the sensitivity of the different extrapolations’ effect on the simulated new particle formation rates of the (SA)1–4(DMA)1–4 and (SA)1–4(AM)1–4 systems sampled by Kubečka et al.? We keep the locality setting at Normal or Tight, and extrapolation of them due to the aforementioned lack of sensitivity.
Figure shows the simulated new particle formation rates of the (SA)1–4(DMA)1–4 cluster system using single points at all possible (2,3) and (3,4) extrapolations along with the Normal and Tight and the LAF(N,T) extrapolation. The temperature was set to 278.15 K with an SA concentration of either 10^7^ molecules cm^–3^ (denoted high) or 10^6^ molecules cm^–3^ (denoted low) and 10 ppt of DMA. The original DLPNO bar is the NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ level.
New particle formation rate as calculated by ACDC for the (SA)1–4(DMA)1–4 system with the single point at the given LNO–CCSD(T) level given by the labels. The thermal contribution is calculated at the ωB97X-D/6-31++G(d,p) level of theory. The temperature was set to 278.15 K, the SA concentration at 107 molecules cm–3, and the DMA concentration at 10 ppt.
High SA Concentration
3.4.1
For simulations using high SA concentrations, even with the exponential dependence on the free energy, all the different binding energy schemes yield new particle formation rates within a factor of 5 of each other. Hence, simulated NPF rates are relatively insensitive to the choice of high-level methods for calculating the binding energies. The rates are also relatively insensitive to the locality setting at this concentration, with the largest relative change being a drop by 369 cm^–3^ s^–1^ for the Normal to Tight setting at the CBS(aug-2,aug-3) level. It should be noted that this is also the largest total rate.
The behavior of the rates can be explained by studying the Gibbs free energies under the given conditions (see ). We observe that for all levels of theory, the initial dimer formation constitutes the highest nucleation barrier, and the rate directly follows its free energy. For instance, the DLPNO, CBS(2,3), CBS(aug′-3,aug′-4) and CBS(aug-3,aug-4) single-points all predict the dimer free energy under given conditions to be around 2.5–2.8 kcal/mol, whereas the CBS(aug′-2,aug-’3) and CBS(3,4) levels predict it to be around 2.1–2.2 kcal/mol, and the CBS(aug-2,aug-3) level around 1.9–2.0 kcal/mol.
From the high-concentration simulations, we can conclude that for systems or concentrations where the initial dimer is the rate-determining step, the results are somewhat insensitive to the level of theory as long as they predict the dimer free energy similarly. For instance, the NormalPNO DLPNO–CCSD(T_0_)/aug-cc-pVTZ level and the CBS(2,3) Tight level both predict a NPF rate of roughly 850 cm^–3^ s^–1^ due to identical free energy of the dimer cluster, despite the (SA)4(DMA)4 cluster being 3.7 kcal/mol lower in free energy for the DLPNO method.
Low SA Concentration
3.4.2
The NPF rate is also dependent on the precursor concentrations. Changing the concentration will change the rate-determining step. Figure shows the simulated NPF rates at a SA concentration of 10^6^ cm^–3^. Decreasing the concentrations expands the rate-determining clusters to include the (SA)2(DMA)1 cluster. Furthermore, the (SA)2(DMA)2 cluster also has positive free energy at the given conditions for the CBS(3,4) Tight setting and above. The high-dimensional parameter space makes it difficult to directly couple the overall NPF rate to specific cluster free energies, but some general trends can be observed. Overall, the NPF rates vary more compared to the high-concentration scenario because the important cluster free energies are no longer predicted similarly between the methods. For instance, the rates, compared to the high SA concentration, decrease substantially with increasing tightness of the locality settings. The DLPNO–CCSD(T_0_)/aug-cc-pVTZ rate of 1.57 cm^–3^ s^–1^ is also substantially higher than the CBS(aug-3,aug-4) results (0.13–1.57 cm^–3^ s^–1^). This indicates that for systems where the rate is low and several clusters in the simulated system are rate-determining, the different methods can no longer be expected to yield similar results. However, as in the high SA concentration simulation, the differences between the CBS(aug′-3,aug′-4) and CBS(aug-3,aug-4) results for the same locality settings are low (below 0.03 cm^–3^ s^–1^).
New particle formation rate as calculated by ACDC for the (SA)1–4(DMA)1–4 system with the single point at the given LNO–CCSD(T) level given by the labels. The thermal contribution is calculated at the ωB97X-D/6-31++G(d,p) level of theory. The temperature was set to 278.15 K, the SA concentration at 106 molecules cm–3, and the DMA concentration at 10 ppt.
Lastly, to study the behavior of a loosely bound system, we simulated the NPF rates of the (SA)1–4(AM)1–4 system at 278.15 K, an SA concentration of 10^7^ molecules cm^–3^ and 10 ppb of AM (see Figure). For these conditions, the free energy is positive (at minimum 6 kcal/mol) for all methods, with the highest free energy cluster being the (SA)3(AM)2 cluster. Studying the possible growth paths, we find several barriers in the 5–7 kcal/mol range. Similarly to the lower concentration SA–DMA system, the rates drop with tighter locality settings and there is still a high variability for the rates. However, it should be noted that as the absolute NPF rates for the SA–AM system are relatively low, small differences between the methods are “insignificant” compared to other errors. Unlike the two previous examples, the CBS(aug′-2,aug′-3) extrapolation and CBS(3,4) no longer give the same rates, and the CBS(3,4) instead follows the CBS(aug-2,aug-3) rates. However, similar to before, the CBS(aug′-3,aug′-4) and CBS(aug-3,aug-4) extrapolations yield quite similar rates. This illustrates the robustness of these extrapolation procedures.
New particle formation rate as calculated by ACDC for the (SA)1–4(AM)1–4 system with the single point at the given LNO–CCSD(T) level given by the labels. The thermal contribution is calculated at the ωB97X-D/6-31++G(d,p) level of theory. The temperature was set to 278.15 K, the SA concentration at 107 molecules cm–3, and the AM concentration at 10000 ppt.
Finally, to compare the rates of the wave function-based methods to commonly used DFT methods, we chose to test single-point energies for the SA–DMA system at the ωB97–3c, ωB97X-D3BJ/6-311++G(3df,3pd), PW91/6-311++G(3df,3pd), and M06-2X/6-311++G(3df,3pd) levels of theory. For conditions matching the low SA concentration regime, the simulations yield rates of 0.13, 7.02, 5 × 10^–15^, and 0.02 cm^–3^ s^–1^, respectively. Except for the PW91 method, this is fairly decent accuracy compared to the DLPNO–CCSD(T_0_)/aug-cc-pVTZ value of 1.57 molecules cm^3^ s^–1^. Hence, one would assume that DFT could potentially reveal the correct trends, but the absolute value could be slightly off.
From this section, we can conclude that for systems where the formation of the initial dimer cluster is rate-determining, the recommended methodology for calculating the binding energies follows the benchmark results in section. However, for systems where larger clusters are rate-determining, an increased sensitivity to the locality setting is observed, with decreasing NPF rates with a tighter locality setting. This indicates that the larger clusters’ local approximation error is more prominent. Hence, for studies on systems containing up to eight monomer clusters and extreme accuracy is warranted, we suggest the LAF(N,T) LNO–CCSD(T)/CBS(aug-3,aug-4) level for calculating the binding energies, as it should, in principle, be closest to CCSD(T)/CBS. However, the error in the thermal contribution is likely larger, and additional computational effort should perhaps be put toward reducing it instead.
Toward Accurate Binding Energies of Freshly
Nucleated Particles
3.5
From the previous sections, it is clear that the LNO methods are a convenient choice for modeling large clusters. We are currently pushing the boundaries for the sizes of atmospheric molecular clusters we can accurately model. This has led to (SA)_ n (base) n _ clusters with n up to 15 being sampled. However, for our previous studies of such large clusters, we have been limited to DFT using B97–3c for calculating the binding energies, which poses a potential large source of error. One of the largest clusters studied to date is the (SA)15(TMA)15 cluster (300 atoms) as depicted in Figure. ?,?
(SA)15(TMA)15 cluster at the B97–3c level of theory found by Wu et al. ,
We find it possible to calculate the LAF(N,T) LNO–CCSD(T)/CBS(aug′-3,aug′-4) level of theory for a cluster of this size. Comparing the electronic binding energy to the original B97–3c level of theory, we find that B97–3c yields a value of −748.2 kcal/mol compared to −778.0 kcal/mol at the LAF(N,T) LNO–CCSD(T)/CBS(aug′-3,aug′-4) level. This massive 29.8 kcal/mol difference implies that the previously calculated binding energies are underbinding. This work illustrates that high-accuracy single points are now attainable even for very large clusters. If an even higher level of theory is needed, this should be attainable with machine learning methods. We suggest Δ-ML with kernel ridge regression (KRR) as we have previously shown that it can easily yield sub kcal/mol errors when trained on few conformers. ?−? ? However, if more data is available, the PaiNN architecture? could also reach similar errors.? In addition, KRR has been shown to be able to extrapolate well to larger cluster sizes. ?,? Hence, we can train Δ-ML KRR model on smaller (acid)1–15(base)1–15 clusters and accurately predict the binding energies of even larger clusters than previously studied. This will enable hitherto unseen accuracy of the binding energies of large atmospheric molecular clusters. Another possibility could be to employ cluster-in-molecule (CIM) or similar fragmentation approaches.?
Conclusions
4
We have constructed an extended test set of RI-MP2/aug-cc-pVTZ equilibrium structures of all possible monomers and dimers (except double base structures) of sulfuric acid (SA), formic acid (FA), nitric acid (NA), methanesulfonic acid (MSA), water (W), ammonia (AM), methylamine (MA), dimethylamine (DMA), trimethylamine (TMA), and ethylenediamine (EDA). This leads to a versatile testset of 218 cluster structures.
We benchmark the local coupled cluster methods, DLPNO–CCSD(T_0_) and LNO–CCSD(T), on the test set using different basis sets and extrapolation procedures to the complete basis set (CBS), local approximation free (LAF), and complete PNO space (CPS) limits. The extrapolations are tested against the binding energies of high-level CCSD(F12*)(T+)/cc-pVTZ-F12 reference calculations.
Five settings were probed for the DLPNO methods: NormalPNO with the PNO threshold set from 2 orders of magnitude lower to 2 orders of magnitude higher. Six settings for the LNO methods: vvLoose to vvTight. The basis sets tested were cc-pVXZ, aug′-cc-pVXZ (no augmentation on hydrogen), and aug-cc-pVXZ (X = D,T,Q).
Using the augmented basis sets with the DLPNO method, we find that the commonly applied DLPNO–CCSD(T_0_)/aug-cc-pVTZ results rely on the cancellation of errors between the basis set error and the locality error to give an error distribution centered around zero. This ideal cancellation only happens for the LNO method with the larger aug-cc-pVQZ basis set. However, the error distribution is much narrower for the LNO methods at the larger basis set sizes compared to the DLPNO method, and we find the cost-to-accuracy ratio to be better as well. We therefore chose to focus on the LNO methods. In general, we also find the error to be fairly insensitive to the locality settings and extrapolations thereof for the LNO methods on the dimer test set.
Testing the reduced augmented basis set (everything but H is augmented), we find the error measures to be higher than the equivalent locality setting for the augmented basis set, However, the error distribution is still narrow, and the slight increase in error might be worth the gain in computational efficiency.
For the nonaugmented basis sets, we find that the CBS(3,4) extrapolation performs excellently and is nearly identical with the fully augmented basis set results. This indicates that the Normal LNO–CCSD(T)/CBS(3,4) level of theory might be a sweet spot between accuracy and computational cost for calculating the binding energies of (smaller) atmospheric molecular clusters. However, as this might not hold for larger clusters, we suggest doing the LNO–CCSD(T)/CBS(aug-3,aug-4) calculation at at least the Normal setting for the usually studied cluster sizes of 2–8 monomers.
Using the Atmospheric Molecular Dynamics Code, we simulated the new particle formation rate of the (SA)1–4(AM)1–4 and (SA)1–4(DMA)1–4 systems with different basis set extrapolation, the Normal and Tight locality settings and extrapolation of them. We find an increased sensitivity to the locality settings for larger clusters compared to the dimers, but the basis set error is still the most dominant; i.e., the rates would benefit from doing an LAF extrapolation, but the thermal contribution is likely a much larger source of error at this point.
Finally, we show for the large (SA)15(TMA)15 cluster (300 atoms), that it is possible to perform the CBS extrapolation using aug′-cc-pVTZ and aug′-cc-pVQZ together with the LAF extrapolation using the Normal and Tight settings for the LNO–CCSD(T) method. This illustrates that employing the LNO methods, high-accuracy single points are now attainable even for very large clusters.
Supplementary Material
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Lee, H. ; Calvin, K. ; Dasgupta, D. ; Krinner, G. ; Mukherji, A. ; Thorne, P. ; Trisos, C. ; Romero, J. ; Aldunce, P. ; Barret, K. IPCC, 2023: Climate Change 2023: Synthesis Report, Summary for Policymakers; Contribution of Working Groups I, II and III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change; Core Writing Team, Lee, H. , Romero, J. , Eds.; IPCC: Geneva, Switzerland, 2023.
- 2Boucher O.Lohmann U.The Sulfate-CCN-Cloud Albedo Effect Tellus B: Chem. Phys. Meteorol.19954728130010.1034/j.1600-0889.47.issue 3.1.x · doi ↗
- 3Twomey S.Pollution and The Planetary Albedo Atmos. Environ.197481251125610.1016/0004-6981(74)90004-3 · doi ↗
- 4Tröstl J.Chuang W. K.Gordon H.Heinritzi M.Yan C.Molteni U.Ahlm L.Frege C.Bianchi F.Wagner R.The Role of Low-Volatility Organic Compounds in Initial Particle Growth in the Atmosphere Nature 201653352753110.1038/nature 1827127225126 PMC 8384036 · doi ↗ · pubmed ↗
- 5Kulmala M.Kontkanen J.Junninen H.Lehtipalo K.Manninen H. E.Nieminen T.PetäjäT.SipiläM.Schobesberger S.Rantala P.Direct Observations of Atmospheric Aerosol Nucleation Science 201333994394610.1126/science.122738523430652 · doi ↗ · pubmed ↗
- 6Zhao B.Donahue N. M.Zhang K.Mao L.Shrivastava M.Ma P.-L.Shen J.Wang S.Sun J.Gordon H.Global Variability in Atmospheric New Particle Formation mechanisms Nature 20246319810510.1038/s 41586-024-07547-138867037 PMC 11222162 · doi ↗ · pubmed ↗
- 7Kirkby J.Curtius J.Almeida J.Dunne E.Duplissy J.Ehrhart S.Franchin A.GagnéS.Ickes L.Kürten A.Role of Sulphuric Acid, Ammonia and Galactic Cosmic Rays in Atmospheric Aerosol Nucleation Nature 201147642943310.1038/nature 1034321866156 · doi ↗ · pubmed ↗
- 8Schobesberger S.Junninen H.Bianchi F.Lönn G.Ehn M.Lehtipalo K.Dommen J.Ehrhart S.Ortega I. K.Franchin A.Molecular Understanding of Atmospheric Particle Formation from Sulfuric Acid and Large Oxidized Organic Molecules Proc. Natl. Acad. Sci. U.S.A.2013110172231722810.1073/pnas.130697311024101502 PMC 3808659 · doi ↗ · pubmed ↗
