Nonlocal Machine-Learned Exchange Functional for Molecules and Solids
Kyle Bystrom, Boris Kozinsky

TL;DR
This paper introduces a machine-learned, nonlocal exchange functional for DFT that achieves hybrid-DFT accuracy at a computational cost comparable to semilocal functionals, benefiting both molecules and solids.
Contribution
It develops a novel machine learning-based exchange functional that is orbital-dependent and nonlocal, offering high accuracy with low computational cost for diverse systems.
Findings
Achieves hybrid-DFT accuracy on thermochemical benchmarks.
Improves band gap predictions over semilocal DFT.
Demonstrates scalability in large supercell calculations.
Abstract
The design of better exchange-correlation functionals for Density Functional Theory (DFT) is a central challenge of modern electronic structure theory. However, current developments are limited by the mathematical form of the functional, with efficient semilocal functionals being inaccurate for many technologically important systems and the more accurate hybrid functionals being too expensive for large solid-state systems due to the use of the exact exchange operator. In this work, we use machine learning combined with exact physical constraints to design an exchange functional that is both orbital-dependent and nonlocal, but which can be evaluated at roughly the cost of semilocal functionals and is significantly faster than hybrid DFT in plane-wave codes. By training functionals with several different feature sets, we elucidate the roles of orbital-dependent and nonlocal features in…
| Model | 1e Systems111The 1-electron system dataset consists of the energies of \cfH2+, \cfH3^2+-linear, and \cfH3^2+-triangular referenced to the energy of the isolated hydrogen atom. All nearest-neighbor bond lengths are 0.7 Å. | 2e Systems222The 2-electron system dataset consists of three energy differences between \cfHe+, \cfHe-singlet, and \cfHe-triplet, where \cfHe-triplet is the neutral helium atom excited state with both electrons having the same spin. |
|---|---|---|
| SL-GGA | ||
| NL-GGA | ||
| SL-MGGA | ||
| NL-MGGA |
| Functional | No. Unconverged |
|---|---|
| SL-GGA | 10 |
| NL-GGA | 16 |
| SL-MGGA | 5 |
| NL-MGGA | 3 |
| NL-MGGA-PBE | 1 |
| NL-MGGA-DTR | 0 |
| PBE333This work. | r2SCAN11footnotemark: 1 | SL-MGGA11footnotemark: 1 | NL-MGGA-DTR11footnotemark: 1 | HSE06444Borlido et al. [42] | PBE022footnotemark: 2 | ||
|---|---|---|---|---|---|---|---|
| ME555Mean error (eV) | |||||||
| MAE666Mean absolute error (eV) | |||||||
| RMSE777Root mean square error (eV) | |||||||
| MARE888Mean absolute relative error | |||||||
| PBE | r2SCAN | CIDER | HSE06 | Expt. [42] | |
| Si | |||||
| Ge | |||||
| InP | |||||
| GaAs | |||||
| CdSe | |||||
| BP | |||||
| GaP | |||||
| CdS | |||||
| GaN | |||||
| ZnS | |||||
| C | |||||
| BN | |||||
| CaO | |||||
| MgO | |||||
| NaCl | |||||
| LiF | |||||
| Ar | |||||
| ME | - | ||||
| MAE | - | ||||
| RMSE | - | ||||
| MARE | - |
| Functional | C | Si |
|---|---|---|
| PBE | ||
| r2SCAN | ||
| SL-GGA | ||
| NL-GGA | ||
| SL-MGGA | ||
| NL-MGGA | ||
| NL-MGGA-PBE | ||
| NL-MGGA-DTR | ||
| HSE06 | ||
| PBE0 |
| Type/Baseline | |
|---|---|
| NL-GGA/PBE | |
| NL-GGA/Chachiyo | |
| NL-MGGA/PBE | |
| NL-MGGA/Chachiyo |
| Model | SL-MGGA | NL-MGGA | NL-MGGA-PBE | NL-MGGA-DTR |
|---|---|---|---|---|
| PBE0/CIDER | 7.86 | 6.59 | 6.62 | 7.32 |
| PW6B95/CIDER | 7.99 | – | 6.58 | 6.29 |
| PBE999This work. | r2SCAN11footnotemark: 1 | SL-MGGA11footnotemark: 1 | NL-MGGA-DTR11footnotemark: 1 | HSE06101010Borlido et al. [42] | PBE022footnotemark: 2 | ||
| ME (eV) | |||||||
| MAE (eV) | |||||||
| RMSE (eV) | |||||||
| MRE | |||||||
| MARE | |||||||
| RMSRE | |||||||
| Code Name | Baseline | Loss | ||
|---|---|---|---|---|
| SL_GGA_VC_000 | Chachiyo | 8 | 0.5 | 7.035046 |
| SL_GGA_VC_003 | Chachiyo | 4 | 0.5 | 7.052635 |
| SL_GGA_VP_008 | PBE | 0.1 | 0.5 | 7.079104 |
| SL_GGA_VC_009 | Chachiyo | 2 | 0.5 | 7.092393 |
| PBE | – | – | – | 10.868648 |
| SL_GGA_VP_000 | PBE | 0.4 | 0.5 | 16.625300 |
| SL_GGA_VP_003 | PBE | 0.2 | 0.5 | 38.890716 |
| Code Name | Baseline | Scheme | Loss | ||
|---|---|---|---|---|---|
| NL_GGA_VC_004 | Chachiyo | S1 | 1 | 0.5 | 3.904467 |
| NL_GGA_VP_004 | PBE | S1 | 1 | 0.5 | 3.928614 |
| NL_GGA_VC_003 | Chachiyo | S1 | 2 | 0.5 | 3.990208 |
| NL_GGA_VP_003 | PBE | S1 | 2 | 0.5 | 4.007883 |
| NL_GGA_VP_005 | PBE | S2 | 2 | 0.5 | 4.057075 |
| NL_GGA_VC_005 | Chachiyo | S2 | 2 | 0.5 | 4.061060 |
| NL_GGA_VC_008 | Chachiyo | S1 | 0.5 | 0.5 | 4.286322 |
| NL_GGA_VP_008 | PBE | S1 | 0.5 | 0.5 | 4.290059 |
| NL_GGA_VP_001 | PBE | S1 | 4 | 0.5 | 4.329216 |
| NL_GGA_VC_001 | Chachiyo | S1 | 4 | 0.5 | 4.342482 |
| NL_GGA_VC_009 | Chachiyo | S2 | 0.5 | 4 | 4.407261 |
| NL_GGA_VP_009 | PBE | S2 | 0.5 | 4 | 4.449937 |
| NL_GGA_VP_000 | PBE | S2 | 4 | 0.5 | 4.568491 |
| NL_GGA_VC_000 | Chachiyo | S2 | 4 | 0.5 | 4.581439 |
| NL_GGA_VC_006 | Chachiyo | S2 | 0.5 | 0.5 | 4.798940 |
| NL_GGA_VP_007 | PBE | S2 | 0.5 | 0.5 | 4.800655 |
| PBE | – | – | – | – | 10.868648 |
| Code Name | Baseline | Loss | ||
|---|---|---|---|---|
| SL_MGGA_VC_006 | Chachiyo | 64 | 0.5 | 3.218994 |
| SL_MGGA_VP_007 | PBE | 3.2 | 0.5 | 3.220336 |
| SL_MGGA_VP_009 | PBE | 1.6 | 0.5 | 3.286767 |
| SL_MGGA_VC_008 | Chachiyo | 32 | 0.5 | 3.303520 |
| SL_MGGA_VC_009 | Chachiyo | 16 | 0.5 | 3.347484 |
| SL_MGGA_VP_008 | PBE | 0.05 | 0.25 | 3.724633 |
| SL_MGGA_VP_004 | PBE | 0.1 | 0.25 | 3.771061 |
| SL_MGGA_VP_002 | PBE | 0.2 | 0.25 | 3.872119 |
| SL_MGGA_VC_007 | Chachiyo | 2 | 0.25 | 3.934145 |
| SL_MGGA_VC_003 | Chachiyo | 4 | 0.25 | 3.995785 |
| SL_MGGA_VP_000 | PBE | 0.4 | 0.25 | 4.085934 |
| SL_MGGA_VC_001 | Chachiyo | 8 | 0.25 | 4.130502 |
| SL_MGGA_VP_001 | PBE | 0.8 | 0.25 | 4.370939 |
| PBE | – | – | – | 10.868648 |
| Code Name | Baseline | Scheme | Loss | ||
|---|---|---|---|---|---|
| NL_MGGA_VC_005 | Chachiyo | S1 | 1 | 1 | 1.895010 |
| NL_MGGA_VP_007 | PBE | S1 | 1 | 1 | 1.930917 |
| NL_MGGA_VC_006 | Chachiyo | S1 | 0.5 | 1 | 2.011226 |
| NL_MGGA_VP_005 | PBE | S1 | 0.5 | 1 | 2.039220 |
| NL_MGGA_VC_000 | Chachiyo | S2 | 1 | 1 | 2.039227 |
| NL_MGGA_VP_000 | PBE | S2 | 1 | 1 | 2.044209 |
| NL_MGGA_VP_003 | PBE | S2 | 1 | 2 | 2.063379 |
| NL_MGGA_VC_003 | Chachiyo | S2 | 1 | 2 | 2.063773 |
| NL_MGGA_VP_006 | PBE | S2 | 2 | 0.5 | 2.100342 |
| NL_MGGA_VC_007 | Chachiyo | S2 | 2 | 0.5 | 2.118458 |
| NL_MGGA_VC_001 | Chachiyo | S2 | 0.5 | 2 | 2.284749 |
| NL_MGGA_VP_002 | PBE | S2 | 0.5 | 2 | 2.313989 |
| NL_MGGA_VP_004 | PBE | S2 | 0.5 | 4 | 2.457190 |
| NL_MGGA_VC_004 | Chachiyo | S2 | 0.5 | 4 | 2.469842 |
| NL_MGGA_VP_009 | PBE | S1 | 0.5 | 4 | 2.504403 |
| NL_MGGA_VC_002 | Chachiyo | S2 | 0.5 | 1 | 2.584111 |
| NL_MGGA_VP_001 | PBE | S2 | 0.5 | 1 | 2.604407 |
| NL_MGGA_VC_009 | Chachiyo | S1 | 0.5 | 4 | 2.620007 |
| NL_MGGA_VP_008 | PBE | S2 | 1 | 4 | 2.620640 |
| NL_MGGA_VC_008 | Chachiyo | S2 | 1 | 4 | 2.665951 |
| PBE | – | – | – | – | 10.868648 |
| Name in Paper | Code Name | Baseline | Scheme | ||||
|---|---|---|---|---|---|---|---|
| SL-GGA | SL_GGA_VC_000 | Chachiyo | 8 | 0.5 | – | – | – |
| NL-GGA | NL_GGA_VC_004 | Chachiyo | 20.0 | 1.0 | S1 | 1 | 0.5 |
| SL-MGGA | SL_MGGA_VC_006 | Chachiyo | 64 | 0.5 | – | – | – |
| NL-MGGA | NL_MGGA_VC_005 | Chachiyo | 1.0 | 1.0 | S1 | 1 | 1 |
| NL-MGGA-PBE | NL_MGGA_VP_007 | PBE | 0.05 | 1.0 | S1 | 1 | 1 |
| NL-MGGA-DTR | NL_MGGA_VP_DTR | PBE | 0.1 | 1.0 | S1 | 1 | 1 |
| PBE | r2SCAN | SL-GGA | NL-GGA | SL-MGGA | NL-MGGA | NL-MGGA-DTR111111Note that all sub-databases contain both train and test data for NL-MGGA-DTR, so the train-test partition does not apply for this functional. | |
| W4-11 (Train) | 0.86 | 0.32 | 0.80 | 0.72 | 0.49 | 0.34 | 0.32 |
| G21IP (Train) | 0.72 | 0.41 | 0.75 | 0.69 | 0.89 | 0.65 | 0.59 |
| BH76 (Train) | 1.02 | 0.44 | 1.06 | 0.77 | 0.68 | 0.52 | 0.51 |
| G21EA (Test) | 0.80 | 0.36 | 0.91 | 0.58 | 0.64 | 0.41 | 0.38 |
| FH51 (Test) | 0.68 | 0.25 | 0.63 | 0.59 | 0.49 | 0.37 | 0.30 |
| DIPCS10 (Test) | 0.61 | 0.28 | 0.62 | 0.48 | 0.52 | 0.30 | 0.27 |
| INV24 (Test) | 0.68 | 0.30 | 0.59 | 0.57 | 0.55 | 0.42 | 0.36 |
| PX13 (Test) | 0.81 | 0.21 | 1.02 | 0.62 | 0.38 | 0.28 | 0.30 |
| PBE | r2SCAN | SL-GGA | NL-GGA | SL-MGGA | NL-MGGA | NL-MGGA-DTR | PBE0 | |
|---|---|---|---|---|---|---|---|---|
| H2 | 0.750 | 0.741 | 0.750 | 0.747 | 0.746 | 0.742 | 0.744 | 0.745 |
| N2 | 1.102 | 1.093 | 1.104 | 1.096 | 1.095 | 1.089 | 1.094 | 1.089 |
| O2 | 1.218 | 1.206 | 1.219 | 1.206 | 1.205 | 1.194 | 1.200 | 1.192 |
| F2 | 1.413 | 1.397 | 1.417 | 1.403 | 1.390 | 1.385 | 1.385 | 1.375 |
| HF | 0.930 | 0.921 | 0.930 | 0.922 | 0.919 | 0.916 | 0.918 | 0.918 |
| CO | 1.135 | 1.126 | 1.139 | 1.129 | 1.130 | 1.123 | 1.127 | 1.122 |
| MAD vs. PBE0 | 0.018 | 0.008 | 0.020 | 0.010 | 0.007 | 0.003 | 0.005 | — |
| PBE | r2SCAN | SL-GGA | NL-GGA | SL-MGGA | NL-MGGA | NL-MGGA-DTR | PBE0 | |
|---|---|---|---|---|---|---|---|---|
| H2 | 104.7 | 107.7 | 104.1 | 105.5 | 104.7 | 104.3 | 104.6 | 104.4 |
| N2 | 243.6 | 220.3 | 223.8 | 222.5 | 221.3 | 223.1 | 223.2 | 225.6 |
| O2 | 143.7 | 129.9 | 129.5 | 121.8 | 125.7 | 122.8 | 126.8 | 124.6 |
| F2 | 52.8 | 38.7 | 45.7 | 35.8 | 34.7 | 34.9 | 38.1 | 35.1 |
| HF | 142.1 | 138.3 | 138.0 | 139.4 | 136.6 | 137.0 | 138.2 | 136.9 |
| CO | 268.9 | 255.6 | 256.0 | 253.0 | 254.8 | 254.8 | 255.6 | 255.6 |
| MAD vs. PBE0 | 12.3 | 3.1 | 3.2 | 2.1 | 1.2 | 0.9 | 1.5 | — |
| PBE | r2SCAN | PBE0/NL-MGGA-DTR | HSE06 [67] | Experimental | |
| Li | 3.437 | 3.484 | 3.473 | 3.466 | 3.477 |
| C | 3.572 | 3.560 | 3.567 | 3.549 | 3.567 |
| GaAs | 5.769 | 5.680 | 5.717 | 5.692 | 5.641 |
| LiF | 4.096 | 4.032 | 4.059 | 4.003 | 4.010 |
| MgO | 4.267 | 4.223 | 4.213 | 4.208 | 4.207 |
| Pt | 3.968 | 3.950 | 3.952 | 3.949 | 3.916 |
| Pd | 3.941 | 3.919 | 3.956 | 3.929 | 3.881 |
| TiC | 4.340 | 4.337 | 4.350 | 4.307 | 4.330 |
| ZrC | 4.703 | 4.709 | 4.707 | 4.700 | 4.696 |
| Rb | 5.672 | 5.745 | 5.834 | 5.756 | 5.585 |
| Na | 4.200 | 4.225 | 4.242 | 4.215 | 4.225 |
| Sr | 6.031 | 6.077 | 6.108 | 6.092 | 6.048 |
| MgS | 5.228 | 5.207 | 5.192 | 5.207 | 5.202 |
| InP | 5.954 | 5.906 | 5.923 | 5.907 | 5.861 |
| AlSb | 6.221 | 6.175 | 6.182 | 6.184 | 6.128 |
| LiCl | 5.156 | 5.133 | 5.096 | 5.107 | 5.106 |
| GaSb | 6.213 | 6.127 | 6.159 | 6.152 | 6.082 |
| Ge | 5.767 | 5.679 | 5.708 | 5.702 | 5.652 |
| AlP | 5.512 | 5.482 | 5.484 | 5.482 | 5.458 |
| InAs | 6.180 | 6.106 | 6.133 | 6.125 | 6.036 |
| Cu | 3.644 | 3.592 | 3.669 | 3.633 | 3.603 |
| Ir | 3.867 | 3.848 | 3.848 | 3.840 | 3.835 |
| Nb | 3.305 | 3.303 | 3.324 | 3.305 | 3.296 |
| Rh | 3.839 | 3.821 | 3.839 | 3.797 | 3.798 |
| Ta | 3.319 | 3.311 | 3.318 | 3.328 | 3.301 |
| ZnSe | 5.732 | 5.656 | 5.712 | 5.700 | 5.668 |
| CdTe | 6.599 | 6.545 | 6.585 | 6.574 | 6.480 |
| HfC | 4.659 | 4.649 | 4.645 | 4.647 | 4.638 |
| TiN | 4.255 | 4.244 | 4.250 | 4.213 | 4.239 |
| ZnTe | 6.169 | 6.097 | 6.149 | 6.152 | 6.104 |
| ZrN | 4.587 | 4.584 | 4.572 | 4.575 | 4.585 |
| Ba | 5.108 | 5.172 | 5.221 | 5.087 | 5.007 |
| Ca | 5.512 | 5.549 | 5.580 | 5.582 | 5.565 |
| Al | 4.041 | 3.991 | 4.016 | 4.020 | 4.032 |
| K | 5.286 | 5.351 | 5.406 | 5.313 | 5.225 |
| GaN | 4.603 | 4.542 | 4.561 | 4.504 | 4.520 |
| GaP | 5.545 | 5.479 | 5.509 | 5.467 | 5.442 |
| BN | 3.626 | 3.611 | 3.616 | 3.601 | 3.607 |
| Si | 5.475 | 5.450 | 5.456 | 5.444 | 5.430 |
| SiC | 4.389 | 4.365 | 4.372 | 4.353 | 4.358 |
| AlAs | 5.736 | 5.683 | 5.694 | 5.694 | 5.652 |
| BP | 4.551 | 4.539 | 4.549 | 4.530 | 4.538 |
| NaF | 4.701 | 4.611 | 4.646 | 4.635 | 4.609 |
| InSb | 6.620 | 6.548 | 6.577 | 6.561 | 6.469 |
| NaCl | 5.696 | 5.616 | 5.622 | 5.645 | 5.595 |
| BAs | 4.815 | 4.787 | 4.801 | 4.781 | 4.777 |
| Sn | 6.641 | 6.560 | 6.591 | 6.580 | 6.482 |
| Ni | 3.530 | 3.498 | 3.554 | 3.512 | 3.513 |
| Fe | 2.844 | 2.874 | 2.926 | 2.897 | 2.861 |
| Ag | 4.147 | 4.111 | 4.192 | 4.153 | 4.069 |
| Mo | 3.160 | 3.155 | 3.165 | 3.145 | 3.144 |
| W | 3.183 | 3.173 | 3.175 | 3.173 | 3.162 |
| Au | 4.177 | 4.148 | 4.169 | 4.153 | 4.065 |
| V | 3.003 | 2.998 | 3.022 | 2.970 | 3.028 |
| HfN | 4.552 | 4.532 | 4.522 | 4.531 | 4.519 |
| VN | 4.132 | 4.115 | 4.120 | 4.073 | 4.135 |
| VC | 4.168 | 4.156 | 4.164 | 4.112 | 4.160 |
| CdS | 5.920 | 5.877 | 5.914 | 5.885 | 5.818 |
| NbN | 4.412 | 4.408 | 4.397 | 4.392 | 4.379 |
| CdSe | 6.188 | 6.130 | 6.173 | 6.147 | 6.050 |
| NbC | 4.478 | 4.478 | 4.472 | 4.461 | 4.470 |
| ZnS | 5.445 | 5.380 | 5.435 | 5.419 | 5.404 |
| Train MAE | 0.050 | 0.020 | 0.031 | 0.022 | - |
| Val. MAE | 0.057 | 0.029 | 0.050 | 0.039 | - |
| Test MAE | 0.058 | 0.036 | 0.052 | 0.038 | - |
| ME | 0.051 | 0.025 | 0.046 | 0.025 | - |
| MAE | 0.057 | 0.031 | 0.048 | 0.036 | - |
| RMSE | 0.071 | 0.046 | 0.070 | 0.050 | - |
| MAXE121212Maximum absolute error. | 0.159 | 0.165 | 0.249 | 0.171 | - |
| MAXE Syst.131313System with the maximum absolute error. | Sn | Ba | Rb | Rb | - |
| Functional | Lattice Constant (Å) | Band Gap (eV) |
|---|---|---|
| PBE | ||
| PBE0/NL-MGGA-DTR | ||
| PBE0(0.3)/NL-MGGA-DTR | ||
| Experimental |
| Defect | Level | PBE | PBE0()/CIDER | Expt. | ||||
|---|---|---|---|---|---|---|---|---|
| Fr. | Pot. | Fr. | Pot. | Fr. | Pot. | |||
| vac | ++/+ | -0.09 | 0.06 | -0.01 | 0.14 | -0.01 | 0.14 | 0.13 |
| vac | +/0 | -0.04 | 0.01 | 0.05 | 0.10 | 0.08 | 0.13 | 0.05 |
| vac | ++/0 | -0.06 | 0.04 | 0.02 | 0.12 | 0.03 | 0.13 | 0.09 |
| vac | 0/- | 0.46 | 0.41 | 0.71 | 0.66 | 0.76 | 0.71 | — |
| vac | -/– | 0.50 | 0.35 | 0.84 | 0.69 | 0.92 | 0.77 | — |
| vac | 0/– | 0.48 | 0.38 | 0.77 | 0.68 | 0.84 | 0.74 | — |
| P | +/0 | 0.53 | 0.58 | 0.99 | 1.03 | 1.08 | 1.13 | 1.12 |
| B | 0/- | 0.07 | 0.02 | 0.07 | 0.02 | 0.08 | 0.03 | 0.04 |
| Cu | +/0 | 0.04 | 0.09 | 0.03 | 0.07 | 0.02 | 0.07 | 0.21 |
| Cu | 0/- | 0.24 | 0.19 | 0.53 | 0.48 | 0.58 | 0.53 | 0.48 |
| Cu | -/– | 0.66 | 0.52 | 0.91 | 0.76 | 0.97 | 0.82 | 1.00 |
| S | ++/+ | 0.12 | 0.27 | 0.47 | 0.61 | 0.54 | 0.69 | 0.58 |
| S | +/0 | 0.41 | 0.46 | 0.83 | 0.88 | 0.92 | 0.97 | 0.85 |
| k-point mesh | Functional | (eV) |
|---|---|---|
| HSE06 | 0.19 | |
| CIDER | 0.15 | |
| MP | HSE06 | 0.25 |
| MP | CIDER | 0.23 |
| HSE06 | 0.10 | |
| CIDER | 0.14 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMachine Learning in Materials Science · X-ray Diffraction in Crystallography · Surface and Thin Film Phenomena
\NewDocumentCommand\MakeTitleInner
+m +m +m
#1
#2
#3
\NewDocumentCommand\MakeTitle
+m +m +m \MakeTitleInner#1#2#3
Nonlocal Machine-Learned Exchange Functional for Molecules and Solids
Kyle Bystrom
Harvard John A. Paulson School of Engineering and Applied Sciences
Boris Kozinsky
Harvard John A. Paulson School of Engineering and Applied Sciences
Robert Bosch LLC Research and Technology Center, Cambridge, MA, USA
Abstract
The design of better exchange-correlation functionals for Density Functional Theory (DFT) is a central challenge of modern electronic structure theory. However, current developments are limited by the mathematical form of the functional, with efficient semilocal functionals being inaccurate for many technologically important systems and the more accurate hybrid functionals being too expensive for large solid-state systems due to the use of the exact exchange operator. In this work, we use machine learning combined with exact physical constraints to design an exchange functional that is both orbital-dependent and nonlocal, but which can be evaluated at roughly the cost of semilocal functionals and is significantly faster than hybrid DFT in plane-wave codes. By training functionals with several different feature sets, we elucidate the roles of orbital-dependent and nonlocal features in learning the exchange energy and determine that both types of features provide vital and independently important information to the model. Having trained our new exchange functional with an expressive, nonlocal feature set, we substitute it into existing hybrid functionals to achieve hybrid-DFT accuracy on thermochemical benchmark sets and improve the accuracy of band gap predictions over semilocal DFT. To demonstrate the scalability of our approach as well as the practical benefits of improved band gap prediction, we compute charged defect transition levels in silicon using large supercells. Due to its transferability and computational efficiency for both molecular and extended systems, our model overcomes the cost-accuracy trade-off between semilocal and hybrid DFT, and our general approach provides a feasible path toward a universal exchange-correlation functional with post-hybrid DFT accuracy and semilocal DFT cost.
††preprint: APS/123-QED
I Introduction
One of the central challenges of density functional theory (DFT) research is to design accurate approximations to the exchange-correlation (XC) functional [1]. As the one unknown term in Kohn-Sham DFT [2], errors in the XC functional limit the predictive power of DFT for a wide variety of technological problems, including catalysis [3], battery materials [4], and semiconductor physics [5, 6]. Increasing the complexity of the XC functional can improve its accuracy, but often at drastically increased computational cost, which can be prohibitive for large condensed matter simulations. In this work, we use machine learning (ML) to overcome this cost-accuracy trade-off by training a functional that obeys exact physical constraints and is accurate, transferable across a broad range of chemistries, and scalable to hundreds of atoms.
The key cost-accuracy trade-off we address is that between semilocal and hybrid XC functionals. Semilocal functionals, which include the local density (LDA), generalized gradient (GGA), and meta-generalized gradient (meta-GGA) approximations [7], are computationally efficient but insufficiently accurate for many applications. For example, LDA and GGA functionals are incapable of correctly predicting band gaps due to their lack of derivative discontinuity [8, 9], and they are of limited accuracy for many molecular systems [10, 1]. Meta-GGAs introduce derivative discontinuities into the XC functional and can therefore improve band gap predictions, but they typically account for only 20-50% of the GGA band gap error [11]. Meta-GGAs can be designed to further improve band gap prediction [12, 13], but this typically makes them less accurate for other chemical and material properties [14, 13]. On the other hand, hybrid functionals, which mix a fraction of the nonlocal, orbital-dependent exact exchange energy into an XC functional [15], provide significantly improved predictions of band gaps [11, 16] and thermochemical properties [10, 1] over semilocal DFT, but this accuracy comes at a steep additional computational cost, especially within plane-wave DFT. In spite of the remarkable recent progress of efficient implementations for solids [17, 18, 19, 20, 21, 22], hybrid DFT is not as efficient or practical as semilocal DFT for these calculations.
Several approaches to design improved XC functionals have been pursued in the literature. Many popular functionals, like PBE [23] and SCAN [24], are designed primarily or entirely based on mathematical constraints obeyed by the exact functional. Constraint-based functionals are often considered trustworthy because they can predict molecular properties without being explicitly tuned to match them, reducing the risk of bias toward a particular property or set of systems [25, 26]. However, some of the most popular and accurate XC functionals for main-group chemistry are empirically fit to chemical data [27, 28]. Recently, significant progress has been made toward combining constraint and data-based design approaches using machine learning [29, 30, 31, 32, 33, 34, 35], culminating in the development of the DM21 functional [36], which achieves state-of-the-art accuracy on main-group chemistry benchmarks and also improves the description of static correlation by fitting fractional charge constraints. However, these developments have a few key shortcomings, the most notable of which is their computational cost. DM21 uses a local range-separated hybrid formalism, which is both computationally expensive (even more so than global hybrid DFT) and not widely available for the plane-wave DFT codes typically used for investigating large extended systems. As a result, DM21 has not been applied to solids. To the best of our knowledge, no nonlocal machine-learned functional has yet been implemented for both molecular (Gaussian-type orbital) and extended (plane-wave) DFT calculations. All ML XC functionals that have been applied to solids have been limited to the semilocal form [37, 38, 39, 33, 13, 32] (with the exception of minimally parameterized nonlocal van der Waals terms). This limitation restricts the application space of ML functionals by preventing their use in applications where hybrid DFT accuracy is needed. For example, hybrid DFT has proven useful for studying problems in catalysis within both Gaussian-type orbital and plane-wave codes [40, 41] and for addressing the “band-gap problem” in DFT [42]. In addition to this technical limitation, applications of ML functionals to complex and heterogeneous systems are limited by their lack of transferability and universality, as they have previously only been trained on molecular systems.
These limitations raise the question of whether machine learning models can leverage more efficient features to balance accuracy and computational cost. One possible approach in this direction is to generate nonlocal features by taking integrals of the density over some set of kernel functions [43, 30]. This way, an ML model of the XC functional can capture nonlocal information about the density without evaluating the computationally demanding exact exchange operator required for hybrid DFT. In line with this strategy, we recently introduced the Compressed Scale-Invariant Density Representation (CIDER) formalism for learning the exchange functional [35], in which the nonlocal features were designed to enforce the uniform scaling rule [44] on the learned exchange functional. We demonstrated that the exchange functional can be accurately learned with the CIDER approach. However, the high computational cost of computing the features, lack of implementation for plane-wave DFT, and need to compute and fit the exchange energy density (rather than the total exchange energy) limited the practicality of our model and restricted its application to small molecules.
In this work, we overcome all of these limitations by revising the form of the features, implementing efficient algorithms to evaluate the new features for both molecular and extended systems, and redesigning our Gaussian process model to fit the total exchange energy. These improvements enable the construction of the first nonlocal ML exchange functional for solids, which can be applied to large systems with similar computational cost to semilocal DFT. We use this exchange functional to implement efficient and accurate hybrid DFT surrogates that can be used in a variety of applications. Because our new functional form is both transferable and numerically stable, we believe that our approach will also be useful for learning the full XC functional, allowing us to address problems where DFT accuracy is limited by both exchange and correlation effects.
By training and testing exchange functionals with different feature sets, we find that to learn the exchange functional well enough to reproduce hybrid DFT accuracy, it is necessary to include both a semilocal orbital-dependent feature (i.e. the kinetic energy density) and nonlocal density-dependent features in the model input. Having developed this “nonlocal meta-GGA,” we show that it accurately predicts molecular properties and improves solid-state band gap predictions over standard meta-GGAs, which is a central problem in DFT. The performance for solid-state band gaps is particularly promising because only 9 systems in the training set are solids, with the others being isolated molecules and complexes. We also encounter only a few convergence problems in our self-consistent field (SCF) calculations of 2462 molecular and 453 periodic systems, indicating that the functional has good numerical stability. To demonstrate the transferability and efficiency of the model, we simulate systems with over 500 atoms to predict charged defect transition levels in silicon.
The paper is organized as follows. Section II describes the features and models used in this work. Specifically, Sections II.1-II.3 cover the design of the exchange functional form, nonlocal features, and Gaussian process models, and Sections II.4-II.6 cover the computationally efficient implementation of the nonlocal features. The feature implementation requires different algorithms for Gaussian-type orbital and plane-wave DFT codes due to the use of different basis sets and integration grids within these codes. Section III presents our results and discusses the accuracy of the CIDER functionals as well as their computational efficiency within plane-wave DFT. Section IV contains our conclusion. Appendix A provides a detailed description of the computational methods used in this study, and Appendix B provides additional mathematical details for the plane-wave implementation of CIDER.
II Theory
II.1 Uniform Scaling and Semilocal Exchange Functionals
The simplest approximate exchange functional is the local density approximation (LDA), in which the exchange energy is written as an integral over a function of the density at each point in space
[TABLE]
where in the Kohn-Sham formalism [2], the density is expressed in terms of the occupied Kohn-Sham orbitals as
[TABLE]
Typically, is defined as the exchange energy density of the uniform electron gas [45]
[TABLE]
In addition to making Eq. 1 exact for the uniform electron gas, Eq. 3 also satisfies an important property of the exchange functional known as uniform scaling [44, 46], which states that for a scaled density distribution ,
[TABLE]
where is a positive scalar.
More sophisticated functionals that obey Eq. 4 can be designed by writing the exchange energy density as the product of Eq. 3 and an exchange enhancement factor :
[TABLE]
In the above equation, is a vector of scale-invariant features , where scale-invariance is defined as
[TABLE]
For example, a generalized gradient approximation (GGA) exchange functional is constructed by making a function of the reduced gradient of the density
[TABLE]
Because satisfies Eq. 6, the GGA form
[TABLE]
satisfies Eq. 4. Likewise, a meta-GGA can be constructed by making the exchange enhancement factor a function of two ingredients , where
[TABLE]
is a scale-invariant quantity that depends on the kinetic energy density
[TABLE]
and and are the single-orbital and uniform electron gas kinetic energies, respectively.
II.2 Updated Nonlocal Features for CIDER
In our initial construction of the CIDER formalism [35], the exchange functional is defined as
[TABLE]
In Eq. 11, the functions are nonlocal features of the density defined by
[TABLE]
where is the kernel function
[TABLE]
with . The position-dependent exponent is defined as a semilocal quantity
[TABLE]
where and are adjustable constants that control how quickly the width of the Gaussian-type function in Eq. 13 decreases as the density and kinetic energy density increase. Equation 14 behaves quadratically under uniform scaling, i.e.
[TABLE]
This property ensures that is scale-invariant (Eq. 6) and therefore that Eq. 11 satisfies the uniform scaling rule (Eq. 4).
Using these scale-invariant nonlocal features, we were able to train an ML exchange functional that accurately reproduced hybrid DFT atomization energies [35], but there is room for improvement in making the features more physically intuitive and easier to implement with production-level performance. In this work, we simplify the features by removing the spherical harmonics and instead adding a second position-dependent exponent to the kernel function, resulting in the new features
[TABLE]
where . Multiple features are generated by evaluating Eq. 16 with different choices of exponent , which in turn are tuned by constants and as discussed below. Because the position-dependent exponents increase as the density increases, the exponent effectively damps the value of the integrand when is near the atomic core, which we intend to limit the interaction between the valence region and core region. The prefactor is chosen so that for the non-spin-polarized uniform electron gas.
To retain the scale-invariance of the features, any choice of and that satisfies Eq. 15 is sufficient. For example, the exponents can be an orbital-independent function of the density and its gradient
[TABLE]
or can be made kinetic-energy dependent, as in our previous work:
[TABLE]
Equations 18 and 19 are used for the nonlocal GGA presented in this work (c.f. Section II.3), while Eqs. 20 and 21 are used for the nonlocal meta-GGA. Throughout Sections II.4–II.6, the factor and the feature index will be dropped for simplicity, as the inclusion of the scaling factor and the introduction of additional features by introducing new and is trivial.
Before continuing, we briefly discuss the construction of the XC potential when nonlocal features are employed. A typical semilocal functional has a multiplicative XC potential
[TABLE]
A meta-GGA also has a non-multiplicative potential
[TABLE]
This above term is non-multiplicative because it depends on the orbitals through , rather than depending exclusively on the density . For a given basis set , the contributions of the XC potential to the effective Kohn-Sham Hamiltonian matrix are
[TABLE]
When nonlocal features are included in the XC energy density, e.g.
[TABLE]
both and include terms that depend on other coordinates :
[TABLE]
In Eq. 26, describes the local dependence of on the density, and describes the nonlocal dependence on the density. While the expressions for and are more complicated than for a meta-GGA, they still contribute to the XC potential matrix exclusively via Eq. 24. Therefore, any DFT code supporting meta-GGAs can evaluate this type of functional if a routine is added to evaluate the nonlocal contributions to Eqs. 26 and 27.
II.3 Gaussian Process Models for the Exchange Functional
II.3.1 Overview
To study the impact of different features on the accuracy of an ML exchange functional, we implement four functional types that use different sets of features.
SL-GGA is a standard, semilocal GGA like PBE [23], except trained via machine learning. The exchange enhancement factor ( in Eq. 5) is a function of only.
NL-GGA is a nonlocal GGA. The exchange enhancement factor is a function of and also three nonlocal features of the form in Eq. 16, with Eqs. 18 and 19 for the exponents.
SL-MGGA is a standard, semilocal meta-GGA like SCAN [24], except trained via machine learning. The exchange enhancement factor is a function of and .
NL-MGGA is a nonlocal meta-GGA. The exchange enhancement factor is a function of , , and three nonlocal features of the form in Eq. 16, with Eqs. 20 and 21 for the exponents.
In short, the SL-GGA and SL-MGGA are semilocal functionals, while the NL-GGA and NL-MGGA augment the semilocal functional form with nonlocal features of the density. As described below, Gaussian process models for the exchange energy are trained using each of these four feature sets. Section II.3.2 explains how a Gaussian process can be fit to total exchange energies of systems, Section II.3.3 lists the specific mathematical form of each feature vector that is used as input to the Gaussian processes, and Section II.3.4 lists the covariance kernels for each feature set.
II.3.2 Fitting the Total Exchange Energy
In this section, we introduce a method to train a Gaussian process to the total exchange energy, as opposed to the nonunique exchange energy density used in our previous work [35]. While we only learn the exchange energy in this work, the approach in this section also applies to the XC energy, so we write the formulas for for the sake of generality. We index the chemical systems in the training set by and ; while the feature implementations are different for Gaussian-type orbital and plane-wave DFT, the role of systems computed within these two frameworks is identical with regard to model training. The XC energy of a system is given by a numerical integral of the energy density
[TABLE]
where indexes grid points, are the quadrature weights, and is the feature vector of molecule at point .
Constructing a Gaussian process for requires defining the covariance between the XC energies for two different systems, i.e. (c.f. Chapter 2.2 of Ref. [47]). The covariance of sums of random variables , and is
[TABLE]
Using Eq. 29, can be expressed in terms of the covariances between the exchange energy densities as
[TABLE]
where is the kernel function for the covariance between the exchange energy densities for feature vectors and . Unfortunately, this expression can be quite expensive to compute because there are tens of thousands of grid points for each small molecule in the training database, and there are several hundred training molecules. Therefore, even for a database of this size, at least evaluations of the kernel are required. To circumvent this problem, a set of control points can be sampled from the training database to use as a basis set for the feature space (as described in Appendix A.6), effectively creating a sparse Gaussian process [47]. Equation 31 is then approximated as
[TABLE]
In this work, the exchange functional alone is trained, and the kernel function is expressed in terms of the exchange enhancement factor . This leads to the adjusted expressions
[TABLE]
where is the covariance kernel for the exchange enhancement factor and is the LDA exchange energy density at coordinate for molecule (see Eq. 3). Having obtained these covariance expressions, they can be plugged into the predictive function for a Gaussian process to obtain the predicted exchange energy density [47]
[TABLE]
where and are the density and feature vector at a test point and is the noise matrix for the training points.
II.3.3 Model Types and Feature vectors
This section lists the features vectors for each model type in terms of , , and the nonlocal features . All of the feature vectors below are designed such that for the uniform electron gas, . Since the exchange enhancement factor is for the uniform electron gas [45], the uniform electron gas constraint is enforced with a single noiseless training point . All functionals in this work satisfy the uniform electron gas constraint.
For the SL-GGA, the feature vector (Eq. 5) is of dimension 1:
[TABLE]
where . For the SL-MGGA, the feature vector is of dimension 2:
[TABLE]
The NL-GGA feature vector is of dimension 4:
[TABLE]
Lastly, the NL-MGGA feature vector is of dimension 5:
[TABLE]
The three nonlocal features are given by Eq. 16, and the parameterization of the kernel exponents (i.e. the constants and in Eqs. 18-21) is described in detail in Appendix A.8.
II.3.4 Kernels
To fully define the Gaussian process model for the exchange energy, we need to specify the kernel function and the noise matrix . The noise matrix is constructed to heuristically approximate the model uncertainty for each data point as described in Appendix A.7. To reflect the different number of features in each functional type, a different kernel is used for for each, as specified below:
[TABLE]
where is a covariance hyperparamter that can be tuned for each model, and the base kernel for all of the above is the squared-exponential kernel
[TABLE]
Equations 54 and 55 use the additive Gaussian process approach of Duvenaud et al. [48].
Because the cost of evaluating Gaussian processes scales linearly with the number of training points, we map the models to cubic splines after training so they can be evaluated efficiently. The semilocal kernels have 1 and 2 features, so they can be mapped to 1 and 2-dimensional cubic splines, respectively. The nonlocal kernels are a sum of 3-dimensional kernels, so they can be mapped to a sum of 3-dimensional cubic splines [35, 49, 50, 51]. For a description of the hyperparameter selection process for these models, see Appendix A.8.
II.4 Auxiliary Expansion of the CIDER Features
The rest of this section focuses on the efficient implementation of nonlocal features within Gaussian-type orbital and plane-wave DFT, which is vital for the practicality of CIDER as well as any nonlocal density functional. Evaluating Eq. 16 via simple numerical integration is computationally expensive due to quadratic scaling with the number of grid points. The key insight to designing a faster approach comes from Román-Pérez and Soler [52], who developed a method to expand van der Waals density functionals as a sum of convolutions that could be computed efficiently using the Fast Fourier Transform (FFT). They observed that a function of the form in Eq. 17, which depends exclusively on the distance between two points and on a semilocal functional of the density at each point, can be approximately expanded as [52]
[TABLE]
In Eq. 57, and are indices which each correspond to interpolation control points and , respectively, and and are interpolating functions which take the values and at the interpolating points. The term .
The Gaussian form of Eq. 17 is particularly conducive to the use of an even-tempered basis for interpolation, in which takes the form
[TABLE]
for some constant . Román-Pérez and Soler [52] used cubic splines for interpolation, and for the solid-state implementation of CIDER we follow this approach, with being a cubic spline in the transformed coordinate . However, for the molecular implementation, we instead treat the control points as a basis set of Gaussians onto which the function is projected:
[TABLE]
As , the number of control points (for the interpolation scheme) and number of basis functions (for the basis expansion scheme) both become infinite, and the kernel expansion becomes exact for both approaches. We determined that the cubic interpolation yields better numerical stability for the plane-wave implementation, while the basis expansion form is sufficiently stable for the molecular implementation.
The benefit of expanding the kernel in this manner is that it allows the nonlocal features (Eq. 16) to be expressed as a sum of convolutions
[TABLE]
with . Within plane-wave DFT, the convolutions can be performed with quasi-linear computational complexity by using FFTs [52]. In addition, because CIDER uses a squared exponential for (Eq. 17), the integrals of Eq. 62 can be evaluated analytically if is projected onto a Gaussian-type orbital basis. We use this approach to design an efficient implementation of nonlocal features for molecular DFT (Section II.5).
The Román-Pérez and Soler technique effectively solves the problem of implementing CIDER for an all-electron plane-wave DFT calculation. However, such calculations are infeasible for all but the smallest atoms because a prohibitively large plane-wave basis would be required to describe the core electrons. In addition, pseudopotentials cannot be reliably used with nonlocal features because calculated on the pseudo-density is not sufficiently similar to the exact, all-electron feature to get accurate results. To solve this problem, in Section II.6 a method is introduced to implement nonlocal features within the Projector-Augmented Wave (PAW) formalism, which requires careful attention to the contributions to from the core regions around the atoms.
II.5 Efficient Implementation of Nonlocal CIDER Features for Molecular DFT
Computing Eq. 16 using simple numerical integration is prohibitively expensive because the number of operations is proportional to the square of the number of grid points. While van der Waals functionals like VV10 [53] are usually computed this way within Gaussian-type orbital codes, the contribution of such functionals is sufficiently small and smooth that a very sparse grid can be used. The nonlocal features require a grid of the same density as the semi-local XC grid, and the number of operations per pair of grid points is higher for evaluating Eq. 16 than, for example, the VV10 kernel. It is therefore necessary to reduce the computational complexity of evaluating the nonlocal features for molecular systems.
In this section, we present a charge-partitioning and density fitting-based approach to compute the nonlocal features within Gaussian-type orbital DFT codes for molecular systems, which we have implemented for use in the PySCF code [54, 55]. Much of this approach is based on the work of Franchini et al. [56], who developed a similar method for computing the Coulomb energy. To start, the density is partitioned into atomic contributions using Becke partitioning [57],
[TABLE]
where is the atom index and are the atomic Becke weights. The first step in evaluating is to project onto spherical harmonics for each atom (with being the combined and indices and being the position of atom ):
[TABLE]
where is the solid angle and are the spherical harmonics. This projection is done numerically using Lebedev grids [58]. Then each channel is projected onto an even-tempered Gaussian basis as follows:
[TABLE]
where is a normalization constant. A transformation to is then performed by multiplying by the inverse overlap matrix of the basis for each channel:
[TABLE]
For each atom, a second even-tempered basis is introduced to serve as a basis for the convolutions of . The following projections are computed using analytical Gaussian integrals:
[TABLE]
The transformation to is then performed by multiplying by the inverse overlap matrix of the basis, similarly to Eq. 69.
The second to last step is to transform to , where are the DFT integration grid points. Because evaluating every orbital of every atom at every would be computationally expensive, is first transformed to a cubic spline , where is a radial coordinate index and indexes the polynomial coefficients of the spline in the interval between points and . The transformation from to is then performed as
[TABLE]
where . The spline index is a step function of . The computational cost of evaluating the splines depends heavily on the efficiency of memory access patterns on the processor, so the coordinates are sorted by before evaluating the contributions to from each atom.
The last and simplest step to compute is to evaluate the sum
[TABLE]
After computing the XC energy from and the semilocal features, the contributions to the XC potential from the terms can then be computed by passing backward through the matrix multiplications and spline evaluations detailed above.
The gradient of the total energy with respect to nuclear positions is required for the evaluation of forces. For this implementation of the CIDER functional with nonlocal features, the energy gradient with respect to the nuclear position of atom is given by
[TABLE]
where is the numerical integration quadrature weight corresponding to the coordinate . The analytical energy gradient for a semilocal functional is given by Johnson et al. [59] Eq. 11.
The computational bottleneck of this algorithm is the spline evaluation in Eq. 72, whose cost scales quadratically with system size. Specifically, the cost scales as , where is the number of integration grid points, is the number of atoms, is the number of kernel interpolation points in Eq. 57, and is the maximum spherical harmonic order used in the basis. All other components of the algorithm are performed on individual atoms and therefore scale linearly with system size.
II.6 Quasi-linear-scaling Implementation of Nonlocal CIDER Features for Plane-Wave DFT with PAW
This section covers our algorithm for evaluating the nonlocal CIDER features within the PAW method, which we have implemented for use in the GPAW code [60, 61]. As discussed above, Eqs. 57 and 62 enable the evaluation of in complexity if the density is represented on an FFT grid. However, in the PAW formalism of Blöchl [62], the all-electron density is represented as
[TABLE]
where indexes the atoms, is the pseudo-density on the FFT grid, and the terms with the 1 superscript are represented on radial support grids centered around each atom and extending to some cutoff radius . Within the cutoff radius of atom , and is the all-electron density. Outside the cutoff radius of atom , .
For a semilocal functional, Eq. 75 yields a simple expression for the XC energy :
[TABLE]
However, the case is not so simple for a nonlocal functional such as CIDER. If Eq. 16 is evaluated on the pseudo-density , the resulting is not equal to the all-electron feature even outside the cutoff radius because the convolutions used to compute the feature are nonlocal. For the same reason, evaluating Eq. 16 on the atomic density does not yield the all-electron feature inside the cutoff radius. Therefore, an implementation of CIDER (and other nonlocal functionals with similar forms, such as some kinetic energy and van der Waals functionals) for PAW must construct a “pseudo-feature” that is smooth but yields the correct feature outside the cutoff radius, as well as construct the “all-electron feature” on the radial support grids inside the cutoff radius. Because of this difficulty, current plane-wave implementations of nonlocal functionals like van der Waals correlation [63] use the pseudo-density rather than the full PAW density. When this approach is used, one may need to use PAW datasets with more valence electrons, increasing computational cost [63]. In addition, because CIDER computes the exchange energy, which is of much larger magnitude than the vdW correlation energy, higher precision is required. A similar problem exists for kinetic energy density functionals. There is an implementation of semilocal kinetic energy functionals with PAW [64], and of nonlocal kinetic energy functionals with ultrasoft pseudopotentials [65], but no nonlocal kinetic energy functional implementation within PAW. Therefore, by designing an algorithm for computing nonlocal features for CIDER within the PAW formalism, we also enable the full integration of vdW functionals and OF-DFT into the PAW method.
Starting from Eq. 75, the all-electron value of outside the core region is given by
[TABLE]
The quantities and are evaluated on the pseudo-density ; and are evaluated on and , respectively. can be evaluated simply using FFTs. To compute , the term is first projected onto spherical harmonics and Gaussians as described in the previous section. Because is localized within the augmentation region and mostly spherical, far fewer spherical harmonics and Gaussians are required per atom than for the Gaussian-type orbital implementation. This projection yields . The Gaussians are then analytically projected into reciprocal space to yield . In reciprocal space,
[TABLE]
The above equation applies because the convolution has angular momentum number . Note that in this section, denoting a symbol as function of implies that it is the Fourier transform of the real-space function corresponding to that symbol. For example, is the Fourier transform of .
Inspired by the PAW method itself, we perform the following trick to approximate on the FFT grid. We introduce two sets of functions, and . The functions form a localized basis (i.e. each function vanishes for ) and are meant to fit the high-frequency components of introduced by the all-electron density inside the core region. For the current implementation of these functions, we use differences between Gaussians and smooth polynomials that match their derivatives at . The other functions are both localized and smooth, and are meant to augment to yield the correct values of outside of the augmentation region. To do this, least-squares linear regression is used to fit the following function to :
[TABLE]
where the coefficients and are determined by the regression. The necessary integrals are actually performed in reciprocal space since Eq. 80 yields rather than . At first glance, this appears to be a complicated and expensive fitting problem, but by separating the basis functions into their individual angular momentum channels and doing a separate regression for each channel, it becomes tractable. The details for this procedure are described in Appendix B.1.
Because the functions are localized, in the complete basis set limit outside the augmentation region. Therefore, by defining
[TABLE]
one can compute
[TABLE]
which satisfies the condition that the pseudo-feature is equal to the all-electron feature outside the augmentation region (within the limit of complete basis sets).
Within the core regions, is projected onto the radial support grids using a set of projectors in a very similar formalism to PAW itself. This quantity will be called , in keeping with the PAW notation, and it takes the form
[TABLE]
In the above equations, and are sets of atom-centered functions satisfying , and vanish for so that the numerical integral of Eq. 88 can be performed efficiently on the FFT grid points inside the cutoff sphere of atom . The symbol represents the FFT grid volume element. This is directly analagous to the projector-partial wave pairs in PAW [62]. The construction of and is described in Appendix B.2. is an approximate convolution of , and while arbitrary, it helps make the expansion more efficient. Using this construction, the all-electron and pseudo-features on the radial support grids are, respectively,
[TABLE]
The exchange correlation energy is then
[TABLE]
where all three terms are semilocal functionals of the density and of the nonlocal features. The first term is evaluated on the FFT grid and the second two on the radial support grids on the atoms.
The use of nonlocal features introduces a few additional terms to the forces and stresses not present for semilocal DFT. The force terms are
[TABLE]
The stress terms are
[TABLE]
In the above equations, and refer to the force and stress equations for semilocal functionals within the PAW formalism (see Section IIID of Ref. [66]).
There are several significant contributions to the computational cost of the above approach for evaluating , which all scale linearly or quasi-linearly with system size. The first contribution comes from computing the Fourier transforms of and (as well as the functional derivatives with respect to these quantities), which is necessary to compute the convolutions like those in Eq. 86 in reciprocal space. The cost of these operations scales as , where is the number of kernel interpolation points in Eq. 57 and is the size of the FFT grid used for integrating the XC energy. The second contribution arises from computing the convolutions in Eq. 86 in reciprocal space, and scales as . The third contribution comes from computing the PAW correction to (the sum in Eq. 84). This and related terms scale as , where is the number of FFT grid points inside the core region of an atom and is the number of atoms. The last contribution consists of all the other PAW correction-related routines (e.g. computing ), which depend only on the density on the atomic radial grids and therefore scale linearly with . The overall scaling with system size is then . Notably, none of the components of the algorithm scale with the number of k-points.
III Results and Discussion
Using Gaussian process regression, we fit each of the model types in Section II.3 (SL-GGA, NL-GGA, SL-MGGA, and NL-MGGA) to match the exact exchange energy on subsets of the GMTKN55 [10] and SOL62 [67, 33] databases, as well as a few dozen other training points. The GMTKN55 database contains 1505 main-group molecular reactions including covalent and non-covalent bond energies, isomerization energies, and barrier heights. The SOL62 database consists of the cohesive energies of 62 solids, including metals and non-metals. For all training systems, the reference exchange energy was determined by computing the exact exchange energy of the orbitals obtained from PBE calculations. We describe the training, validation, and test partitions, training data collection, and validation criteria in Appendices A.1-A.8.
In Section III.1, we demonstrate that the NL-MGGA is capable of learning the exchange functional more accurately than the other feature sets, and in Section III.2 we use 1 and 2-electron systems to provide intuition for why this is. Next, we fine-tune the accuracy and numerical stability of the NL-MGGA (Section III.3) and demonstrate that the resulting exchange functional can be used to match the accuracy of hybrid DFT (Section III.4).
Having designed an accurate exchange functional, we demonstrate that it can be used to improve band gap predictions over standard meta-GGAs in Section III.5. In Section III.6, we show that for large systems in plane-wave DFT, our implementation of the NL-MGGA form is similarly efficient to semilocal meta-GGAs and orders of magnitude more efficient than hybrid DFT. Lastly, in Section III.7, we take advantage of our new functional’s efficiency and accurate band gap predictions to compute charge transition levels of point defects in silicon, which is a notoriously challenging task for DFT due to the underestimation of band gaps and the large supercell sizes required to simulate isolated point defects [6].
III.1 Accuracy of Different Model Types for Molecular and Solid-state Benchmarks
We assess the accuracy of our exchange functionals by substituting them for exact exchange in hybrid functionals to create a “surrogate hybrid” functional. For example, one of the simplest and most popular hybrid functionals is PBE0 [68], which mixes a 1/4 fraction of exact (Hartree-Fock, HF) exchange into the PBE functional [23]:
[TABLE]
In this section we perform so-called PBE0/CIDER calculations, where the exchange-correlation energy is
[TABLE]
One can also generalize PBE0/CIDER with different fractions of CIDER exchange, giving the PBE0()/CIDER form, which is used in later sections:
[TABLE]
All ML functional results in this section refer to the PBE0/CIDER surrogate hybrid form. Because our exchange functionals are trained on exact exchange energies, they are “perfect” if they exactly reproduce the predicted reaction energies of PBE0 on the test set. To measure the error of the exchange functionals, we calculate the average error between PBE0/CIDER and PBE0 for each functional type on GMTKN55 and SOL62. All calculations on GMTKN55 are SCF calculations, but the SOL62 cohesive energy CIDER and PBE0 calculations are performed non-self-consistently with PBE orbitals. See Appendices A.2 and A.3 for details.
The average deviations of the machine-learned functionals from PBE0, using the mean of means (MoM) metric (i.e. the mean of the mean absolute deviations, or MADs, for the subdatabases of GMTKN55), are presented in Fig. 1. Results for all functionals include D4 dispersion corrections [69], with the PBE0/CIDER functionals using the PBE0 dispersion parameters. The nonempirical PBE and r2SCAN [70] functionals, while not designed to reproduce PBE0, are included in Fig. 1 to show typical deviations between GGAs, meta-GGAs, and hybrids.
While the trends vary slightly in the different subsets of GMTKN55, the key trend is that the quality of the fit to exact exchange improves going from SL-GGA to NL-GGA to SL-MGGA to NL-MGGA. Notably, the SL-GGA and NL-GGA suffer from out-of-distribution transferability issues, resulting in larger deviations from PBE0 even compared to the PBE functional on the test set. On the other hand, the ML meta-GGAs are more transferable outside the training data than the ML GGAs, so they match PBE0 more closely than PBE and r2SCAN on the test set. The NL-MGGA is more accurate than the SL-MGGA, achieving an MoM of less than 1 kcal/mol on the test set. The fact that the NL-MGGA outperforms both the NL-GGA and SL-MGGA suggests that the kinetic energy and nonlocal features provide non-redundant information, meaning that nonlocal density features cannot replace the kinetic energy density nor vice versa. As shown in Fig. 2, the same general trends apply for SOL62; the NL-MGGA is the most accurate and transferable ML functional for reproducing PBE0, followed by the SL-MGGA and then the NL-GGA.
Because GMTKN55 contains a variety of properties and systems, the performance of the functionals is somewhat dependent on the choice of error metric. However, the general trends are the same as for the MoM metric. See Supplemental Material Section S2.1 for a discussion of the impact of error metric on the results.
III.2 Impact of Nonlocality and Derivative Discontinuity on Model Accuracy
Overall, the SL-MGGA model is more accurate than the NL-GGA model across the databases tested here. However, the fact that NL-GGA improves over SL-GGA and that NL-MGGA improves over SL-MGGA suggests that the nonlocal features contribute useful, non-redundant information about the density for exchange energy prediction. This raises the question: What types of systems require the nonlocal features, and which require the semilocal orbital dependence of the kinetic energy density? We hypothesize that the derivative discontinuity of the exchange functional [71] plays a key role. Because only the kinetic energy density—and not the nonlocal features of the NL-GGA—includes a derivative discontinuity passing through integer electron number, the NL-GGA (as well as any smooth, pure density functional) cannot be expected to accurately capture energy differences involving a change in electron number, such as ionization energies and electron affinities. On the other hand, existing semilocal meta-GGAs are fundamentally limited by their locality and cannot be expected to improve over GGAs for energy differences between systems of like electron number. This is especially true of one-electron systems, for which the kinetic energy is simply determined by the density gradient.
To explore this hypothesis, we set up two simple sets of systems. The first set is one-electron systems for different arrangements of protons: the hydrogen atom, the \cfH2+ molecular ion, the linear \cfH3^2+ molecular ion, and the triangular \cfH3^2+ molecular ion, all with bond lengths of 0.7 Å, as illustrated in Fig. 3. Because each system contains only one electron with different degrees of delocalization, we expect that only the functionals with nonlocal features can accurately predict energy differences between these systems. Table 1 shows the MAD between the ML exchange functionals and Hartree-Fock for the energy differences between the hydrogen atom and the molecular ions. As expected, the SL-GGA and SL-MGGA are both less accurate than NL-GGA and NL-MGGA. This confirms that the nonlocal features successfully characterize single-electron delocalization in a way that is not feasible with semilocal functionals.
The second set of systems consists of the \cfHe+ ion, the \cfHe singlet (closed-shell ground state), and the \cfHe triplet (open-shell excited spin state with two same-spin electrons). Because the exchange energy has a discontinuous derivative as the number of electrons of a given spin passes through an integer [72, 71], the energy differences between \cfHe+ and \cfHe triplet and between \cfHe singlet and \cfHe triplet can only be described accurately by functionals that capture this derivative discontinuity. As shown in Table 1, the absolute deviations from Hartree Fock for the energy differences between these three systems are larger for SL-GGA and NL-GGA than for SL-MGGA and NL-MGGA. This indicates that the derivative discontinuity included in the kinetic energy density is essential for capturing energy differences related to electron addition/removal. Notably, the NL-MGGA is the most accurate of the four functional types for both datasets.
These observations regarding the role of the derivative discontinuity explain why a previous attempt at machine learning a GGA resulted in more accurate atomization energies but less accurate ionization potentials than PBE [31]. Likewise, the role of nonlocality explains why Kovács et al. [13] found that existing meta-GGAs, as well as 25 meta-GGAs they empirically trained, suffer from a trade-off between cohesive energy accuracy and band gap accuracy. Meta-GGAs are limited in their ability to describe both these properties at once due their semi-locality. In order to learn a nonlocal, orbital-dependent quantity like the exchange energy, an ML functional must contain both orbital dependence and nonlocality in its feature set.
III.3 Fine-tuning for Numerical Stability
Because XC functionals are used within SCF calculations, they must be numerically stable and consistently achieve electronic convergence on real systems to be useful for practical applications. For the GMTKN55 dataset, we found that some systems do not quite converge to our default convergence thresholds, which were Hartree atomic unit (Eh) for the total energy and Eh for the orbital gradient of the energy. (See Appendix A.9 for more details on how we remedied convergence problems.) We found that these calculations could only be converged with a Eh energy threshold and Eh gradient threshold. Interestingly, this problem is least severe for the NL-MGGA, as illustrated in Table 2, which shows the number of systems for each functional type which did not converge to the Eh energy and Eh gradient thresholds. We believe this is because the increased model dimensionality of the NL-MGGA allows it to fit the exchange energy with a smoother function of the input features.
Because the NL-MGGA still sees some convergence problems, we decided to explore two modifications of the NL-MGGA for improved numerical stability. The first is the NL-MGGA-PBE, which has the same hyperparameters and feature shape as the NL-MGGA but uses the PBE functional as a baseline rather than the Chachiyo functional [73]. NL-MGGA-PBE was screened during the validation procedure (Appendix A.8) but was found to be slightly less accurate than the NL-MGGA with the Chachiyo baseline. However, only 1 system sees a convergence issue in Table 2 with NL-MGGA-PBE, compared to 3 with NL-MGGA. We also tried training an NL-MGGA model with the PBE baseline and a more chemically diverse subset of GMTKN55, as described in Appendix A.5. We refer to this new train/validation/test partition as DTR (for “diverse training”), and it yields the NL-MGGA-DTR, for which all GMTKN55 systems achieve an orbital gradient convergence of Eh or lower. In fact, all but 10 of the 2462 GMTKN55 systems converge to within Eh for energy and Eh for orbital gradient with NL-MGGA-DTR, and the rest to within and Eh, respectively.
As shown in Fig. 4, which displays MoM errors of PBE0/CIDER versus PBE0, the accuracy of these three NL-MGGA variants is similar across the GMTKN55 dataset. The three functional variants also have nearly identical performance for the SOL62 cohesive energies, with MADs compared to PBE0 of 0.13 eV/atom for all three. Therefore, for the remainder of this work, we adopt the NL-MGGA-DTR model as our primary functional of choice for applications, and we name it CIDER23X-NL-MGGA-DTR.
III.4 Obtaining Hybrid DFT Accuracy
In this section, we show that by substituting NL-MGGA-DTR for exact exchange in hybrid functionals, accurate predictions of the XC energy can be obtained for molecular systems without additional parameter tuning. In particular, we look at PBE0/CIDER (Eq. 97) and PW6B95/CIDER. The PW6B95 functional [74] is
[TABLE]
where is a meta-GGA term. Therefore, PW6B95/CIDER is
[TABLE]
For PBE0/NL-MGGA-DTR, PW6B95/NL-MGGA-DTR, and several semilocal [23, 75, 70, 76] and hybrid [27, 68, 74, 28, 77] functionals, Fig. 5 shows the MoM error on all of GMTKN55 as well as the mean absolute error (MAE) on the test partition used for NL-MGGA-DTR. The errors are relative to the high-accuracy benchmark data collected in the GMTKN55 database [10]. All functionals include D4 dispersion corrections [69] except for B97M-V [76], B97X-V [28], and B97M-V [77], which all use VV10 dispersion [53]. PBE0/CIDER calculations use PBE0 dispersion parameters, and PW6B95/CIDER calculations use PW6B95 dispersion parameters. We show conventional semilocal functionals in shades of blue, conventional hybrid functionals in orange, and CIDER functionals in green.
As Fig. 5 shows, the PBE0/NL-MGGA-DTR functional is comparable in accuracy to r2SCAN, but the combination of systematic errors in the NL-MGGA-DTR functional and in PBE0 itself prevent it from outperforming the most accurate meta-GGA (B97M-V) or any of the hybrid functionals. On the other hand, because PW6B95 is much more accurate than PBE0 on the GMTKN55 dataset, PW6B95/NL-MGGA-DTR is more accurate than all semilocal functionals explored here. It is also more accurate than the PBE0 and B3LYP [27] hybrid functionals and equivalent in accuracy to the range-separated hybrid B97X-V [28]. This shows that the NL-MGGA-DTR exchange functional can be used to calculate molecular reaction energies with an accuracy that was previously not feasible without exact exchange mixing.
As with Section III.1, the analysis of the errors on GMTKN55 depends somewhat on the error metric used. These nuances are discussed in further detail in Supplemental Material Section S2.2, which also provides a comparison between the nonlocal and semilocal meta-GGA models.
III.5 Using CIDER for Improved Band Gap Prediction
The band gap problem [78, 79], in which semilocal DFT tends to drastically underestimate the band gaps of solids, stands in the way of a variety of important applications like high-throughput screening for optical and electronic properties and semiconductor point defect physics [5, 6, 80]. By introducing a derivative discontinuity into the XC functional, meta-GGAs partly mitigate this problem [81], but they still tend to significantly underestimate band gaps [82]. Exceptions include mBJ [83] and TASK [12], which each have important drawbacks. The mBJ functional provides the exchange potential only, not the energy [83], and TASK does not predict atomization energies and lattice constants as well as SCAN [12, 13, 14]. This reinforces the discussion in Section III.2 regarding the accuracy trade-offs meta-GGAs must make due to their semi-locality.
By introducing nonlocality into the exchange energy, PBE0/NL-MGGA-DTR (Eq. 97 or Eq. 98 with ) improves band gap predictions over PBE and r2SCAN, as shown in Table 3. This improvement occurs even though the CIDER functionals are not fit to solid-state band gaps. The PBE0/SL-MGGA also improves band gap predictions over r2SCAN, but significantly less so than PBE0/NL-MGGA-DTR. The database used for testing covers 255 of the 472 band gaps in the database published by Borlido et al. [42], with La, Yb, and Th-containing systems being excluded as well as systems with a greater than eV difference between the spin-orbit-corrected and non-spin-orbit-corrected band gaps (as computed with PBE by Borlido et al. [42]). See Appendix A.11 for details, and see Supplemental Material Section S3 for error metrics for all 453 solids in the database that do not contain La, Yb, or Th.
To provide a clearer picture of the improved band gap prediction provided by the NL-MGGA-DTR functional, Table 4 shows the band gap predictions for a selected subset of systems that are commonly studied [11]. In addition, Figs. 7 and 7 contain band structures for silicon (Si) and boron phosphide (BP), respectively, which show that PBE0/NL-MGGA-DTR provides similar band structures to PBE and r2SCAN but with a wider band gap.
There is room for improvement in these band gaps, since PBE0/NL-MGGA-DTR still systematically underestimates band gaps compared to the state-of-the-art HSE06 [84, 85] range-separated hybrid functional, as well as the PBE0 functional on which PBE0/CIDER is built. One possible solution is to tune the fraction of CIDER exchange, which is already done with exact exchange within hybrid DFT to optimize functional accuracy for different applications [86]. As shown in Table 3, PBE0()/NL-MGGA-DTR with has increased band gaps over and similar error statistics to HSE06. However, one drawback of this approach is that because NL-MGGA-DTR increases band gaps less than exact exchange for a given mixing fraction, one might need a large fraction of NL-MGGA-DTR to fit band gaps. This large fraction might not provide accurate predictions of other material properties. Alternative approaches to improve CIDER band gaps include using physical intuition and exact constraints to tune the derivative discontinuity contribution from the kinetic energy density (as is done in the TASK functional [12]), introducing more nonlocal features to improve the accuracy of the functional in general, and training on fundamental band gaps (i.e. differences between ionization potentials and electron affinities). However, considering the drastic increase in computational cost associated with using HSE06 or PBE0 compared to semilocal DFT, the ability to mitigate the underestimation of band gaps without explicitly fitting them—and while accurately predicting other properties like bond energies—already serves as a major step toward solving the band gap problem.
III.6 Benchmarking Performance on Large Condensed Matter Systems
The key benefit of the nonlocal features used to learn the exchange functional is that the quasi-linear-scaling algorithm used to compute them within plane-wave DFT (Sections II.4 and II.6) is over an order of magnitude faster than the evaluation of exact exchange for large systems and does not depend on the number of k-points. To illustrate this efficiency, Fig. 8a shows the wall time to perform a ground-state SCF calculation for a 216-atom diamond supercell with semilocal and PBE0/CIDER functionals in GPAW [60, 61]. These wall times are compared to those of the hybrid DFT implementation in JDFTx [87], which uses the ACE algorithm [17]. The CIDER functionals, including those with nonlocal features, all take less than 170 seconds (compared to 60 seconds for PBE and 79 seconds for r2SCAN). When run on the same number of processors, PBE0 is roughly 700 times slower than the NL-MGGA-DTR functional. Even the GPU-accelerated evaluation of PBE0 is about 40 times slower than NL-MGGA-DTR.
The efficiency of CIDER functionals for large systems makes it practical to perform more complicated calculations like defect formation energies. Figure 8b shows the wall time for computing the neutral vacancy formation energies in diamond and silicon using 216-atom supercells for semilocal and PBE0/CIDER functionals. All of the functionals are of comparable computational cost, with the NL-MGGA-DTR model being only 25% slower than r2SCAN for diamond and 50% slower for silicon. This small decrease in efficiency comes with a significant benefit: As shown in Table 5, of the functionals tested, the NL-MGGA model and its variants give the closest match to PBE0 for predicting these vacancy formation energies, with PBE0/NL-MGGA-DTR giving deviations of 0.18 eV and 0.23 eV for diamond for silicon, respectively. The HSE06 and PBE0 formation energies were computed using VASP [88, 89, 90]. See Appendix A.12 for further calculation details.
III.7 Charged Defect Transition Levels in Silicon
One of the consequences of the band gap problem is that transition levels in charged defects are not accurately described by semilocal DFT, so hybrid DFT is typically used when accurate transition level predictions are needed. In this section, we show that by improving band gap prediction, CIDER is able to accurately predict defect transition levels in silicon. The transition levels associated with the single vacancy [92], as well as the boron [93], phosphorus [94], copper [95], and sulfur [96, 97] substitutionals, were computed as described in Appendix A.13, using a 512-atom supercell, -centered k-point mesh, and full structural relaxation for each defect and functional.
Figure 9 shows the computed transition levels compared to experiment with two different finite-size correction schemes: the Freysoldt finite-size corrections [91] (Fig. 9a) and the potential alignment correction (Fig. 9b). We computed the transition levels with PBE and PBE0(0.3)/NL-MGGA-DTR (Eq. 98 with ). The latter functional was chosen because its band gap matches the 1.17 eV experimental gap; however, PBE0/NL-MGGA-DTR yields similar results (see Supplemental Material Section S7). Supplemental Table S13 provides a more detailed list of all computed transition levels. In the rest of this section, PBE0(0.3)/NL-MGGA-DTR is referred to simply as CIDER for brevity.
PBE and CIDER both give reasonable descriptions of the ++/0 transition level in the silicon vacancy and the boron acceptor level. However, the phosphorus donor level is not well-described by PBE due to its severe underestimation of the band gap; CIDER remedies the band gap underestimation and therefore predicts a transition level in good agreement with experiment. To correct for band gap underestimation, one could predict defect levels from PBE by scaling the gap (and with it, the transition levels) to the experimental gap, but this makes it difficult to distinguish true shallow levels from deep levels that happen to sit near the (too-low) PBE conduction band edge [5]. For example, PBE predicts a transition level of 0.66 eV for the -/– transition of (Fig. 9a), 0.05 eV above the conduction band minimum. Without further information, it is unclear whether this state is resonant in the conduction band or localized in the gap. CIDER, on the other hand, places the -/– level inside the gap at 0.97 eV, in good agreement with the 1.00 eV level from experiment.
More generally, Fig. 9a shows that paired with the Freysoldt correction, CIDER matches each level in the figure with an error of 0.10 eV or less compared to experiment, with the exception of the +/0 level in . For this transition, the experimental level is 0.20 eV [95], while the CIDER level is 0.02 eV with the Freysoldt correction and 0.07 eV with the potential alignment correction only. The HSE06 level was previously reported to be 0.21 eV by Sharan et al. [98], in good agreement with experiment. However, Sharan et al. used a 64-atom supercell with a Monkhorst-Pack k-point mesh. This supercell size is relatively small, and the k-point mesh is both small and fails to sample the band edges, which can interfere with the description of relatively delocalized defect states. When we recalculated the CIDER transition level using Sharan et al.’s supercell size and k-point mesh, we obtained 0.18 eV with Freysoldt corrections, in good agreement with their result and with experiment. In the Supplemental Material (Section S7.1), we provide further evidence that the HSE06 level itself is not converged with respect to k-point mesh. This finding illustrates the importance of using well-converged supercell sizes and k-point meshes in point defect calculations, a requirement that is made much easier by the efficiency of the CIDER functionals compared to hybrid DFT. Why this particular transition level is underestimated by both hybrid DFT and CIDER is an interesting question warranting future investigation.
In addition to yielding more accurate transition levels, CIDER predicts somewhat different charge distributions than PBE as well. Figure 10 shows the charge density differences between charge states for several of the transition levels in Fig. 9. There are a few notable takeaways from these plots. First, as expected, the charge density differences are larger for transitions that result in large ionic displacements (as indicated by the MaxDisp in the bottom of each plot). Second, these ionic displacements and resulting charge distortions are noticeable more than 5 Å away from the defect site for some transitions, indicating the importance of large supercells. Lastly, a slightly larger fraction of the charge associated with the transition is contained within a 5 Å sphere around the defect site for CIDER than for PBE (as indicated by the Loc. metric in the bottom of each plot), suggesting that CIDER localizes the defect charge more effectively than PBE. This is reassuring because unphysical charge delocalization is one of the key driving factors of erroneous charge transition level predictions in semiconductor point defects [5].
In summary, our results indicate that the CIDER approach could be a route toward high-accuracy defect calculations at similar cost to semilocal DFT, which would significantly broaden the potential scope of point defects research. In particular, it could be used to improve the accuracy of high-throughput point defects studies [99], which are currently only computationally feasible with semilocal functionals like PBE.
IV Conclusion
We have demonstrated that by combining semi-local orbital-dependent and non-local density-dependent features, it is possible to design functionals with comparable accuracy to hybrid DFT at a drastically reduced computational cost for plane-wave DFT calculations. We can do this by learning exact exchange energies, without the use of any high-accuracy reference data from experiments or wave function theory. We have shown that our model has higher accuracy than semi-local functionals on both molecular benchmarks and solid-state band gap prediction, and that these improvements can assist with applications such as semiconductor point defect calculations.
The developments presented here serve a variety of purposes. As is, the exchange functional we designed can be used as an exact exchange surrogate in scenarios where hybrid DFT is impractical, such as large extended systems. This expands the scope of calculations that can be performed at hybrid DFT accuracy. In addition, the techniques we have introduced to learn smooth, physically constrained functionals and implement efficient nonlocal features could be used in other contexts, such as orbital-free DFT [100] and van der Waals functional design [52, 63, 101].
Importantly, the ability to systematically vary the descriptor complexity in our machine learning approach allows us to extract physically nontrivial information about the nature of the exchange interactions. Specifically, we find that both nonlocality and derivative discontinuity are independently important for the structure of the exchange energy density functional. This clarifies a key limitation of existing semilocal functionals.
Lastly, our approach can be generalized to learn the full XC functional, providing a path to highly transferable and universal functionals capable of describing challenging and heterogeneous chemistry. We anticipate that these developments will contribute to significantly improving the cost-accuracy trade-off of DFT, accelerating the pace and efficacy of materials and chemical research by providing a stronger theoretical backing in various sub-fields.
Appendix A Computational Details
A.1 Overview
To explore the effectiveness of different features for learning the exchange energy, four types of functional were trained, with the model and feature vector details provided in Section II.3:
- •
SL-GGA: A semilocal GGA model based on the reduced squared gradient (Eq. 7).
- •
NL-GGA: A nonlocal model based on and three nonlocal features with Eqs. 18 and 19 for the exponents.
- •
SL-MGGA: A semilocal meta-GGA model based on and the iso-orbital indicator (Eq. 9).
- •
NL-MGGA: A nonlocal model based on , , and three nonlocal features with Eqs. 20 and 21 for the exponents.
The purpose of training semilocal GGA and meta-GGA models is to test the expressive power of the nonlocal descriptors. In our previous work (see Table S1 of Ref. [35]), we found that a standard, semilocal meta-GGA could not effectively learn the exchange energy density, but in this work, we determined that a semilocal meta-GGA can learn the total exchange energy much more effectively than it can learn the exchange energy density.
The two primary datasets used in this work were the GMTKN55 [10] and SOL62 [67, 33] databases. The GMTKN55 database contains 1505 molecular reaction energies based on 2462 single-point calculations, covering five categories: small-molecule properties, larger molecule reactions and isomerization, barrier heights, intermolecular noncovalent interactions, and intramolecular noncovalent interactions. The SOL62 dataset consists of 42 non-metals and 20 metals as categorized by Zhang et al. [67]. We follow Trepte and Voss [33] in excluding Pb and Th from Zhang et al.’s initial set of 64 solids. Twelve subsets of GMTKN55 and half of SOL62, along with a few other systems, were used for training and validation, as detailed in Section A.4.
Throughout this work, we used pymatgen [102] and ase [103] for structure manipulation and calculation setup, fireworks [104] for automated workflow management, and Scikit-learn [105] for the Gaussian process models. The computational details for the PySCF and GPAW calculations are provided in Appendices A.2 and A.3, respectively. Appendices A.4 and A.5 describe the training and validation set selection, and Appendix A.6 covers the selection of control points for the sparse Gaussian processes. Appendix A.7 describes the selection of noise hyperparameters and training/validation loss functions, and Appendix A.8 describes the list of hyperparameters tested for each model and the model selection procedure. Appendix A.9 describes how we handled convergence issues for CIDER models on GMTKN55, and Appendix A.10 describes the PySCF and GPAW settings that are specific to the CIDER functionals. Appendix A.11 describes the band gap benchmark methods. Appendix A.12 describes the computational cost benchmarking of CIDER for plane-wave DFT calculations. Finally, Appendix A.13 describes our methodology for computing charged defect transition levels.
A.2 Molecular Calculations and Reference Data Generation
Molecular systems, including the nanoclusters, were computed using the PySCF code [54, 55]. All calculations were performed using the def2-QZVPPD basis set (with effective core potentials for atoms larger than Kr) [106, 107] for the atomic orbitals and the def2-universal-jkfit auxiliary basis set [108] for the classical Coulomb energy. The seminumerical exchange (SGX) module in PySCF, which is similar to the COSX algorithm [109], was used to compute the exchange energy for hybrid functionals. For SGX, level 1 PySCF grids were used with P-junction screening turned off. For a given sub-database in GMTKN55, restricted DFT was used if all systems in that sub-database had ; otherwise, unrestricted DFT was used. The exception was the \cfPCl_3 transition state in the INV24 database, for which spin symmetry breaking occurs for some functionals. This system was calculated with unrestricted DFT as well. For SCF calculations, the nonlocal features and XC energy were computed on level 3 PySCF grids.
For benchmarking comparison, the GMTKN55 database was evaluated with the above settings using the PBE [23], revPBE [75], r2SCAN [70], PBE0 [68], B3LYP [27], and PW6B95 [74] functionals, all with D4 dispersion corrections [69], as well as with the B97M-V [76], B97X-V [28], and B97M-V [77] functionals.
To collect the exact exchange energy training data, the ground-state orbitals from the PBE calculations were used to compute the exact exchange energy, and the feature vectors were computed from the corresponding PBE densities. The features were computed on level 1 PySCF grids. For training set feature generation only (not the SCF calculations), the integral in Eq. 16 was computed numerically with the coordinate evaluated on level 3 PySCF grids without pruning. This dense grid is necessary for numerical stability and precision in the core region.
A.3 Solid-State Reference Data Generation
The solid-state densities and ground-state energies of the SOL62 dataset were computed in GPAW [60, 61] using the PBE geometries obtained by Trepte and Voss [33], with an energy cutoff of 1000 eV and the same k-point meshes as Trepte and Voss. The parameter in GPAW was set to the minimum value required to avoid aliasing of the density, i.e. 0.096 Å. Fermi-Dirac smearing with a width of 0.01 eV was used for periodic systems. For isolated atoms, fixed occupations were used unless numerical instabilities occurred as a result, which was the case for six atoms (Pb-2, Pd-2, Pt-2, Sn-2, Ta-5, and V-5, where the number is the ground-state magnetic moment ). In these cases, Fermi-Dirac smearing with a width of 0.002 eV was used. The default Davidson solver was used for all solids except the alkaline earth metals, which were solved with the conjugate gradient (CG) method. The CG method was used for all isolated atomic systems. To compute cohesive energies for training, the experimental ground state magnetic moments were used for the isolated atom references. The convergence threshold was set to eV per valence electron for both isolated atoms and periodic systems. The density convergence threshold was set to per electron for periodic systems and per electron for isolated atoms. The eigenstates convergence threshold was set to eV2/electron for periodic systems and eV2/electron for isolated atoms. The cell size for the isolated atoms was chosen to be about 11.7 Å across, with perturbations included to break symmetry. Compared to a larger box size of 17.5 Å, the PBE atomic energies had a root mean square deviation of 0.012 eV and maximum deviation of 0.080 eV (for Zr). Because of the small average deviation, and the fact that the train and test reference values were computed on the atom-in-a-box systems rather than determined by experiment, the smaller boxes were used to lower the computational cost and memory requirements of training and testing. Symmetry constraints were also turned off for the isolated atom calculations.
Training the CIDER models to the solid-state data required computing the exact exchange energy and the nonlocal features for each system. To obtain exact exchange energies, the hybrid module of GPAW was used to compute the non-self-consistent exact exchange energy (EXX) from the PBE orbitals obtained in the above calculations. However, because the k-point meshes above are too dense for evaluation of EXX to be computationally feasible, the periodic systems were first recomputed with a coarser k-point mesh (for each system, the smallest even, -centered mesh with a linear k-point density of at least 4.5 k-points per Å*-1*), and the EXX was computed on this mesh. All nonlocal density features were evaluated on the density and orbitals obtained from the denser k-point mesh. They were evaluated with the internal CIDER settings and (see Section A.10 for details).
The PBE0 cohesive energies were computed using the same procedure above, with the PBE0 energies being defined as
[TABLE]
with dense denoting the calculation done with the dense k-point mesh and coarse denoting the coarser k-point mesh discussed above.
A.4 Construction of Training and Validation Sets
The total exchange energy of a molecule is highly basis-set and code-dependent, and it is dominated by the core contributions for large atoms. In practice, only relative exchange energies between systems is relevant for most chemistry and materials science applications. Therefore, rather than training the model to match total exchange energies, we trained to match the exchange energy differences between systems. For example, the target and predicted values for the atomization energy of water would be
[TABLE]
where and are the exact and model exchange energies for density distribution , respectively.
The training and validation sets were constructed from four sources:
Molecules: The W4-11, BH76, BH76RC, MB16-43, S22, G21IP, PA26, RG18, ACONF, ALKBDE10, HEAVYSB11, and SIE4x4 subsets from the GMTKN55 database were used for training and validation. Each set was randomly split into two-thirds training and one-third validation, except for the BH76 and BH76RC subsets. Because the data points in these subsets are based on the same systems, and also because they include the forward and backward reactions for the same chemical equation, a different splitting procedure was chosen to maintain train-validation independence. First, the 34 unique transition states in BH76 were identified and split half-and-half into the training and validation sets. All 40 reactions with the training transition states were used for training, and all 36 reactions with the validation transition states were used for validation. All 17 BH76RC reactions with systems in the BH76 training partition were used for training, and the other 13 reactions were used for validation. All data not in the training or validation set was classified as the GMTKN55 test set.
Bulk solids: Half of the SOL62 cohesive energy database (11 randomly selected elemental metals and 20 randomly selected non-metals) was used for training and validation. Nine systems (C-diamond, Li, GaAs, MgO, LiF, ZrC, TiC, Pd, and Pt) were hand-selected for the training set to cover a range of chemical space, while the other 22 systems were used for validation. All data not in the training or validation set was classified as the SOL62 test set.
Nanoclusters: Six nanocluster atomization energies were added to the training set as a “bridge” between the isolated and extended systems in the study. These were \cfCa13 and \cfZn26, with the initial structure obtained based on the Materials Project [110] lattice constant and structure optimization at the PBE/def2-TZVP level in Orca [111, 112, 113]; \cfC35H36 (hydrogen-capped diamond), obtained from Refs. [114, 115] and also structurally optimized in Orca at the PBE/def2-TZVP level; and \cfNa4Cl4, \cf[Na13Cl14]-, and \cf[Na14Cl13]+, all using the Materials Project lattice constant without further structure optimization.
Atoms: Absolute and relative energies of the isolated atoms H-Kr and a few excited spin states were included in the training set. Specifically, the training points consisted of the absolute energies of the noble gases He, Ne, Ar, and Kr; the relative energies of the elements with nuclear charge and for , and the relative energies of the following ground state/excited state spin pairs: Sc 1/3, Ti 2/4, V 3/5, Cr 6/4, and Ni 2/0. The preceding numbers correspond to the magnetic moment .
In addition to fitting to the above training data, the uniform electron gas constraint was enforced as described in Section II.3.3, and the uniform scaling constraint was enforced by the scale-invariance of the input features. The ground-state total energies of the SOL62 solids and isolated atoms were computed with GPAW as described in Section A.3. The ground-state energies for all other training data were evaluated with PySCF as described in Section A.2. The selection of train and validation databases from GMTKN55 includes a wide variety of chemistries and properties. However, because training data is sampled disproportionately from small-molecule properties, there is a large distribution shift between the train/validation and test sets, with the test sets containing more large molecules, isomerization reactions, and barrier heights. The use of this train/test partition therefore serves as a strong test of transferability. The accuracy on SOL62 also strongly reflects the transferability of a model because only 9 solid-state systems were included in the training set, and SOL62 covers a wide variety of chemistries.
A.5 A More Diverse Training Set for Numerical Stability
To design a more diverse training set for the NL-MGGA-DTR functional, a kernel function was defined for the “covariance” between two training points in GMTKN55:
[TABLE]
where is the vector of errors for 82 functionals benchmarked on the GMTKN55 database by Goerigk et al. [10] with D3(BJ) corrections [116, 117, 118], normalized so that the mean square error across all functionals for that individual data point is 1. This kernel essentially provides a larger similarity score between data points for which the (normalized) errors tend to be similar for a given functional. The kernel matrix was constructed for the entire GMTKN55 database and then factorized using pivoted Cholesky decomposition (Eq. 105), and then the first 225 pivots were used for the training set, the next 75 pivots for the validation set, and the remaining 1205 pivots for the test set. This procedure constructs a training set for which the typical errors of conventional functionals are representative of the entire dataset. With this dataset partitioning, more than 99% of the GMTKN55 test set data points contain at least one system not present in the train or validation sets, so the test set is still sufficiently independent to assess the transferability of the functionals to new systems. All training points not contained in GMTKN55 were kept the same as in Section A.4.
A.6 Control Point Selection for Sparse Gaussian processes
In Eqs. 33 and 34, a set of control points is required to evaluate the Gaussian process covariance kernel in the sparse approximation. These control points were initially selected by randomly sampling feature vectors from the PySCF grids for the training set (Appendix A.2). All training data was sampled except the SOL62 data, since the features for these systems were evaluated in GPAW.
Due to the size of the numerical grids from which the features were sampled, there were a large number of potential control points for the sparse Gaussian process, many of which were nearly linearly dependent in the implicit feature space of the covariance kernel function. To reduce the size of the control point set, a Cholesky decomposition with pivoting [119] was performed on an initial sample size of roughly feature vector samples:
[TABLE]
where is a permutation matrix that orders the sampled feature vectors in decreasing order of their variance conditioned on the preceding feature vectors. We performed the decomposition with a tolerance of and only kept control points corresponding to significant pivot indices, thereby reducing the number of control points from roughly to .
A.7 Selection of Noise Hyperparameters, Training Loss, and Validation Loss
Chemical and material properties have a wide range of magnitudes, from less than 1 kcal/mol for some van der Waals binding energies to greater than 100 kcal/mol for many covalent bond dissociation energies. Correspondingly, different magnitudes of XC errors are expected and acceptable for these properties; an error of 1 kcal/mol might make a van der Waals binding energy prediction useless but be acceptable for an atomization energy. To train the exchange functionals, it was therefore necessary to weight the noise hyperparameter of each training point to reflect the feasible and desired precision for the property of interest. To do so, we took the MAD of four hybrid functionals (PBE0, B3LYP, PW6B95, and B97X-V) on each sub-database (denoted , where DB is the sub-database of interest), and then created characteristic uncertainties for each sub-database as follows (in Hartree atomic units):
[TABLE]
where W4-11 is a small-molecule atomization energy sub-database in GMTKN55 and Eh. One drawback of the above approach is that it favors high accuracy on systems that are already described well by conventional, dispersion-corrected DFT (like simple van der Waals dimers) and assumes a large uncertainty on systems like self-interaction-dominated bond energies, for which even hybrid functionals with small exchange mixing perform poorly. To remedy this, the noise parameter was augmented with a constant weighting for every system:
[TABLE]
With this definition, the matrix in Eq. 39 is a diagonal matrix with being the for the database to which training point belongs. Constructing the Gaussian process for this noise parameter is equivalent to regression with the loss function
[TABLE]
where MSD stands for mean squared deviation and is the number of points in the subdatabase. This weighting ensures that training points with small expected errors are appropriately weighted while those with larger expected errors for conventional functionals do not become under-weighted. For systems outside the GMTKN55 database that were in the training set, was set by hand.
A.8 Hyperparameter Selection and Validation Procedure
The models tested in this work have several discrete and continuous hyperparameters to be optimized, which are listed below.
- •
The covariance prefactor hyperparameter in eqs 52–55.
- •
The kernel length-scale (, where is the feature index) in eq 56.
- •
For the nonlocal functionals only, the feature exponent parameters and in Eqs. 18–21.
- •
The baseline exchange functional on top of which the CIDER model is trained via -learning. We tested PBE [23] and the Chachiyo GGA [73].
It would be quite challenging and expensive to implement differentiation of the marginal likelihood with respect to the first two parameters, and nearly impossible for the third since backpropagation would need to be performed through the feature evaluation step. Therefore, we used various combinations of hyperparameters to train the models, and then for each functional type, we selected the best-performing model on the validation set as the final candidate functional.
To obtain a discrete set of exponent hyperparameters to test for the nonlocal functionals, we developed a reduced number of parameters to specify the feature exponents. Similarly to the previous CIDER work [35], a control parameter tunes and simultaneously, in two different schemes. The first scheme (S1) is
[TABLE]
The second scheme (S2) provides a different mixture of density-based and gradient/kinetic energy-based terms:
[TABLE]
In both schemes, the other constants are determined by and :
[TABLE]
where is an additional parameter to tune the and coefficients separately from the other coefficients. With these schemes, all 8 parameters that define the three-feature CIDER model were reduced to two parameters per scheme. For each scheme, we performed a hyperparameter search over the grid and , resulting in a total of 50 feature constructions to test.
Because there were already 50 features from which to select the nonlocal functionals, we set the covariance and feature length-scales for these models heuristically:
[TABLE]
where is an expectation value over all the grid points in the exchange energy training set, is the deviation of the exact exchange enhancement factor at a given point from the baseline functional enhancement factor, and and are constants. and were set based on the criteria that the resulting functionals converge consistently in both GTO and plane-wave calculations and exhibit a small total energy deviation between the PAW and all-electron GTO atomization energies for a few small molecules. worked best for both the nonlocal GGA and meta-GGA, while the most practical values of depended on the baseline functional and model type, as shown in Table 6. The meta-GGA models required a smaller covariance prefactor to achieve good stability than the GGA models. We set to be larger by a factor of 20 for functionals with the Chachiyo baseline compared to the PBE baseline because was roughly 20 times larger for PBE than Chachiyo.
For the semilocal functionals, there was no need to validate over the possible feature constructions, leaving more flexibility to test and systematically. For the functionals with Chachiyo baseline, we tested , while for the PBE baseline each was divided by 20 to give a roughly equivalent . For the semilocal functionals, were tested, resulting in 24 functionals to validate. Note that for NL-MGGA-DTR, we determined from validation set SCF calculations that with the PBE baseline resulted in sufficiently stable models, so this larger value was used. In addition, we did not search over feature length-scale for NL-MGGA-DTR, and instead used the features determined to be most accurate for NL-MGGA and NL-MGGA-PBE.
After the functionals were trained, we selected the top 10 functionals of each type and baseline functional, as classified by non-self-consistent performance on the validation set via Eq. 108. To validate these functionals, the GMTKN55 train and validation set reaction energies were predicted using self-consistent calculations with a convergence threshold of Eh and a gradient threshold of Eh. For a given functional, if any calculation failed to converge when starting from both the default initial guess of PySCF and from the PBE orbitals, the functional was deemed insufficiently numerically stable and disqualified from further consideration. The non-self-consistent cohesive energies of the SOL62 validation set were also computed. Of the functionals for which all calculations converged, the validation loss consisted of
[TABLE]
where is the self-consistent mean square deviation on the validation partition of GMTKN55 subset DB, and where is the non-self-consistent mean square deviation of the 22 validation-set cohesive energies per atom. The final functional of each type was selected based on which candidate had the lowest loss for Eq. 115. The validation results and selected hyperparameters for the functionals are provided in the Supporting Information (Section S4).
A.9 Difficult Convergence Cases
The default convergence tolerance for the calculations performed with CIDER functionals was Eh, except for the NL-MGGA-PBE and NL-MGGA-DTR calculations, which were converged to Eh when possible. Some CIDER calculations did not converge to this tolerance within the default PySCF settings. There were two different classes of convergence problems. First, for a few systems, some CIDER calculations did not converge from the PySCF starting guess, but the convergence issues were resolved by starting from PBE ground-state orbitals. Second, for some systems (especially isolated atoms and ions, as well as alkali and alkaline earth-containing systems), a numerical stability issue was observed for some CIDER calculations, in which the energy converged to a reasonable tolerance (less than Eh), but the orbital gradients oscillated around Eh without converging fully. In these cases, the energy tolerance was set to Eh, and the gradient tolerance was set to the (rather large) Eh. All CIDER calculations converged to these thresholds. PBE starting orbitals were also used for these loose-threshold calculations. Even with these looser thresholds, the achieved energy and orbital gradient convergence is still sufficient to compare thermochemical data in the GMTKN55 database.
A.10 CIDER Functional Settings
The CIDER functionals require a few additional settings to specify the precision of the accelerated feature calculations. For the molecular version, these settings are , the maximum spherical harmonic for auxiliary basis expansion; , in Eq. 58; , the even-tempered Gaussian factor for the basis in Eq. 67; and , the maximum value of . The even-tempered Gaussian factor for the basis in Eq. 71 was also set to . We chose , , and , where is the minimum of 36 and the charge of the largest nucleus in the given GMTKN55 sub-database. These settings were found to be in good agreement with numerical integration of the features. The minimum control point was set automatically based on the minimum allowed value of the kernel exponent .
For the plane-wave implementation, only and needed to be specified. A larger was used for each atom on the radial support grids, with being the atomic number.
A.11 Band Gap Calculations
Band gaps for the database of Borlido et al. [42] were computed in GPAW [60, 61] as the difference between the conduction band minimum and valence band maximum for a ground-state SCF calculation. The energy cutoff for these calculations was 520 eV, and the grid spacing was set to Å to avoid any aliasing of the density. Fermi-Dirac k-point smearing was used with a width of 0.01 eV, except for the noble gas systems, for which no k-point smearing was used. The k-point mesh for each system was selected to be the smallest even, -centered mesh with a linear k-point density of at least 6 k-points per Å*-1*. For each system, if the PBE band gap we obtained was more than 0.1 eV larger than the PBE band gap obtained by Borlido et al. [42], we inferred that the k-point mesh was too coarse and reran the band gap calculation with a linear k-point density of 10 k-points per Å*-1*. Using this methodology, we reproduced the PBE band gaps of Borlido et al. with a mean, mean absolute, and maximum error of -0.01, 0.03, amd 0.17 eV, respectively, and the SCAN results with a mean, mean absolute, and maximum error of 0.00, 0.03, and 0.15 eV, respectively.
For PBE0/NL-MGGA-DTR and PBE0(0.35)/NL-MGGA-DTR, a few systems did not converge with the default CIDER PAW corrections, and a denser core integration grid and smaller set of feature projectors (see Appendix B) were used. Even with these settings, one system (\cfReSe2) did not converge with PBE0(0.35)/NL-MGGA-DTR, and it had to be run with SG15 norm-conserving pseudopotentials (NCPP) [120]. For the 255 calculations in the main text, the mean, mean absolute, and maximum deviations between PBE0/NL-MGGA-DTR with NCPP and PAW were 0.03, 0.12, and 1.12 eV, respectively. In addition, the band gap for \cfReSe2 with PBE0/NL-MGGA-DTR was 1.29 eV with PAW and 1.30 eV with NCPP, so the use of pseudopotentials appears reasonable for this particular case. We think that the convergence issues are due to numerical problems with the PAW corrections for nonlocal features, and improving the numerical stability of these corrections will be a subject of future work.
For the band gaps in Table 4, the same procedure was used, except with denser k-point meshes containing 12 k-points per Å*-1*. The HSE06 band gaps were computed using VASP [88], with the k-point mesh set to contain 4000 k-points divided by the number of atoms in the unit cell. This is a coarser mesh than was used for the semilocal and CIDER calculations, but we found that the band gap difference compared to larger k-point densities was less than 0.01 eV.
A.12 Supercell and Neutral Defect Performance Benchmark
The atomic structure of the 215-atom single vacancy cells of diamond and silicon were obtained using VASP [88, 89, 90] with PAW datasets [66, 62], an energy cutoff of 520 eV, the “Accurate” precision setting, and a Monkhorst-Pack k-point mesh [121], starting with the PBE lattice constant obtained in VASP using the same settings with a -centered k-point mesh. Using these structures, SCF ground-state calculations with static atomic positions were performed in GPAW [60, 61] for the 216-atom bulk and 215-atom vacancy cells for each functional. The formation energy of the neutral defect was computed as
[TABLE]
where the bulk and defect energies were computed with 216 and 215-atom supercells, respectively, with a Monkhorst-Pack k-point mesh, 520 eV plane-wave cutoff, 0.134 Å grid spacing, and eV convergence threshold. The time to evaluate the formation energy for each functional was computed as the sum of the time taken for the bulk and vacancy cell calculations, as obtained from the GPAW internal calculation timer. The HSE06 and PBE0 formation energies were obtained using VASP with the same structures, k-point grid, and energy cutoff.
For the -point calculations used in fig 8a, GPAW was also used for the semilocal and ML functionals. For the hybrid DFT calculations, JDFTx [87] was used with the ACE [17] algorithm for exchange, a 19.2 Eh plane-wave cutoff, a Eh convergence threshold, and SG15 norm-conserving pseudopotentials [120].
A.13 Charged Defect Calculations
The charged defect levels were computed with PBE, PBE0/NL-MGGA-DTR, and PBE0(0.3)/NL-MGGA-DTR. The mixing parameter of the last functional was selected to match the zero-temperature experimental band gap of silicon, 1.17 eV. For each functional, the lattice constant, band edges, and band gap were obtained from a unit cell relaxation with 520 eV cutoff, Å, and a k-point mesh in GPAW [60, 61]. Then, a 512-atom supercell was created for each defect, with perturbed structures used for the vacancy to break symmetry. For each charge state and defect, the geometry was relaxed in GPAW using the BFGS with Line Search algorithm to a tolerance of 0.05 eV/Å with a -point only calculation. This geometry was then used as the starting point for the final geometry relaxation, which was performed with a -centered k-point mesh and a force convergence tolerance of 0.01 eV/Å. Because PBE0(0.3)/NL-MGGA-DTR is so similar to PBE0/NL-MGGA-DTR, and because the lattice constants are similar, we saved time for this functional by starting the final relaxation from the fractional coordinates of the PBE0/NL-MGGA-DTR calculations.
The formation energies of the defects were calculated as [6]
[TABLE]
where is the total ground-state energy of the supercell DFT calculation, is defect in charge state , is the Fermi level, and is a finite-size energy correction for the defect. The stoichiometry changes and chemical potentials and were neglected because they do not impact electronic transition levels. We used the Freysoldt finite-size correction-scheme for [91], which combines an electrostatic and potential alignment correction. This correction requires an estimate of the dielectric constant , for which we used the experimental value of 11.9 [122].
Because some of the defects studied in this work are shallow donors and acceptors, and because electrostatic corrections like those developed by Freysoldt et al. [91] and Kumagai et al. [123] assume localized charges, these corrections might be inaccurate. Therefore, for comparison, we also computed the transition levels using the potential alignment correction only, i.e.
[TABLE]
where is the average electrostatic potential in a region far from the defect site. This is equivalent to the full Freysoldt correction with . We used pymatgen [102] to perform the corrections.
Appendix B Construction of Feature Projector Functions for CIDER within PAW
B.1 Construction of Feature Smoothing Projectors
We solve for the coefficients and in Eqs. 82 and 83 by minimizing the loss function
[TABLE]
For the current implementation, , but this could be changed if needed to encourage better smoothness or numerical precision. We start with the following set of functions:
[TABLE]
where and are the sets of positive odd and even integers, respectively. Then we introduce the matrix
[TABLE]
In the above equation, is an matrix,
[TABLE]
where is the number of functions and is the number of control points for kernel interpolation. Similarly, is an matrix,
[TABLE]
where is the number of functions. The constants , , and are regularization constants to reduce the magnitude of the feature augmentation correction in the core region, which helps with numerical stability by preventing the pseudo-feature from being unrealistically large in the core.
Lastly, is an matrix:
[TABLE]
The coefficients can be solved by minimizing the loss Eq. 121, which results in the linear system
[TABLE]
where
[TABLE]
The matrix is block-diagonal in the angular momentum channels due to the orthogonality of spherical harmonics, so each angular momentum channel can be solved separately. This makes the whole routine fairly efficient, and the number of operations per SCF cycle is linear-scaling in the number of atoms. The Fourier transforms of the and functions needed for Eqs. 133 and 134 are obtained using the NUMSBT algorithm [124], and the is decomposed into angular momentum channels via the term obtained in Eq. 80. Also, note that there are a larger number of and indices in the atomic augmentation regions than on the FFT grid. The above fitting procedure is only used for indices going up to the maximum index on the FFT grid. The higher indices (corresponding to larger values of that are only relevant in the core regions) are solved by simply projecting onto the basis, i.e. setting in the subspace of core region-only indices. This procedure is equivalent to assuming that the larger contributions to and are localized in the core region, which is reasonable because these contributions correspond to high densities and short length-scales for .
B.2 Construction of the Feature Augmentation Projectors
The feature augmentation projectors used in Eq. 87 are constructed as
[TABLE]
The on-site function sets and are then computed as
[TABLE]
The construction localizes inside the core region and satisfies the condition that .
Acknowledgements.
The authors thank Jennifer Coulter for helpful guidance on running JDFTx calculations. This work was supported by the National Defense Science and Engineering Graduate (NDSEG) Fellowship Program under contract FA9550-21-F-0003, the Camille and Henry Dreyfus Foundation Grant No. ML-22-075, and the Department of Navy award N00014-20-1-2418 issued by the Office of Naval Research. Computational resources were provided by the Harvard University FAS Division of Science Research Computing Group.
Appendix S1 Table of all GMTKN55 Subset Error Metrics
This section consists of a large table (shown below) with detailed GMTKN55 error metrics for each functional presented in this work. All entries contain the mean absolute deviation (MAD) relative to GMTKN55 reference values, except for the aggregate errors WT2 (WTMAD-2 as defined by Goerigk et al. [10]) and MOM (mean of means). The “DTR tr,” “DTR val,” and “DTR test” labels denote the train/test partition used to train NL-MGGA-DTR on GMTKN55, while “tr/val” and “test” refer to the train/test partition used for all other functionals. All units are in kcal/mol.
[TABLE]
[TABLE]
[TABLE]
Appendix S2 Analysis of GMTKN55 Results with Different Error Averaging Metrics
S2.1 Results Relative to PBE0
As mentioned in Section III.1, the analysis of functional accuracy can be sensitive to the averaging scheme for aggregate errors across the entire GMTKN55 database. In Fig. S1, we plot the average deviations relative to PBE0 of semilocal and ML functionals using two altnerative metrics to the mean of means (MoM) used in Section III.1: mean absolute deviation (MAD, top) and weighted MAD (specifically, the WTMAD-2 metric used by Goerigk et al. [10]). The general trends for these metrics are similar to those for the MoM; the SL-MGGA is more accurate than the SL-GGA and NL-GGA, and the NL-MGGA is more accurate than the SL-MGGA. However, the transferability issues suffered by the SL-GGA and NL-GGA appear even more severe when using the WTMAD-2 metric, and the NL-GGA is actually less accurate than the SL-GGA overall. The WTMAD-2 metric gives larger weight to subdatabases with smaller reaction energy magnitudes (e.g. noncovalent interactions are weighted more heavily than atomization energies), so this worse performance for WTMAD-2 suggests that the SL-GGA and NL-GGA cannot accurately describe the smaller density variations associated with these reactions. Notably, the SL-MGGA suffers a similar problem on the intermolecular NCI database; as a result, the NL-MGGA is the only functional that is closer to PBE0 than r2SCAN for every sub-database, a testament to its robustness.
S2.2 Results Relative to Reference Values
Figure S2 shows the performance of the semilocal, hybrid, and CIDER functionals on the GMTKN55 database with the three different averaging metrics MoM, MAD, and WTMAD-2 discussed in the previous section. The same dispersion corrections are used as in Section III.4. The results with MoM and MAD are very similar, while the performance of CIDER functionals relative to hybrid DFT is notably worse for the WTMAD-2. This again speaks to the difficulty of capturing small relative energies accurately for CIDER functionals, which will need to be a topic of future research. However, it is notable that the SL-MGGA models suffer from this problem more severely than the NL-MGGA-DTR; PW6B95/SL-MGGA has a worse WTMAD-2 than r2SCAN, while PW6B95/NL-MGGA-DTR has a WTMAD-2 that is better than r2SCAN and comparable to B3LYP and PBE0. As shown in Table S1, the NL-MGGA and NL-MGGA-PBE models also significantly outperform the SL-MGGA models, so the worse performance of SL-MGGA is not an artefact of the different training set used for NL-MGGA-DTR. From this analysis, we conclude that introducing a more flexible feature set like that used for the NL-MGGA and its variants results in models that perform better for a variety of systems and therefore are less sensitive to a particular error metric.
Appendix S3 More data on band gaps
Table S2 contains the error metrics for semilocal, PBE0/CIDER, and hybrid functionals for all 453 systems in the database of Borlido et al. [42] that do not contain La, Th, or Yb. Note that for several systems, we used SG15 norm-conserving pseudopotentials [120] as described in Appendix A.11.
Appendix S4 Validation Results and Selected Hyperparameters
Tables S3-S6 show the final validation results for each functional type, ordered by validation loss (Eq. 115) and labeled by internal project code name. Table S7 shows the final hyperparameters for the CIDER functionals presented in the main text. For all of these figures, see Appendix A.8 for hyperparameter definitions.
Appendix S5 Density Accuracy
The density error for a molecule is measured using numerical integrals as
[TABLE]
where and are the real-space integration weights and positions for molecule . The aggregate percentage error for a sub-database is
[TABLE]
Table S8 shows the density deviations from PBE0 for PBE, r2SCAN, and the ML functionals over several subsets of GMTKN55. The NL-MGGA achieves the best match to PBE0 of the ML functionals, and this improvement is transferable to the test set. r2SCAN, however, matches PBE0 more closely than the ML functionals for most subsets. This suggests there might still be some room for improvement in designing ML functionals with accurate densities. However, NL-MGGA yields smaller deviations from PBE0 densities than PBE does for every subset, so the density distributions get closer to those of PBE0 when the ML exchange is added.
Appendix S6 Bond Lengths and Lattice Constants
To assess the accuracy of CIDER functionals for structure prediction, we performed two benchmarks. First, we computed the bond lengths of six diatomic molecules in PySCF (Table S9) and compared the PBE0/CIDER bond length predictions to those of PBE0. For these calculations, the def2-QZVPPD basis set was used without density fitting, level 4 PySCF grids were used, and the internal nonlocal feature evaluation settings were , , , and . The results are shown in Table S9 and suggest that with increasing model complexity, the bond lengths more closely match PBE0. For reference, the atomization energy predictions for these molecules are shown in Table S10 and demonstrate a similar trend. Note that the CIDER models are trained on these systems, so this benchmark serves only as a basic sanity test for CIDER forces, and not as a test of generalizability.
For the second benchmark, we computed the lattice constants of the materials in the SOL62 database [67, 33] in GPAW (with the same settings as were used for the cohesive energy calculations, c.f. Sections A.3 and A.10) and compared the lattice constants to their experimental values. We separated the results into the train, validation, and test partitions to evaluate model generalizability. Also, we only computed the lattice constants for the PBE0/NL-MGGA-DTR CIDER functional since it was the most numerically stable. The results are shown in Table S11. No zero-point vibrational corrections were performed, but the mean absolute lattice constant deviation with and without these corrections for PBE is 0.01 Å, significantly smaller than the typical deviations between the DFT predictions and experiment.
The MAE of PBE0/NL-MGGA-DTR compared to experiment is slightly smaller than that of PBE on the test set, but larger than that of r2SCAN and HSE06. Due to unusually large errors on a few alkali and alkaline earth metals (i.e. K, Rb, and Ba), PBE0/NL-MGGA-DTR also has a slightly larger RMSE than PBE. The results suggest that overall, the ML functional can predict reasonable, GGA-accuracy structures for materials outside the training set, but more work is needed to achieve the accuracy of more advanced functionals like r2SCAN and HSE06 for structure optimization. In the future, we will also need to investigate the cause of NL-MGGA-DTR severely underbinding the larger alkali/alkaline earth metals (even compared to HSE06).
Appendix S7 More Detailed Defect Transition Level Data
This section provides more details on the defect transition level data. The bulk lattice constant and band gap, determined by the methods in Section A.5, are tabulated in Table S12. The defect transition levels themselves are shown in Table S13.
S7.1 The Cu +/0 Transition Level
As discussed in the main text, the copper +1/0 transition level computed with PBE0(0.3)/NL-MGGA-DTR is not converged with the supercell size and k-point mesh used with HSE06 [85] by Sharan et al. [98]. To further illustrate this point, Table S14 shows the +1/0 energy transition level for copper for HSE06 and PBE0(0.3)/NL-MGGA-DTR for several k-point meshes. For this comparison, finite-size effects are ignored, so
[TABLE]
(with being the defect supercell). For consistency, all calculations were performed with a geometry relaxed with HSE06. The lattice constant was optimized with HSE06 with an -centered k-point mesh, and the defect geometries were optimized with HSE06 with the -centered k-point mesh. No restrictions were placed on the magnetic moment for structural relaxation, but for consistency, all static calculations for computing energy level differences were computed using fixed total magnetic moments of for the neutral state and for the +1 state. These are the ground-state spins as determined by Sharan et al. [98], though it is worth noting that for the 512-atom supercell and -centered k-point mesh, the spin state is favored over . The HSE06 calculations were performed with VASP [88], while the CIDER calculations were performed with GPAW [60].
The primary takeaway from Table S14 is that neither the CIDER nor the HSE06 calculations are converged with the k-point meshes. The most significant deviation is between the MP and -centered mesh for HSE06, for which the difference is 0.15 eV. In addition, PBE0(0.3)/NL-MGGA-DTR and HSE06 are in good agreement for all calculations for a given k-point mesh, with deviations of 0.02-0.04 eV.
References
All reference numbers refer to the references in the main text.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Mardirossian and Head-Gordon [2017] N. Mardirossian and M. Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals, Mol. Phys. 115 , 2315 (2017) . · doi ↗
- 2Kohn and Sham [1965] W. Kohn and L. J. Sham, Self-Consistent Equations Including Exchange and Correlation Effects, Phys. Rev. 140 , A 1133 (1965) . · doi ↗
- 3Wellendorff et al. [2015] J. Wellendorff, T. L. Silbaugh, D. Garcia-Pintos, J. K. Nørskov, T. Bligaard, F. Studt, and C. T. Campbell, A benchmark database for adsorption bond energies to transition metal surfaces and comparison to selected DFT functionals, Surf. Sci. 640 , 36 (2015) . · doi ↗
- 4He et al. [2019] Q. He, B. Yu, Z. Li, and Y. Zhao, Density Functional Theory for Battery Materials, ENERGY Environ. Mater. 2 , 264 (2019) . · doi ↗
- 5Lany and Zunger [2008] S. Lany and A. Zunger, Assessment of correction methods for the band-gap problem and for finite-size effects in supercell defect calculations: Case studies for Zn O and Ga As, Phys. Rev. B 78 , 235104 (2008) . · doi ↗
- 6Freysoldt et al. [2014] C. Freysoldt, B. Grabowski, T. Hickel, J. Neugebauer, G. Kresse, A. Janotti, and C. G. Van de Walle, First-principles calculations for point defects in solids, Rev. Mod. Phys. 86 , 253 (2014) . · doi ↗
- 7Perdew [2001] J. P. Perdew, Jacob’s ladder of density functional approximations for the exchange-correlation energy, in AIP Conf. Proc. , Vol. 577 (AIP, 2001) pp. 1–20. · doi ↗
- 8Perdew and Levy [1983] J. P. Perdew and M. Levy, Physical Content of the Exact Kohn-Sham Orbital Energies: Band Gaps and Derivative Discontinuities, Phys. Rev. Lett. 51 , 1884 (1983) . · doi ↗
