Nonlocal Machine-Learned Exchange Functional for Molecules and Solids

Kyle Bystrom; Boris Kozinsky

arXiv:2303.00682·physics.chem-ph·August 19, 2024

Nonlocal Machine-Learned Exchange Functional for Molecules and Solids

Kyle Bystrom, Boris Kozinsky

PDF

Open Access

TL;DR

This paper introduces a machine-learned, nonlocal exchange functional for DFT that achieves hybrid-DFT accuracy at a computational cost comparable to semilocal functionals, benefiting both molecules and solids.

Contribution

It develops a novel machine learning-based exchange functional that is orbital-dependent and nonlocal, offering high accuracy with low computational cost for diverse systems.

Findings

01

Achieves hybrid-DFT accuracy on thermochemical benchmarks.

02

Improves band gap predictions over semilocal DFT.

03

Demonstrates scalability in large supercell calculations.

Abstract

The design of better exchange-correlation functionals for Density Functional Theory (DFT) is a central challenge of modern electronic structure theory. However, current developments are limited by the mathematical form of the functional, with efficient semilocal functionals being inaccurate for many technologically important systems and the more accurate hybrid functionals being too expensive for large solid-state systems due to the use of the exact exchange operator. In this work, we use machine learning combined with exact physical constraints to design an exchange functional that is both orbital-dependent and nonlocal, but which can be evaluated at roughly the cost of semilocal functionals and is significantly faster than hybrid DFT in plane-wave codes. By training functionals with several different feature sets, we elucidate the roles of orbital-dependent and nonlocal features in…

Tables20

Table 1. Table 1: Mean absolute deviation (kcal/mol) of the four types of ML functional for simple 1 and 2-electron systems compared to Hartree-Fock.

Model	1e Systems¹¹1The 1-electron system dataset consists of the energies of \cfH2+, \cfH3^2+-linear, and \cfH3^2+-triangular referenced to the energy of the isolated hydrogen atom. All nearest-neighbor bond lengths are 0.7 Å.	2e Systems²²2The 2-electron system dataset consists of three energy differences between \cfHe+, \cfHe-singlet, and \cfHe-triplet, where \cfHe-triplet is the neutral helium atom excited state with both electrons having the same spin.
SL-GGA	$2.9$	$8.7$
NL-GGA	$2.5$	$5.5$
SL-MGGA	$4.8$	$3.9$
NL-MGGA	$1.1$	$2.7$

Table 2. Table 2: The number of systems in the GMTKN55 database for which a given functional did not achieve SCF convergence with thresholds of 10 − 8 superscript 10 8 10^{-8} Eh for total energy and 10 − 4 superscript 10 4 10^{-4} Eh for the orbital gradient of the energy. All functionals use the PBE0/CIDER form (Eq. 97 ).

Functional	No. Unconverged
SL-GGA	10
NL-GGA	16
SL-MGGA	5
NL-MGGA	3
NL-MGGA-PBE	1
NL-MGGA-DTR	0

Table 3. Table 3: Error statistics for 255 band gaps from the database published by Borlido et al. [ 42 ] , in eV. The ML functionals use the PBE0( α 𝛼 \alpha )/CIDER formalism of Eq. 98 .

	PBE³³3This work.	r²SCAN¹¹footnotemark: 1	SL-MGGA¹¹footnotemark: 1	NL-MGGA-DTR¹¹footnotemark: 1		HSE06⁴⁴4Borlido et al. [42]	PBE0²²footnotemark: 2
			$α = 0.25$	$α = 0.25$	$α = 0.35$
ME⁵⁵5Mean error (eV)	$- 1.14$	$- 0.81$	$- 0.74$	$- 0.61$	$- 0.38$	$- 0.17$	$0.40$
MAE⁶⁶6Mean absolute error (eV)	$1.16$	$0.87$	$0.82$	$0.71$	$0.58$	$0.53$	$0.73$
RMSE⁷⁷7Root mean square error (eV)	$1.50$	$1.18$	$1.10$	$1.02$	$0.86$	$0.80$	$0.90$
MARE⁸⁸8Mean absolute relative error	$0.53$	$0.41$	$0.41$	$0.35$	$0.33$	$0.33$	$0.67$

Table 4. Table 4: Band gaps for a small subset of solids (in eV). CIDER refers to PBE0/NL-MGGA-DTR.

	PBE	r²SCAN	CIDER	HSE06	Expt. [42]
Si	$0.57$	$0.76$	$1.05$	$1.15$	$1.17$
Ge	$0.12$	$0.48$	$0.71$	$0.81$	$0.74$
InP	$0.69$	$1.12$	$1.47$	$1.52$	$1.42$
GaAs	$0.59$	$1.08$	$1.36$	$1.43$	$1.52$
CdSe	$0.71$	$1.19$	$1.48$	$1.69$	$1.74$
BP	$1.25$	$1.42$	$1.71$	$1.98$	$2.10$
GaP	$1.61$	$1.88$	$2.09$	$2.29$	$2.35$
CdS	$1.20$	$1.65$	$2.02$	$2.27$	$2.48$
GaN	$1.88$	$2.33$	$2.65$	$3.18$	$3.50$
ZnS	$2.13$	$2.69$	$3.04$	$3.37$	$3.72$
C	$4.13$	$4.34$	$4.70$	$5.33$	$5.50$
BN	$4.21$	$4.79$	$5.08$	$5.63$	$5.96$
CaO	$3.68$	$4.23$	$4.53$	$5.32$	$6.88$
MgO	$4.73$	$5.63$	$6.11$	$6.42$	$7.67$
NaCl	$5.11$	$5.90$	$6.32$	$6.45$	$8.75$
LiF	$9.07$	$10.05$	$10.40$	$11.39$	$13.60$
Ar	$8.71$	$9.61$	$9.71$	$10.36$	$14.15$
ME	$- 1.93$	$- 1.42$	$- 1.11$	$- 0.74$	-
MAE	$1.93$	$1.42$	$1.11$	$0.76$	-
RMSE	$2.41$	$1.88$	$1.66$	$1.31$	-
MARE	$0.45$	$0.29$	$0.18$	$0.11$	-

Table 5. Table 5: The formation energies of the neutral vacancies in diamond and silicon (in eV), computed with 216-atom supercells and a 2 × 2 × 2 2 2 2 2\times 2\times 2 Monkhorst-Pack k-point mesh. All CIDER functionals use the PBE0/CIDER surrogate hybrid form.

Functional	C	Si
PBE	$6.47$	$3.61$
r²SCAN	$6.36$	$4.19$
SL-GGA	$6.11$	$3.48$
NL-GGA	$6.73$	$3.50$
SL-MGGA	$6.66$	$4.16$
NL-MGGA	$6.77$	$4.58$
NL-MGGA-PBE	$6.77$	$4.58$
NL-MGGA-DTR	$6.79$	$4.42$
HSE06	$6.96$	$4.54$
PBE0	$6.97$	$4.65$

Table 6. Table 6: R 1 subscript 𝑅 1 R_{1} values for each functional type and baseline functional.

Type/Baseline	$R_{1}$
NL-GGA/PBE	$1.0$
NL-GGA/Chachiyo	$20.0$
NL-MGGA/PBE	$0.05$
NL-MGGA/Chachiyo	$1.0$

Table 7. Table S1: WTMAD-2 (kcal/mol) for a few CIDER functionals. PW6B95/NL-MGGA was not evaluated on GMTKN55.

Model	SL-MGGA	NL-MGGA	NL-MGGA-PBE	NL-MGGA-DTR
PBE0/CIDER	7.86	6.59	6.62	7.32
PW6B95/CIDER	7.99	–	6.58	6.29

Table 8. Table S2: Error statistics for 453 band gaps from the database published by Borlido et al. [ 42 ] . The ML functionals use the PBE0( α 𝛼 \alpha )/CIDER formalism of Eq. 98 . a This work. b Borlido et al. [ 42 ]

	PBE⁹⁹9This work.	r²SCAN¹¹footnotemark: 1	SL-MGGA¹¹footnotemark: 1	NL-MGGA-DTR¹¹footnotemark: 1		HSE06¹⁰¹⁰10Borlido et al. [42]	PBE0²²footnotemark: 2
			$α = 0.25$	$α = 0.25$	$α = 0.35$
ME (eV)	$- 1.03$	$- 0.68$	$- 0.61$	$- 0.51$	$- 0.27$	$- 0.10$	$0.47$
MAE (eV)	$1.05$	$0.77$	$0.74$	$0.66$	$0.57$	$0.53$	$0.78$
RMSE (eV)	$1.46$	$1.15$	$1.10$	$1.04$	$0.93$	$0.86$	$0.99$
MRE	$- 0.42$	$- 0.22$	$- 0.18$	$- 0.13$	$0.02$	$0.10$	$0.54$
MARE	$0.47$	$0.36$	$0.35$	$0.32$	$0.33$	$0.31$	$0.62$
RMSRE	$0.55$	$0.52$	$0.51$	$0.51$	$0.62$	$0.67$	$1.29$

Table 9. Table S3: SL-GGA validation results according to Eq. 115 , with PBE provided as reference. Loss in kcal/mol.

Code Name	Baseline	$R_{1}$	$R_{2}$	Loss
SL_GGA_VC_000	Chachiyo	8	0.5	7.035046
SL_GGA_VC_003	Chachiyo	4	0.5	7.052635
SL_GGA_VP_008	PBE	0.1	0.5	7.079104
SL_GGA_VC_009	Chachiyo	2	0.5	7.092393
PBE	–	–	–	10.868648
SL_GGA_VP_000	PBE	0.4	0.5	16.625300
SL_GGA_VP_003	PBE	0.2	0.5	38.890716

Table 10. Table S4: NL-GGA validation results according to Eq. 115 , with PBE provided as reference. Loss in kcal/mol.

Code Name	Baseline	Scheme	$A$	$D$	Loss
NL_GGA_VC_004	Chachiyo	S1	1	0.5	3.904467
NL_GGA_VP_004	PBE	S1	1	0.5	3.928614
NL_GGA_VC_003	Chachiyo	S1	2	0.5	3.990208
NL_GGA_VP_003	PBE	S1	2	0.5	4.007883
NL_GGA_VP_005	PBE	S2	2	0.5	4.057075
NL_GGA_VC_005	Chachiyo	S2	2	0.5	4.061060
NL_GGA_VC_008	Chachiyo	S1	0.5	0.5	4.286322
NL_GGA_VP_008	PBE	S1	0.5	0.5	4.290059
NL_GGA_VP_001	PBE	S1	4	0.5	4.329216
NL_GGA_VC_001	Chachiyo	S1	4	0.5	4.342482
NL_GGA_VC_009	Chachiyo	S2	0.5	4	4.407261
NL_GGA_VP_009	PBE	S2	0.5	4	4.449937
NL_GGA_VP_000	PBE	S2	4	0.5	4.568491
NL_GGA_VC_000	Chachiyo	S2	4	0.5	4.581439
NL_GGA_VC_006	Chachiyo	S2	0.5	0.5	4.798940
NL_GGA_VP_007	PBE	S2	0.5	0.5	4.800655
PBE	–	–	–	–	10.868648

Table 11. Table S5: SL-MGGA validation results according to Eq. 115 , with PBE provided as reference. Loss in kcal/mol.

Code Name	Baseline	$R_{1}$	$R_{2}$	Loss
SL_MGGA_VC_006	Chachiyo	64	0.5	3.218994
SL_MGGA_VP_007	PBE	3.2	0.5	3.220336
SL_MGGA_VP_009	PBE	1.6	0.5	3.286767
SL_MGGA_VC_008	Chachiyo	32	0.5	3.303520
SL_MGGA_VC_009	Chachiyo	16	0.5	3.347484
SL_MGGA_VP_008	PBE	0.05	0.25	3.724633
SL_MGGA_VP_004	PBE	0.1	0.25	3.771061
SL_MGGA_VP_002	PBE	0.2	0.25	3.872119
SL_MGGA_VC_007	Chachiyo	2	0.25	3.934145
SL_MGGA_VC_003	Chachiyo	4	0.25	3.995785
SL_MGGA_VP_000	PBE	0.4	0.25	4.085934
SL_MGGA_VC_001	Chachiyo	8	0.25	4.130502
SL_MGGA_VP_001	PBE	0.8	0.25	4.370939
PBE	–	–	–	10.868648

Table 12. Table S6: NL-MGGA validation results according to Eq. 115 , with PBE provided as reference. Loss in kcal/mol.

Code Name	Baseline	Scheme	$A$	$D$	Loss
NL_MGGA_VC_005	Chachiyo	S1	1	1	1.895010
NL_MGGA_VP_007	PBE	S1	1	1	1.930917
NL_MGGA_VC_006	Chachiyo	S1	0.5	1	2.011226
NL_MGGA_VP_005	PBE	S1	0.5	1	2.039220
NL_MGGA_VC_000	Chachiyo	S2	1	1	2.039227
NL_MGGA_VP_000	PBE	S2	1	1	2.044209
NL_MGGA_VP_003	PBE	S2	1	2	2.063379
NL_MGGA_VC_003	Chachiyo	S2	1	2	2.063773
NL_MGGA_VP_006	PBE	S2	2	0.5	2.100342
NL_MGGA_VC_007	Chachiyo	S2	2	0.5	2.118458
NL_MGGA_VC_001	Chachiyo	S2	0.5	2	2.284749
NL_MGGA_VP_002	PBE	S2	0.5	2	2.313989
NL_MGGA_VP_004	PBE	S2	0.5	4	2.457190
NL_MGGA_VC_004	Chachiyo	S2	0.5	4	2.469842
NL_MGGA_VP_009	PBE	S1	0.5	4	2.504403
NL_MGGA_VC_002	Chachiyo	S2	0.5	1	2.584111
NL_MGGA_VP_001	PBE	S2	0.5	1	2.604407
NL_MGGA_VC_009	Chachiyo	S1	0.5	4	2.620007
NL_MGGA_VP_008	PBE	S2	1	4	2.620640
NL_MGGA_VC_008	Chachiyo	S2	1	4	2.665951
PBE	–	–	–	–	10.868648

Table 13. Table S7: Names and hyperparameters for CIDER functionals presented in the main text.

Name in Paper	Code Name	Baseline	$R_{1}$	$R_{2}$	Scheme	$A$	$D$
SL-GGA	SL_GGA_VC_000	Chachiyo	8	0.5	–	–	–
NL-GGA	NL_GGA_VC_004	Chachiyo	20.0	1.0	S1	1	0.5
SL-MGGA	SL_MGGA_VC_006	Chachiyo	64	0.5	–	–	–
NL-MGGA	NL_MGGA_VC_005	Chachiyo	1.0	1.0	S1	1	1
NL-MGGA-PBE	NL_MGGA_VP_007	PBE	0.05	1.0	S1	1	1
NL-MGGA-DTR	NL_MGGA_VP_DTR	PBE	0.1	1.0	S1	1	1

Table 14. Table S8: Average percentage density deviations from PBE0 (Eq. S2 with PBE0 densities for n i , ref superscript 𝑛 𝑖 ref n^{i,\text{ref}} in Eq. S1 ) for semilocal and ML functionals. The ML functionals use the PBE0/CIDER surrogate hybrid form. Whether each subdatabase was used for training or testing is indicated.

	PBE	r²SCAN	SL-GGA	NL-GGA	SL-MGGA	NL-MGGA	NL-MGGA-DTR¹¹¹¹11Note that all sub-databases contain both train and test data for NL-MGGA-DTR, so the train-test partition does not apply for this functional.
W4-11 (Train)	0.86	0.32	0.80	0.72	0.49	0.34	0.32
G21IP (Train)	0.72	0.41	0.75	0.69	0.89	0.65	0.59
BH76 (Train)	1.02	0.44	1.06	0.77	0.68	0.52	0.51
G21EA (Test)	0.80	0.36	0.91	0.58	0.64	0.41	0.38
FH51 (Test)	0.68	0.25	0.63	0.59	0.49	0.37	0.30
DIPCS10 (Test)	0.61	0.28	0.62	0.48	0.52	0.30	0.27
INV24 (Test)	0.68	0.30	0.59	0.57	0.55	0.42	0.36
PX13 (Test)	0.81	0.21	1.02	0.62	0.38	0.28	0.30

Table 15. Table S9: Bond lengths for diatomic molecules in Å. Note that these molecules are contained in the CIDER training set. All ML functionals use the PBE0/CIDER surrogate hybrid form.

	PBE	r²SCAN	SL-GGA	NL-GGA	SL-MGGA	NL-MGGA	NL-MGGA-DTR	PBE0
H2	0.750	0.741	0.750	0.747	0.746	0.742	0.744	0.745
N2	1.102	1.093	1.104	1.096	1.095	1.089	1.094	1.089
O2	1.218	1.206	1.219	1.206	1.205	1.194	1.200	1.192
F2	1.413	1.397	1.417	1.403	1.390	1.385	1.385	1.375
HF	0.930	0.921	0.930	0.922	0.919	0.916	0.918	0.918
CO	1.135	1.126	1.139	1.129	1.130	1.123	1.127	1.122
MAD vs. PBE0	0.018	0.008	0.020	0.010	0.007	0.003	0.005	—

Table 16. Table S10: Atomization energies for diatomic molecules in kcal/mol. Note that these molecules are contained in the CIDER training set. All ML functionals use the PBE0/CIDER surrogate hybrid form.

	PBE	r²SCAN	SL-GGA	NL-GGA	SL-MGGA	NL-MGGA	NL-MGGA-DTR	PBE0
H2	104.7	107.7	104.1	105.5	104.7	104.3	104.6	104.4
N2	243.6	220.3	223.8	222.5	221.3	223.1	223.2	225.6
O2	143.7	129.9	129.5	121.8	125.7	122.8	126.8	124.6
F2	52.8	38.7	45.7	35.8	34.7	34.9	38.1	35.1
HF	142.1	138.3	138.0	139.4	136.6	137.0	138.2	136.9
CO	268.9	255.6	256.0	253.0	254.8	254.8	255.6	255.6
MAD vs. PBE0	12.3	3.1	3.2	2.1	1.2	0.9	1.5	—

Table 17. Table S11: Lattice constant predictions for the SOL62 database [ 67 , 33 ] using functionals including PBE0/NL-MGGA-DTR, in units of Å. The midrule bars separate the training, validation, and test sets, in that order from top to bottom. Except for the mean absolute error reported separately for the train, validation, and test sets, error metrics are for the entire dataset.

	PBE	r²SCAN	PBE0/NL-MGGA-DTR	HSE06 [67]	Experimental
Li	3.437	3.484	3.473	3.466	3.477
C	3.572	3.560	3.567	3.549	3.567
GaAs	5.769	5.680	5.717	5.692	5.641
LiF	4.096	4.032	4.059	4.003	4.010
MgO	4.267	4.223	4.213	4.208	4.207
Pt	3.968	3.950	3.952	3.949	3.916
Pd	3.941	3.919	3.956	3.929	3.881
TiC	4.340	4.337	4.350	4.307	4.330
ZrC	4.703	4.709	4.707	4.700	4.696
Rb	5.672	5.745	5.834	5.756	5.585
Na	4.200	4.225	4.242	4.215	4.225
Sr	6.031	6.077	6.108	6.092	6.048
MgS	5.228	5.207	5.192	5.207	5.202
InP	5.954	5.906	5.923	5.907	5.861
AlSb	6.221	6.175	6.182	6.184	6.128
LiCl	5.156	5.133	5.096	5.107	5.106
GaSb	6.213	6.127	6.159	6.152	6.082
Ge	5.767	5.679	5.708	5.702	5.652
AlP	5.512	5.482	5.484	5.482	5.458
InAs	6.180	6.106	6.133	6.125	6.036
Cu	3.644	3.592	3.669	3.633	3.603
Ir	3.867	3.848	3.848	3.840	3.835
Nb	3.305	3.303	3.324	3.305	3.296
Rh	3.839	3.821	3.839	3.797	3.798
Ta	3.319	3.311	3.318	3.328	3.301
ZnSe	5.732	5.656	5.712	5.700	5.668
CdTe	6.599	6.545	6.585	6.574	6.480
HfC	4.659	4.649	4.645	4.647	4.638
TiN	4.255	4.244	4.250	4.213	4.239
ZnTe	6.169	6.097	6.149	6.152	6.104
ZrN	4.587	4.584	4.572	4.575	4.585
Ba	5.108	5.172	5.221	5.087	5.007
Ca	5.512	5.549	5.580	5.582	5.565
Al	4.041	3.991	4.016	4.020	4.032
K	5.286	5.351	5.406	5.313	5.225
GaN	4.603	4.542	4.561	4.504	4.520
GaP	5.545	5.479	5.509	5.467	5.442
BN	3.626	3.611	3.616	3.601	3.607
Si	5.475	5.450	5.456	5.444	5.430
SiC	4.389	4.365	4.372	4.353	4.358
AlAs	5.736	5.683	5.694	5.694	5.652
BP	4.551	4.539	4.549	4.530	4.538
NaF	4.701	4.611	4.646	4.635	4.609
InSb	6.620	6.548	6.577	6.561	6.469
NaCl	5.696	5.616	5.622	5.645	5.595
BAs	4.815	4.787	4.801	4.781	4.777
Sn	6.641	6.560	6.591	6.580	6.482
Ni	3.530	3.498	3.554	3.512	3.513
Fe	2.844	2.874	2.926	2.897	2.861
Ag	4.147	4.111	4.192	4.153	4.069
Mo	3.160	3.155	3.165	3.145	3.144
W	3.183	3.173	3.175	3.173	3.162
Au	4.177	4.148	4.169	4.153	4.065
V	3.003	2.998	3.022	2.970	3.028
HfN	4.552	4.532	4.522	4.531	4.519
VN	4.132	4.115	4.120	4.073	4.135
VC	4.168	4.156	4.164	4.112	4.160
CdS	5.920	5.877	5.914	5.885	5.818
NbN	4.412	4.408	4.397	4.392	4.379
CdSe	6.188	6.130	6.173	6.147	6.050
NbC	4.478	4.478	4.472	4.461	4.470
ZnS	5.445	5.380	5.435	5.419	5.404
Train MAE	0.050	0.020	0.031	0.022	-
Val. MAE	0.057	0.029	0.050	0.039	-
Test MAE	0.058	0.036	0.052	0.038	-
ME	0.051	0.025	0.046	0.025	-
MAE	0.057	0.031	0.048	0.036	-
RMSE	0.071	0.046	0.070	0.050	-
MAXE¹²¹²12Maximum absolute error.	0.159	0.165	0.249	0.171	-
MAXE Syst.¹³¹³13System with the maximum absolute error.	Sn	Ba	Rb	Rb	-

Table 18. Table S12: Bulk data for silicon for different functionals. The silicon experimental lattice constant was taken from Ref. [ 67 ] .

Functional	Lattice Constant (Å)	Band Gap (eV)
PBE	$5.475$	$0.61$
PBE0/NL-MGGA-DTR	$5.457$	$1.08$
PBE0(0.3)/NL-MGGA-DTR	$5.454$	$1.17$
Experimental	$5.430$	$1.17$

Table 19. Table S13: Transition levels for several defects in silicon for different functionals and correction schemes. “Fr.” is the Freysoldt correction scheme, while “Pot.” is the potential alignment-only scheme. All PBE0( α 𝛼 \alpha )/CIDER calculations use the NL-MGGA-DTR exchange functional. CIDER refers to PBE0(0.3)/NL-MGGA-DTR. The experimental references are from the following sources: vacancy [ 92 ] , boron [ 93 ] , phosphorus [ 94 ] , copper [ 95 ] , and sulfur [ 96 , 97 ] .

Defect	Level	PBE		PBE0( $α$ )/CIDER				Expt.
				$α = 0.25$		$α = 0.30$
		Fr.	Pot.	Fr.	Pot.	Fr.	Pot.
vac	++/+	-0.09	0.06	-0.01	0.14	-0.01	0.14	0.13
vac	+/0	-0.04	0.01	0.05	0.10	0.08	0.13	0.05
vac	++/0	-0.06	0.04	0.02	0.12	0.03	0.13	0.09
vac	0/-	0.46	0.41	0.71	0.66	0.76	0.71	—
vac	-/–	0.50	0.35	0.84	0.69	0.92	0.77	—
vac	0/–	0.48	0.38	0.77	0.68	0.84	0.74	—
P	+/0	0.53	0.58	0.99	1.03	1.08	1.13	1.12
B	0/-	0.07	0.02	0.07	0.02	0.08	0.03	0.04
Cu	+/0	0.04	0.09	0.03	0.07	0.02	0.07	0.21
Cu	0/-	0.24	0.19	0.53	0.48	0.58	0.53	0.48
Cu	-/–	0.66	0.52	0.91	0.76	0.97	0.82	1.00
S	++/+	0.12	0.27	0.47	0.61	0.54	0.69	0.58
S	+/0	0.41	0.46	0.83	0.88	0.92	0.97	0.85

Table 20. Table S14: +/0 transition level of the copper substitutional defect in silicon (excluding finite-size effects), computed using 64-atom supercells and different k-point meshes. CIDER refers to PBE0(0.3)/NL-MGGA-DTR.

k-point mesh	Functional	$E (+ / 0)$ (eV)
$2 \times 2 \times 2$ $Γ$	HSE06	0.19
$2 \times 2 \times 2$ $Γ$	CIDER	0.15
$2 \times 2 \times 2$ MP	HSE06	0.25
$2 \times 2 \times 2$ MP	CIDER	0.23
$4 \times 4 \times 4$ $Γ$	HSE06	0.10
$4 \times 4 \times 4$ $Γ$	CIDER	0.14

Equations253

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)),

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)),

n (r) = i \sum ∣ ϕ_{i} (r) ∣^{2} .

n (r) = i \sum ∣ ϕ_{i} (r) ∣^{2} .

e_{x}^{LDA} (n) = - \frac{3}{4} (\frac{3}{π})^{1/3} n^{4/3} .

e_{x}^{LDA} (n) = - \frac{3}{4} (\frac{3}{π})^{1/3} n^{4/3} .

E_{x} [n_{λ}] = λ E_{x} [n],

E_{x} [n_{λ}] = λ E_{x} [n],

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)) F_{x} (x (r)) .

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)) F_{x} (x (r)) .

x_{i} [n_{λ}] (r) = x_{i} [n] (λ r) .

x_{i} [n_{λ}] (r) = x_{i} [n] (λ r) .

s = \frac{∣\nabla n ∣}{2 ( 3 π ^{2} ) ^{1/3} n ^{4/3}} .

s = \frac{∣\nabla n ∣}{2 ( 3 π ^{2} ) ^{1/3} n ^{4/3}} .

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)) F_{x} (s (r))

E_{x} [n] = \int \differential [3] r e_{x}^{LDA} (n (r)) F_{x} (s (r))

α = \frac{τ - τ _{W}}{τ _{0}}

α = \frac{τ - τ _{W}}{τ _{0}}

τ (r) = \frac{1}{2} i \sum ∣ \nabla ϕ_{i} (r) ∣^{2},

τ (r) = \frac{1}{2} i \sum ∣ \nabla ϕ_{i} (r) ∣^{2},

E_{x} [n] = \int \differential [3] r

E_{x} [n] = \int \differential [3] r

\times F_{x} (s (r), α (r), {G_{n l m} (r)}) .

G_{n l m} (r)

G_{n l m} (r)

g_{n l m} (r^{'}; r)

g_{n l m} (r^{'}; r)

a [n] (r)

a [n] (r)

a [n_{λ}] (r) = λ^{2} a [n] (λ r) .

a [n_{λ}] (r) = λ^{2} a [n] (λ r) .

G_{i} (r_{1}) =

G_{i} (r_{1}) =

\times \int \differential [3] r_{2} Φ (a (r_{2}), b_{i} (r_{1}), r_{12}) n (r_{2})

Φ (a, b, r) =

a [n] (r)

a [n] (r)

b_{i} [n] (r)

a [n] (r)

a [n] (r)

b_{i} [n] (r)

\functionalderivative E_{xc} n (r) = \partialderivative e_{xc} (r) n (r) .

\functionalderivative E_{xc} n (r) = \partialderivative e_{xc} (r) n (r) .

\functionalderivative E_{xc} τ (r) = \partialderivative e_{xc} (r) τ (r) .

\functionalderivative E_{xc} τ (r) = \partialderivative e_{xc} (r) τ (r) .

V_{μν}^{xc} =

V_{μν}^{xc} =

+ \frac{1}{2} \int \differential [3] r \nabla χ_{μ}^{*} (r) \cdot \nabla χ_{ν} (r) \functionalderivative E_{xc} τ (r) .

E_{xc} = \int \differential [3] r e_{xc} (n (r), ∣\nabla n (r) ∣, τ (r), G (r)),

E_{xc} = \int \differential [3] r e_{xc} (n (r), ∣\nabla n (r) ∣, τ (r), G (r)),

\functionalderivative E_{xc} n (r) =

\functionalderivative E_{xc} n (r) =

+ \int \differential [3] r^{'} \partialderivative e_{xc} (r^{'}) G (r^{'}) \functionalderivative G (r^{'}) n (r)

\functionalderivative E_{xc} τ (r) =

+ \int \differential [3] r^{'} \partialderivative e_{xc} (r^{'}) G (r^{'}) \functionalderivative G (r^{'}) τ (r)

E_{xc}^{i} = g \in i \sum w_{g}^{i} e_{xc} (x_{g}^{i}),

E_{xc}^{i} = g \in i \sum w_{g}^{i} e_{xc} (x_{g}^{i}),

Cov (a + b, c + d) =

Cov (a + b, c + d) =

+ Cov (a, d) + Cov (b, d) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMachine Learning in Materials Science · X-ray Diffraction in Crystallography · Surface and Thin Film Phenomena

Full text

\NewDocumentCommand\MakeTitleInner

+m +m +m

#1

#2

#3

\NewDocumentCommand\MakeTitle

+m +m +m \MakeTitleInner#1#2#3

Nonlocal Machine-Learned Exchange Functional for Molecules and Solids

Kyle Bystrom

[email protected]

Harvard John A. Paulson School of Engineering and Applied Sciences

Boris Kozinsky

[email protected]

Harvard John A. Paulson School of Engineering and Applied Sciences

Robert Bosch LLC Research and Technology Center, Cambridge, MA, USA

Abstract

The design of better exchange-correlation functionals for Density Functional Theory (DFT) is a central challenge of modern electronic structure theory. However, current developments are limited by the mathematical form of the functional, with efficient semilocal functionals being inaccurate for many technologically important systems and the more accurate hybrid functionals being too expensive for large solid-state systems due to the use of the exact exchange operator. In this work, we use machine learning combined with exact physical constraints to design an exchange functional that is both orbital-dependent and nonlocal, but which can be evaluated at roughly the cost of semilocal functionals and is significantly faster than hybrid DFT in plane-wave codes. By training functionals with several different feature sets, we elucidate the roles of orbital-dependent and nonlocal features in learning the exchange energy and determine that both types of features provide vital and independently important information to the model. Having trained our new exchange functional with an expressive, nonlocal feature set, we substitute it into existing hybrid functionals to achieve hybrid-DFT accuracy on thermochemical benchmark sets and improve the accuracy of band gap predictions over semilocal DFT. To demonstrate the scalability of our approach as well as the practical benefits of improved band gap prediction, we compute charged defect transition levels in silicon using large supercells. Due to its transferability and computational efficiency for both molecular and extended systems, our model overcomes the cost-accuracy trade-off between semilocal and hybrid DFT, and our general approach provides a feasible path toward a universal exchange-correlation functional with post-hybrid DFT accuracy and semilocal DFT cost.

††preprint: APS/123-QED

I Introduction

One of the central challenges of density functional theory (DFT) research is to design accurate approximations to the exchange-correlation (XC) functional [1]. As the one unknown term in Kohn-Sham DFT [2], errors in the XC functional limit the predictive power of DFT for a wide variety of technological problems, including catalysis [3], battery materials [4], and semiconductor physics [5, 6]. Increasing the complexity of the XC functional can improve its accuracy, but often at drastically increased computational cost, which can be prohibitive for large condensed matter simulations. In this work, we use machine learning (ML) to overcome this cost-accuracy trade-off by training a functional that obeys exact physical constraints and is accurate, transferable across a broad range of chemistries, and scalable to hundreds of atoms.

The key cost-accuracy trade-off we address is that between semilocal and hybrid XC functionals. Semilocal functionals, which include the local density (LDA), generalized gradient (GGA), and meta-generalized gradient (meta-GGA) approximations [7], are computationally efficient but insufficiently accurate for many applications. For example, LDA and GGA functionals are incapable of correctly predicting band gaps due to their lack of derivative discontinuity [8, 9], and they are of limited accuracy for many molecular systems [10, 1]. Meta-GGAs introduce derivative discontinuities into the XC functional and can therefore improve band gap predictions, but they typically account for only 20-50% of the GGA band gap error [11]. Meta-GGAs can be designed to further improve band gap prediction [12, 13], but this typically makes them less accurate for other chemical and material properties [14, 13]. On the other hand, hybrid functionals, which mix a fraction of the nonlocal, orbital-dependent exact exchange energy into an XC functional [15], provide significantly improved predictions of band gaps [11, 16] and thermochemical properties [10, 1] over semilocal DFT, but this accuracy comes at a steep additional computational cost, especially within plane-wave DFT. In spite of the remarkable recent progress of efficient implementations for solids [17, 18, 19, 20, 21, 22], hybrid DFT is not as efficient or practical as semilocal DFT for these calculations.

Several approaches to design improved XC functionals have been pursued in the literature. Many popular functionals, like PBE [23] and SCAN [24], are designed primarily or entirely based on mathematical constraints obeyed by the exact functional. Constraint-based functionals are often considered trustworthy because they can predict molecular properties without being explicitly tuned to match them, reducing the risk of bias toward a particular property or set of systems [25, 26]. However, some of the most popular and accurate XC functionals for main-group chemistry are empirically fit to chemical data [27, 28]. Recently, significant progress has been made toward combining constraint and data-based design approaches using machine learning [29, 30, 31, 32, 33, 34, 35], culminating in the development of the DM21 functional [36], which achieves state-of-the-art accuracy on main-group chemistry benchmarks and also improves the description of static correlation by fitting fractional charge constraints. However, these developments have a few key shortcomings, the most notable of which is their computational cost. DM21 uses a local range-separated hybrid formalism, which is both computationally expensive (even more so than global hybrid DFT) and not widely available for the plane-wave DFT codes typically used for investigating large extended systems. As a result, DM21 has not been applied to solids. To the best of our knowledge, no nonlocal machine-learned functional has yet been implemented for both molecular (Gaussian-type orbital) and extended (plane-wave) DFT calculations. All ML XC functionals that have been applied to solids have been limited to the semilocal form [37, 38, 39, 33, 13, 32] (with the exception of minimally parameterized nonlocal van der Waals terms). This limitation restricts the application space of ML functionals by preventing their use in applications where hybrid DFT accuracy is needed. For example, hybrid DFT has proven useful for studying problems in catalysis within both Gaussian-type orbital and plane-wave codes [40, 41] and for addressing the “band-gap problem” in DFT [42]. In addition to this technical limitation, applications of ML functionals to complex and heterogeneous systems are limited by their lack of transferability and universality, as they have previously only been trained on molecular systems.

These limitations raise the question of whether machine learning models can leverage more efficient features to balance accuracy and computational cost. One possible approach in this direction is to generate nonlocal features by taking integrals of the density over some set of kernel functions [43, 30]. This way, an ML model of the XC functional can capture nonlocal information about the density without evaluating the computationally demanding exact exchange operator required for hybrid DFT. In line with this strategy, we recently introduced the Compressed Scale-Invariant Density Representation (CIDER) formalism for learning the exchange functional [35], in which the nonlocal features were designed to enforce the uniform scaling rule [44] on the learned exchange functional. We demonstrated that the exchange functional can be accurately learned with the CIDER approach. However, the high computational cost of computing the features, lack of implementation for plane-wave DFT, and need to compute and fit the exchange energy density (rather than the total exchange energy) limited the practicality of our model and restricted its application to small molecules.

In this work, we overcome all of these limitations by revising the form of the features, implementing efficient algorithms to evaluate the new features for both molecular and extended systems, and redesigning our Gaussian process model to fit the total exchange energy. These improvements enable the construction of the first nonlocal ML exchange functional for solids, which can be applied to large systems with similar computational cost to semilocal DFT. We use this exchange functional to implement efficient and accurate hybrid DFT surrogates that can be used in a variety of applications. Because our new functional form is both transferable and numerically stable, we believe that our approach will also be useful for learning the full XC functional, allowing us to address problems where DFT accuracy is limited by both exchange and correlation effects.

By training and testing exchange functionals with different feature sets, we find that to learn the exchange functional well enough to reproduce hybrid DFT accuracy, it is necessary to include both a semilocal orbital-dependent feature (i.e. the kinetic energy density) and nonlocal density-dependent features in the model input. Having developed this “nonlocal meta-GGA,” we show that it accurately predicts molecular properties and improves solid-state band gap predictions over standard meta-GGAs, which is a central problem in DFT. The performance for solid-state band gaps is particularly promising because only 9 systems in the training set are solids, with the others being isolated molecules and complexes. We also encounter only a few convergence problems in our self-consistent field (SCF) calculations of 2462 molecular and 453 periodic systems, indicating that the functional has good numerical stability. To demonstrate the transferability and efficiency of the model, we simulate systems with over 500 atoms to predict charged defect transition levels in silicon.

The paper is organized as follows. Section II describes the features and models used in this work. Specifically, Sections II.1-II.3 cover the design of the exchange functional form, nonlocal features, and Gaussian process models, and Sections II.4-II.6 cover the computationally efficient implementation of the nonlocal features. The feature implementation requires different algorithms for Gaussian-type orbital and plane-wave DFT codes due to the use of different basis sets and integration grids within these codes. Section III presents our results and discusses the accuracy of the CIDER functionals as well as their computational efficiency within plane-wave DFT. Section IV contains our conclusion. Appendix A provides a detailed description of the computational methods used in this study, and Appendix B provides additional mathematical details for the plane-wave implementation of CIDER.

II Theory

II.1 Uniform Scaling and Semilocal Exchange Functionals

The simplest approximate exchange functional is the local density approximation (LDA), in which the exchange energy is written as an integral over a function of the density $n(\mathbf{r})$ at each point in space

[TABLE]

where in the Kohn-Sham formalism [2], the density is expressed in terms of the occupied Kohn-Sham orbitals $\{\phi_{i}(\mathbf{r})\}$ as

[TABLE]

Typically, $e_{\text{x}}^{\text{LDA}}\left(n\right)$ is defined as the exchange energy density of the uniform electron gas [45]

[TABLE]

In addition to making Eq. 1 exact for the uniform electron gas, Eq. 3 also satisfies an important property of the exchange functional known as uniform scaling [44, 46], which states that for a scaled density distribution $n_{\lambda}(\mathbf{r})=\lambda^{3}n(\lambda\mathbf{r})$ ,

[TABLE]

where $\lambda$ is a positive scalar.

More sophisticated functionals that obey Eq. 4 can be designed by writing the exchange energy density as the product of Eq. 3 and an exchange enhancement factor $F_{\text{x}}$ :

[TABLE]

In the above equation, $\mathbf{x}(\mathbf{r})$ is a vector of scale-invariant features $x_{i}$ , where scale-invariance is defined as

[TABLE]

For example, a generalized gradient approximation (GGA) exchange functional is constructed by making $F_{\text{x}}$ a function of the reduced gradient $s$ of the density

[TABLE]

Because $s$ satisfies Eq. 6, the GGA form

[TABLE]

satisfies Eq. 4. Likewise, a meta-GGA can be constructed by making the exchange enhancement factor a function of two ingredients $F_{\text{x}}(s,\alpha)$ , where

[TABLE]

is a scale-invariant quantity that depends on the kinetic energy density

[TABLE]

and $\tau_{W}=\frac{|\nabla n|^{2}}{8n}$ and $\tau_{0}=\frac{3}{10}(3\pi^{2})^{2/3}n^{5/3}$ are the single-orbital and uniform electron gas kinetic energies, respectively.

II.2 Updated Nonlocal Features for CIDER

In our initial construction of the CIDER formalism [35], the exchange functional is defined as

[TABLE]

In Eq. 11, the functions $\{G_{nlm}(\mathbf{r})\}$ are nonlocal features of the density defined by

[TABLE]

where $g_{nlm}(\mathbf{r}^{\prime};\mathbf{r})$ is the kernel function

[TABLE]

with $M_{l}=B_{0}^{3/2}\sqrt{4\pi^{l-1}}\left(\frac{8\pi}{3}\right)^{\frac{l}{3}}$ . The position-dependent exponent $a(\mathbf{r})$ is defined as a semilocal quantity

[TABLE]

where $B_{0}$ and $C_{0}$ are adjustable constants that control how quickly the width of the Gaussian-type function in Eq. 13 decreases as the density and kinetic energy density increase. Equation 14 behaves quadratically under uniform scaling, i.e.

[TABLE]

This property ensures that $G_{nlm}(\mathbf{r})$ is scale-invariant (Eq. 6) and therefore that Eq. 11 satisfies the uniform scaling rule (Eq. 4).

Using these scale-invariant nonlocal features, we were able to train an ML exchange functional that accurately reproduced hybrid DFT atomization energies [35], but there is room for improvement in making the features more physically intuitive and easier to implement with production-level performance. In this work, we simplify the features by removing the spherical harmonics and instead adding a second position-dependent exponent to the kernel function, resulting in the new features

[TABLE]

where $r_{12}=|\mathbf{r_{1}}-\mathbf{r_{2}}|$ . Multiple features $G_{1},G_{2},...$ are generated by evaluating Eq. 16 with different choices of exponent $b_{i}(\mathbf{r})$ , which in turn are tuned by constants $B_{i}$ and $C_{i}$ as discussed below. Because the position-dependent exponents increase as the density increases, the $a$ exponent effectively damps the value of the integrand when $\mathbf{r}_{2}$ is near the atomic core, which we intend to limit the interaction between the valence region and core region. The prefactor $\left(\frac{B_{i}+B_{0}}{2}\right)^{3/2}$ is chosen so that $G_{i}=2$ for the non-spin-polarized uniform electron gas.

To retain the scale-invariance of the features, any choice of $a[n](\mathbf{r})$ and $b_{i}[n](\mathbf{r})$ that satisfies Eq. 15 is sufficient. For example, the exponents can be an orbital-independent function of the density and its gradient

[TABLE]

or can be made kinetic-energy dependent, as in our previous work:

[TABLE]

Equations 18 and 19 are used for the nonlocal GGA presented in this work (c.f. Section II.3), while Eqs. 20 and 21 are used for the nonlocal meta-GGA. Throughout Sections II.4–II.6, the $\left(\frac{B_{i}+B_{0}}{2}\right)^{3/2}$ factor and the feature index $i$ will be dropped for simplicity, as the inclusion of the scaling factor and the introduction of additional features by introducing new $B_{i}$ and $C_{i}$ is trivial.

Before continuing, we briefly discuss the construction of the XC potential when nonlocal features are employed. A typical semilocal functional has a multiplicative XC potential

[TABLE]

A meta-GGA also has a non-multiplicative potential

[TABLE]

This above term is non-multiplicative because it depends on the orbitals through $\tau(\mathbf{r})$ , rather than depending exclusively on the density $n(\mathbf{r})$ . For a given basis set $\left\{\chi_{\mu}(\mathbf{r})\right\}$ , the contributions of the XC potential to the effective Kohn-Sham Hamiltonian matrix are

[TABLE]

When nonlocal features are included in the XC energy density, e.g.

[TABLE]

both $\functionalderivative{E_{\text{xc}}}{n(\mathbf{r})}$ and $\functionalderivative{E_{\text{xc}}}{\tau(\mathbf{r})}$ include terms that depend on other coordinates $\mathbf{r}^{\prime}$ :

[TABLE]

In Eq. 26, $\partialderivative{G(\mathbf{r})}{n(\mathbf{r})}$ describes the local dependence of $G(\mathbf{r})$ on the density, and $\functionalderivative{G(\mathbf{r}^{\prime})}{n(\mathbf{r})}$ describes the nonlocal dependence on the density. While the expressions for $\functionalderivative{E_{\text{xc}}}{n(\mathbf{r})}$ and $\functionalderivative{E_{\text{xc}}}{\tau(\mathbf{r})}$ are more complicated than for a meta-GGA, they still contribute to the XC potential matrix exclusively via Eq. 24. Therefore, any DFT code supporting meta-GGAs can evaluate this type of functional if a routine is added to evaluate the nonlocal contributions to Eqs. 26 and 27.

II.3 Gaussian Process Models for the Exchange Functional

II.3.1 Overview

To study the impact of different features on the accuracy of an ML exchange functional, we implement four functional types that use different sets of features.

SL-GGA is a standard, semilocal GGA like PBE [23], except trained via machine learning. The exchange enhancement factor ( $F_{\text{x}}$ in Eq. 5) is a function of $s$ only.

NL-GGA is a nonlocal GGA. The exchange enhancement factor is a function of $s$ and also three nonlocal features of the form in Eq. 16, with Eqs. 18 and 19 for the exponents.

SL-MGGA is a standard, semilocal meta-GGA like SCAN [24], except trained via machine learning. The exchange enhancement factor is a function of $s$ and $\alpha$ .

NL-MGGA is a nonlocal meta-GGA. The exchange enhancement factor is a function of $s$ , $\alpha$ , and three nonlocal features of the form in Eq. 16, with Eqs. 20 and 21 for the exponents.

In short, the SL-GGA and SL-MGGA are semilocal functionals, while the NL-GGA and NL-MGGA augment the semilocal functional form with nonlocal features of the density. As described below, Gaussian process models for the exchange energy are trained using each of these four feature sets. Section II.3.2 explains how a Gaussian process can be fit to total exchange energies of systems, Section II.3.3 lists the specific mathematical form of each feature vector that is used as input to the Gaussian processes, and Section II.3.4 lists the covariance kernels for each feature set.

II.3.2 Fitting the Total Exchange Energy

In this section, we introduce a method to train a Gaussian process to the total exchange energy, as opposed to the nonunique exchange energy density used in our previous work [35]. While we only learn the exchange energy in this work, the approach in this section also applies to the XC energy, so we write the formulas for $E_{\text{xc}}$ for the sake of generality. We index the chemical systems in the training set by $i$ and $j$ ; while the feature implementations are different for Gaussian-type orbital and plane-wave DFT, the role of systems computed within these two frameworks is identical with regard to model training. The XC energy of a system $i$ is given by a numerical integral of the energy density

[TABLE]

where $g$ indexes grid points, $w_{g}^{i}$ are the quadrature weights, and $\mathbf{x}_{g}^{i}$ is the feature vector of molecule $i$ at point $g$ .

Constructing a Gaussian process for $E_{\text{xc}}$ requires defining the covariance between the XC energies for two different systems, i.e. $\text{Cov}(E_{\text{xc}}^{i},E_{\text{xc}}^{j})$ (c.f. Chapter 2.2 of Ref. [47]). The covariance of sums of random variables $a,b,c$ , and $d$ is

[TABLE]

Using Eq. 29, $\text{Cov}(E_{\text{xc}}^{i},E_{\text{xc}}^{j})$ can be expressed in terms of the covariances between the exchange energy densities as

[TABLE]

where $k_{e_{\text{xc}}}(\mathbf{x},\mathbf{x}^{\prime})$ is the kernel function for the covariance between the exchange energy densities for feature vectors $\mathbf{x}$ and $\mathbf{x}^{\prime}$ . Unfortunately, this expression can be quite expensive to compute because there are tens of thousands of grid points for each small molecule in the training database, and there are several hundred training molecules. Therefore, even for a database of this size, at least $10^{12}$ evaluations of the kernel are required. To circumvent this problem, a set of control points $\mathbf{\tilde{x}}_{a}$ can be sampled from the training database to use as a basis set for the feature space (as described in Appendix A.6), effectively creating a sparse Gaussian process [47]. Equation 31 is then approximated as

[TABLE]

In this work, the exchange functional alone is trained, and the kernel function is expressed in terms of the exchange enhancement factor $F_{\text{x}}=e_{\text{x}}/e_{\text{x}}^{\text{LDA}}$ . This leads to the adjusted expressions

[TABLE]

where $k_{F_{\text{x}}}(\mathbf{x},\mathbf{x}^{\prime})$ is the covariance kernel for the exchange enhancement factor and $(e_{\text{x}}^{\text{LDA}})_{g}^{i}=e_{\text{x}}^{\text{LDA}}\left(n_{i}(\mathbf{r}_{g})\right)$ is the LDA exchange energy density at coordinate $g$ for molecule $i$ (see Eq. 3). Having obtained these covariance expressions, they can be plugged into the predictive function for a Gaussian process to obtain the predicted exchange energy density [47]

[TABLE]

where $n_{*}$ and $\mathbf{x}_{*}$ are the density and feature vector at a test point and $\Sigma_{\text{noise}}$ is the noise matrix for the training points.

II.3.3 Model Types and Feature vectors

This section lists the features vectors for each model type in terms of $s$ , $\alpha$ , and the nonlocal features $G_{i}$ . All of the feature vectors below are designed such that for the uniform electron gas, $\mathbf{x}=\mathbf{0}$ . Since the exchange enhancement factor is $F_{\text{x}}=1$ for the uniform electron gas [45], the uniform electron gas constraint is enforced with a single noiseless training point $F_{\text{x}}^{\text{CIDER}}(\mathbf{0})=1$ . All functionals in this work satisfy the uniform electron gas constraint.

For the SL-GGA, the feature vector $\mathbf{x}$ (Eq. 5) is of dimension 1:

[TABLE]

where $\gamma=0.243$ . For the SL-MGGA, the feature vector is of dimension 2:

[TABLE]

The NL-GGA feature vector is of dimension 4:

[TABLE]

Lastly, the NL-MGGA feature vector is of dimension 5:

[TABLE]

The three nonlocal features $G_{1},G_{2},G_{3}$ are given by Eq. 16, and the parameterization of the kernel exponents (i.e. the constants $B_{i}$ and $C_{i}$ in Eqs. 18-21) is described in detail in Appendix A.8.

II.3.4 Kernels

To fully define the Gaussian process model for the exchange energy, we need to specify the kernel function $k_{F_{\text{x}}}(\mathbf{x},\mathbf{x}^{\prime})$ and the noise matrix $\Sigma_{\text{noise}}$ . The noise matrix is constructed to heuristically approximate the model uncertainty for each data point as described in Appendix A.7. To reflect the different number of features in each functional type, a different kernel is used for $k_{F_{\text{x}}}(\mathbf{x},\mathbf{x}^{\prime})$ for each, as specified below:

[TABLE]

where $\Sigma_{\text{Cov}}$ is a covariance hyperparamter that can be tuned for each model, and the base kernel $k_{i}(x,x^{\prime})$ for all of the above is the squared-exponential kernel

[TABLE]

Equations 54 and 55 use the additive Gaussian process approach of Duvenaud et al. [48].

Because the cost of evaluating Gaussian processes scales linearly with the number of training points, we map the models to cubic splines after training so they can be evaluated efficiently. The semilocal kernels have 1 and 2 features, so they can be mapped to 1 and 2-dimensional cubic splines, respectively. The nonlocal kernels are a sum of 3-dimensional kernels, so they can be mapped to a sum of 3-dimensional cubic splines [35, 49, 50, 51]. For a description of the hyperparameter selection process for these models, see Appendix A.8.

II.4 Auxiliary Expansion of the CIDER Features

The rest of this section focuses on the efficient implementation of nonlocal features within Gaussian-type orbital and plane-wave DFT, which is vital for the practicality of CIDER as well as any nonlocal density functional. Evaluating Eq. 16 via simple numerical integration is computationally expensive due to quadratic scaling with the number of grid points. The key insight to designing a faster approach comes from Román-Pérez and Soler [52], who developed a method to expand van der Waals density functionals as a sum of convolutions that could be computed efficiently using the Fast Fourier Transform (FFT). They observed that a function of the form in Eq. 17, which depends exclusively on the distance between two points and on a semilocal functional of the density at each point, can be approximately expanded as [52]

[TABLE]

In Eq. 57, $\alpha$ and $\beta$ are indices which each correspond to interpolation control points $q_{\alpha}$ and $q_{\beta}$ , respectively, and $p^{a}_{\alpha}(a(\mathbf{r}))$ and $p^{b}_{\beta}(b(\mathbf{r}))$ are interpolating functions which take the values $p^{a}_{\alpha}(q_{\alpha^{\prime}})=\delta_{\alpha\alpha^{\prime}}$ and $p^{b}_{\beta}(q_{\beta^{\prime}})=\delta_{\beta\beta^{\prime}}$ at the interpolating points. The term $\Phi_{\alpha\beta}(r)\equiv\Phi(q_{\alpha},q_{\beta},r)$ .

The Gaussian form of Eq. 17 is particularly conducive to the use of an even-tempered basis for interpolation, in which $q_{\alpha}$ takes the form

[TABLE]

for some constant $\lambda>1$ . Román-Pérez and Soler [52] used cubic splines for interpolation, and for the solid-state implementation of CIDER we follow this approach, with $p^{a}_{\alpha}(a(\mathbf{r}))$ being a cubic spline in the transformed coordinate $\gamma(\mathbf{r})=\frac{\ln(a(\mathbf{r})/q_{0})}{\ln(\lambda)}$ . However, for the molecular implementation, we instead treat the control points as a basis set of Gaussians $e^{-q_{\alpha}r_{12}^{2}}$ onto which the function $e^{-a(\mathbf{r_{1}})r_{12}^{2}}$ is projected:

[TABLE]

As $\lambda\rightarrow 1$ , the number of control points (for the interpolation scheme) and number of basis functions (for the basis expansion scheme) both become infinite, and the kernel expansion becomes exact for both approaches. We determined that the cubic interpolation yields better numerical stability for the plane-wave implementation, while the basis expansion form is sufficiently stable for the molecular implementation.

The benefit of expanding the kernel in this manner is that it allows the nonlocal features (Eq. 16) to be expressed as a sum of convolutions

[TABLE]

with $\theta_{\alpha}(\mathbf{r})=p^{a}_{\alpha}(a(\mathbf{r}))n(\mathbf{r})$ . Within plane-wave DFT, the convolutions can be performed with quasi-linear computational complexity by using FFTs [52]. In addition, because CIDER uses a squared exponential for $\Phi_{\alpha\beta}(r_{12})$ (Eq. 17), the integrals of Eq. 62 can be evaluated analytically if $\theta_{\alpha}(\mathbf{r})$ is projected onto a Gaussian-type orbital basis. We use this approach to design an efficient implementation of nonlocal features for molecular DFT (Section II.5).

The Román-Pérez and Soler technique effectively solves the problem of implementing CIDER for an all-electron plane-wave DFT calculation. However, such calculations are infeasible for all but the smallest atoms because a prohibitively large plane-wave basis would be required to describe the core electrons. In addition, pseudopotentials cannot be reliably used with nonlocal features because $G(\mathbf{r})$ calculated on the pseudo-density is not sufficiently similar to the exact, all-electron feature to get accurate results. To solve this problem, in Section II.6 a method is introduced to implement nonlocal features within the Projector-Augmented Wave (PAW) formalism, which requires careful attention to the contributions to $G(\mathbf{r})$ from the core regions around the atoms.

II.5 Efficient Implementation of Nonlocal CIDER Features for Molecular DFT

Computing Eq. 16 using simple numerical integration is prohibitively expensive because the number of operations is proportional to the square of the number of grid points. While van der Waals functionals like VV10 [53] are usually computed this way within Gaussian-type orbital codes, the contribution of such functionals is sufficiently small and smooth that a very sparse grid can be used. The nonlocal features require a grid of the same density as the semi-local XC grid, and the number of operations per pair of grid points is higher for evaluating Eq. 16 than, for example, the VV10 kernel. It is therefore necessary to reduce the computational complexity of evaluating the nonlocal features for molecular systems.

In this section, we present a charge-partitioning and density fitting-based approach to compute the nonlocal features within Gaussian-type orbital DFT codes for molecular systems, which we have implemented for use in the PySCF code [54, 55]. Much of this approach is based on the work of Franchini et al. [56], who developed a similar method for computing the Coulomb energy. To start, the density is partitioned into atomic contributions using Becke partitioning [57],

[TABLE]

where $A$ is the atom index and $W_{A}(\mathbf{r})$ are the atomic Becke weights. The first step in evaluating $G(\mathbf{r})$ is to project $\theta_{\alpha}(\mathbf{r})$ onto spherical harmonics for each atom (with $L$ being the combined $l$ and $m$ indices and $\mathbf{R}_{A}$ being the position of atom $A$ ):

[TABLE]

where $\Omega_{A}=(\theta,\phi)$ is the solid angle and $Y_{L}(\Omega)$ are the spherical harmonics. This projection is done numerically using Lebedev grids [58]. Then each $L$ channel is projected onto an even-tempered Gaussian basis $\{\chi_{A,L,\mu}(\mathbf{r})\}$ as follows:

[TABLE]

where $N_{l,\mu}$ is a normalization constant. A transformation to $\theta_{\alpha,L,\mu}^{A}$ is then performed by multiplying $\tilde{\theta}_{\alpha,L,\mu}^{A}$ by the inverse overlap matrix of the $\chi$ basis for each $l$ channel:

[TABLE]

For each atom, a second even-tempered basis $\{\xi_{A,L,\mu}(\mathbf{r})\}$ is introduced to serve as a basis for the convolutions of $\chi_{A,L,\mu}(\mathbf{r})$ . The following projections are computed using analytical Gaussian integrals:

[TABLE]

The transformation to $F_{\beta,L,\nu}^{A}$ is then performed by multiplying $\tilde{F}_{\beta,L,\nu}^{A}$ by the inverse overlap matrix of the $\xi$ basis, similarly to Eq. 69.

The second to last step is to transform $F_{\beta,L,\nu}^{A}$ to $F_{\beta}(\mathbf{r}_{g})$ , where $\mathbf{r}_{g}$ are the DFT integration grid points. Because evaluating every $\xi$ orbital of every atom at every $\mathbf{r}_{g}$ would be computationally expensive, $F_{\beta,L,\nu}^{A}$ is first transformed to a cubic spline $F_{\beta,L,G,p}^{A}$ , where $G$ is a radial coordinate index and $p$ indexes the polynomial coefficients of the spline in the interval between points $G$ and $G+1$ . The transformation from $F_{\beta,L,G,p}^{A}$ to $F_{\beta}(\mathbf{r}_{g})$ is then performed as

[TABLE]

where $\mathbf{r}_{gA}=\mathbf{r}_{g}-\mathbf{R}_{A}$ . The spline index $G$ is a step function of $|\mathbf{r}_{gA}|$ . The computational cost of evaluating the splines depends heavily on the efficiency of memory access patterns on the processor, so the coordinates are sorted by $G$ before evaluating the contributions to $F_{\beta}(\mathbf{r}_{g})$ from each atom.

The last and simplest step to compute $G(\mathbf{r})$ is to evaluate the sum

[TABLE]

After computing the XC energy from $G(\mathbf{r})$ and the semilocal features, the contributions to the XC potential from the $\theta_{\alpha}(\mathbf{r})$ terms can then be computed by passing backward through the matrix multiplications and spline evaluations detailed above.

The gradient of the total energy with respect to nuclear positions is required for the evaluation of forces. For this implementation of the CIDER functional with nonlocal features, the energy gradient with respect to the nuclear position of atom $A$ is given by

[TABLE]

where $w_{g}$ is the numerical integration quadrature weight corresponding to the coordinate $\mathbf{r}_{g}$ . The analytical energy gradient for a semilocal functional $\nabla_{A}^{\text{sl}}E_{\text{xc}}$ is given by Johnson et al. [59] Eq. 11.

The computational bottleneck of this algorithm is the spline evaluation in Eq. 72, whose cost scales quadratically with system size. Specifically, the cost scales as $\Theta(N_{\text{grid}}\,N_{\text{atom}}\,N_{\beta}\,l_{\text{max}}^{2})$ , where $N_{\text{grid}}$ is the number of integration grid points, $N_{\text{atom}}$ is the number of atoms, $N_{\beta}$ is the number of kernel interpolation points in Eq. 57, and $l_{\text{max}}$ is the maximum spherical harmonic order used in the $\xi$ basis. All other components of the algorithm are performed on individual atoms and therefore scale linearly with system size.

II.6 Quasi-linear-scaling Implementation of Nonlocal CIDER Features for Plane-Wave DFT with PAW

This section covers our algorithm for evaluating the nonlocal CIDER features within the PAW method, which we have implemented for use in the GPAW code [60, 61]. As discussed above, Eqs. 57 and 62 enable the evaluation of $G(\mathbf{r})$ in $O(N\log N)$ complexity if the density $n(\mathbf{r})$ is represented on an FFT grid. However, in the PAW formalism of Blöchl [62], the all-electron density $n(\mathbf{r})$ is represented as

[TABLE]

where $A$ indexes the atoms, $\tilde{n}(\mathbf{r})$ is the pseudo-density on the FFT grid, and the terms with the 1 superscript are represented on radial support grids centered around each atom and extending to some cutoff radius $r_{c}$ . Within the cutoff radius of atom $A$ , $\tilde{n}(\mathbf{r})=\tilde{n}_{A}^{1}(\mathbf{r})$ and $n_{A}^{1}(\mathbf{r})$ is the all-electron density. Outside the cutoff radius of atom $A$ , $n_{A}^{1}(\mathbf{r})=\tilde{n}_{A}^{1}(\mathbf{r})$ .

For a semilocal functional, Eq. 75 yields a simple expression for the XC energy $E_{\text{xc}}[n]$ :

[TABLE]

However, the case is not so simple for a nonlocal functional such as CIDER. If Eq. 16 is evaluated on the pseudo-density $\tilde{n}(\mathbf{r})$ , the resulting $G(\mathbf{r})$ is not equal to the all-electron feature even outside the cutoff radius because the convolutions used to compute the feature are nonlocal. For the same reason, evaluating Eq. 16 on the atomic density $n_{A}^{1}(\mathbf{r})$ does not yield the all-electron feature inside the cutoff radius. Therefore, an implementation of CIDER (and other nonlocal functionals with similar forms, such as some kinetic energy and van der Waals functionals) for PAW must construct a “pseudo-feature” $\tilde{G}(\mathbf{r})$ that is smooth but yields the correct feature outside the cutoff radius, as well as construct the “all-electron feature” $G(\mathbf{r})$ on the radial support grids inside the cutoff radius. Because of this difficulty, current plane-wave implementations of nonlocal functionals like van der Waals correlation [63] use the pseudo-density rather than the full PAW density. When this approach is used, one may need to use PAW datasets with more valence electrons, increasing computational cost [63]. In addition, because CIDER computes the exchange energy, which is of much larger magnitude than the vdW correlation energy, higher precision is required. A similar problem exists for kinetic energy density functionals. There is an implementation of semilocal kinetic energy functionals with PAW [64], and of nonlocal kinetic energy functionals with ultrasoft pseudopotentials [65], but no nonlocal kinetic energy functional implementation within PAW. Therefore, by designing an algorithm for computing nonlocal features for CIDER within the PAW formalism, we also enable the full integration of vdW functionals and OF-DFT into the PAW method.

Starting from Eq. 75, the all-electron value of $G(\mathbf{r})$ outside the core region is given by

[TABLE]

The quantities $\tilde{b}(\mathbf{r})$ and $\tilde{\theta}_{\alpha}(\mathbf{r})$ are evaluated on the pseudo-density $\tilde{n}(\mathbf{r})$ ; $\theta^{1}_{A,\alpha}(\mathbf{r})$ and $\tilde{\theta}^{1}_{A,\alpha}(\mathbf{r})$ are evaluated on $n_{A}^{1}(\mathbf{r})$ and $\tilde{n}_{A}^{1}(\mathbf{r})$ , respectively. $\tilde{F}_{\beta}(\mathbf{r})$ can be evaluated simply using FFTs. To compute $\Delta F_{A,\beta}(\mathbf{r})$ , the term $\Delta\theta^{1}_{A,\alpha}(\mathbf{r})=\theta^{1}_{A,\alpha}(\mathbf{r})-\tilde{\theta}^{1}_{A,\alpha}(\mathbf{r})$ is first projected onto spherical harmonics and Gaussians as described in the previous section. Because $\Delta\theta^{1}_{A,\alpha}(\mathbf{r})$ is localized within the augmentation region and mostly spherical, far fewer spherical harmonics and Gaussians are required per atom than for the Gaussian-type orbital implementation. This projection yields $\Delta\theta_{A,\alpha,L,\mu}^{1}$ . The Gaussians are then analytically projected into reciprocal space to yield $\Delta\theta_{A,\alpha,L}^{1}(k)$ . In reciprocal space,

[TABLE]

The above equation applies because the convolution $\Phi_{\alpha\beta}(k)$ has angular momentum number $l=0$ . Note that in this section, denoting a symbol as function of $k$ implies that it is the Fourier transform of the real-space function corresponding to that symbol. For example, $\Phi_{\alpha\beta}(k)$ is the Fourier transform of $\Phi_{\alpha\beta}(r)$ .

Inspired by the PAW method itself, we perform the following trick to approximate $\Delta F_{A,\beta}(\mathbf{r})$ on the FFT grid. We introduce two sets of functions, $\{g_{i}(\mathbf{r})\}$ and $\{h_{j}(\mathbf{r})\}$ . The $h_{j}(\mathbf{r})$ functions form a localized basis (i.e. each function vanishes for $r>r_{c}$ ) and are meant to fit the high-frequency components of $\Delta F_{A,\beta}(\mathbf{r})$ introduced by the all-electron density inside the core region. For the current implementation of these functions, we use differences between Gaussians and smooth polynomials that match their derivatives at $r_{c}$ . The other functions $g_{i}(\mathbf{r})$ are both localized and smooth, and are meant to augment $\Delta\theta^{1}_{A,\alpha}(\mathbf{r})$ to yield the correct values of $\Delta F_{A,\beta}(\mathbf{r})$ outside of the augmentation region. To do this, least-squares linear regression is used to fit the following function to $\Delta F_{A,\beta}(\mathbf{r})$ :

[TABLE]

where the coefficients $C_{i\alpha}$ and $D_{j\beta}$ are determined by the regression. The necessary integrals are actually performed in reciprocal space since Eq. 80 yields $\Delta F_{A,L,\beta}(k)$ rather than $\Delta F_{A,\beta}(\mathbf{r})$ . At first glance, this appears to be a complicated and expensive fitting problem, but by separating the basis functions into their individual angular momentum channels and doing a separate regression for each channel, it becomes tractable. The details for this procedure are described in Appendix B.1.

Because the functions $h_{j}(\mathbf{r})$ are localized, in the complete basis set limit $\Delta F_{A,\beta}(\mathbf{r})=\hat{y}^{2}_{A,\beta}(\mathbf{r})$ outside the augmentation region. Therefore, by defining

[TABLE]

one can compute

[TABLE]

which satisfies the condition that the pseudo-feature $\tilde{G}(\mathbf{r})$ is equal to the all-electron feature $G(\mathbf{r})$ outside the augmentation region (within the limit of complete basis sets).

Within the core regions, $\tilde{F}^{\text{aug}}_{\beta}(\mathbf{r})$ is projected onto the radial support grids using a set of projectors in a very similar formalism to PAW itself. This quantity will be called $\tilde{F}^{1,\text{aug}}_{A,\beta}(\mathbf{r})$ , in keeping with the PAW notation, and it takes the form

[TABLE]

In the above equations, $\tilde{f}_{i}^{A}(\mathbf{r})$ and $\tilde{p}_{j}^{A}(\mathbf{r})$ are sets of atom-centered functions satisfying $\innerproduct{\tilde{f}_{i}^{A}}{\tilde{p}_{j}^{A}}=\delta_{ij}$ , and $\tilde{p}_{j}^{A}(\mathbf{r})$ vanish for $r>r_{c}$ so that the numerical integral of Eq. 88 can be performed efficiently on the FFT grid points inside the cutoff sphere of atom $A$ . The symbol $\differential v$ represents the FFT grid volume element. This is directly analagous to the projector-partial wave pairs in PAW [62]. The construction of $\tilde{f}_{i}^{A}(\mathbf{r})$ and $\tilde{p}_{j}^{A}(\mathbf{r})$ is described in Appendix B.2. $\tilde{F}_{A,\beta}^{1,0}(\mathbf{r})$ is an approximate convolution of $\tilde{\theta}^{1}_{A,\alpha}(\mathbf{r})$ , and while arbitrary, it helps make the $\tilde{f}_{j}^{A}(\mathbf{r})$ expansion more efficient. Using this construction, the all-electron and pseudo-features on the radial support grids are, respectively,

[TABLE]

The exchange correlation energy is then

[TABLE]

where all three terms are semilocal functionals of the density and of the nonlocal features. The first term is evaluated on the FFT grid and the second two on the radial support grids on the atoms.

The use of nonlocal features introduces a few additional terms to the forces and stresses not present for semilocal DFT. The force terms are

[TABLE]

The stress terms are

[TABLE]

In the above equations, $\mathbf{F}^{\text{sl}}_{A}$ and $\sigma_{\mu\nu}^{\text{sl}}$ refer to the force and stress equations for semilocal functionals within the PAW formalism (see Section IIID of Ref. [66]).

There are several significant contributions to the computational cost of the above approach for evaluating $G(\mathbf{r})$ , which all scale linearly or quasi-linearly with system size. The first contribution comes from computing the Fourier transforms of $\tilde{\theta}_{\alpha}^{\text{aug}}(\mathbf{r})$ and $\tilde{F}^{\text{aug}}_{\beta}(\mathbf{r})$ (as well as the functional derivatives with respect to these quantities), which is necessary to compute the convolutions like those in Eq. 86 in reciprocal space. The cost of these operations scales as $\Theta(N_{\alpha}\,N_{\text{FFT}}\,\log(N_{\text{FFT}}))$ , where $N_{\alpha}$ is the number of kernel interpolation points in Eq. 57 and $N_{\text{FFT}}$ is the size of the FFT grid used for integrating the XC energy. The second contribution arises from computing the convolutions in Eq. 86 in reciprocal space, and scales as $\Theta(N_{\alpha}^{2}\,N_{\text{FFT}})$ . The third contribution comes from computing the PAW correction to $\tilde{\theta}_{\alpha}^{\text{aug}}(\mathbf{r})$ (the sum in Eq. 84). This and related terms scale as $\Theta(N_{\text{FFT,atom}}\,N_{\text{atom}}\,N_{\alpha})$ , where $N_{\text{FFT,atom}}$ is the number of FFT grid points inside the core region of an atom and $N_{\text{atom}}$ is the number of atoms. The last contribution consists of all the other PAW correction-related routines (e.g. computing $C_{i\alpha}^{A}$ ), which depend only on the density on the atomic radial grids and therefore scale linearly with $N_{\text{atom}}$ . The overall scaling with system size $N$ is then $\Theta(N\log(N))$ . Notably, none of the components of the algorithm scale with the number of k-points.

III Results and Discussion

Using Gaussian process regression, we fit each of the model types in Section II.3 (SL-GGA, NL-GGA, SL-MGGA, and NL-MGGA) to match the exact exchange energy on subsets of the GMTKN55 [10] and SOL62 [67, 33] databases, as well as a few dozen other training points. The GMTKN55 database contains 1505 main-group molecular reactions including covalent and non-covalent bond energies, isomerization energies, and barrier heights. The SOL62 database consists of the cohesive energies of 62 solids, including metals and non-metals. For all training systems, the reference exchange energy was determined by computing the exact exchange energy of the orbitals obtained from PBE calculations. We describe the training, validation, and test partitions, training data collection, and validation criteria in Appendices A.1-A.8.

In Section III.1, we demonstrate that the NL-MGGA is capable of learning the exchange functional more accurately than the other feature sets, and in Section III.2 we use 1 and 2-electron systems to provide intuition for why this is. Next, we fine-tune the accuracy and numerical stability of the NL-MGGA (Section III.3) and demonstrate that the resulting exchange functional can be used to match the accuracy of hybrid DFT (Section III.4).

Having designed an accurate exchange functional, we demonstrate that it can be used to improve band gap predictions over standard meta-GGAs in Section III.5. In Section III.6, we show that for large systems in plane-wave DFT, our implementation of the NL-MGGA form is similarly efficient to semilocal meta-GGAs and orders of magnitude more efficient than hybrid DFT. Lastly, in Section III.7, we take advantage of our new functional’s efficiency and accurate band gap predictions to compute charge transition levels of point defects in silicon, which is a notoriously challenging task for DFT due to the underestimation of band gaps and the large supercell sizes required to simulate isolated point defects [6].

III.1 Accuracy of Different Model Types for Molecular and Solid-state Benchmarks

We assess the accuracy of our exchange functionals by substituting them for exact exchange in hybrid functionals to create a “surrogate hybrid” functional. For example, one of the simplest and most popular hybrid functionals is PBE0 [68], which mixes a 1/4 fraction of exact (Hartree-Fock, HF) exchange into the PBE functional [23]:

[TABLE]

In this section we perform so-called PBE0/CIDER calculations, where the exchange-correlation energy is

[TABLE]

One can also generalize PBE0/CIDER with different fractions of CIDER exchange, giving the PBE0( $\alpha$ )/CIDER form, which is used in later sections:

[TABLE]

All ML functional results in this section refer to the PBE0/CIDER surrogate hybrid form. Because our exchange functionals are trained on exact exchange energies, they are “perfect” if they exactly reproduce the predicted reaction energies of PBE0 on the test set. To measure the error of the exchange functionals, we calculate the average error between PBE0/CIDER and PBE0 for each functional type on GMTKN55 and SOL62. All calculations on GMTKN55 are SCF calculations, but the SOL62 cohesive energy CIDER and PBE0 calculations are performed non-self-consistently with PBE orbitals. See Appendices A.2 and A.3 for details.

The average deviations of the machine-learned functionals from PBE0, using the mean of means (MoM) metric (i.e. the mean of the mean absolute deviations, or MADs, for the subdatabases of GMTKN55), are presented in Fig. 1. Results for all functionals include D4 dispersion corrections [69], with the PBE0/CIDER functionals using the PBE0 dispersion parameters. The nonempirical PBE and r2SCAN [70] functionals, while not designed to reproduce PBE0, are included in Fig. 1 to show typical deviations between GGAs, meta-GGAs, and hybrids.

While the trends vary slightly in the different subsets of GMTKN55, the key trend is that the quality of the fit to exact exchange improves going from SL-GGA to NL-GGA to SL-MGGA to NL-MGGA. Notably, the SL-GGA and NL-GGA suffer from out-of-distribution transferability issues, resulting in larger deviations from PBE0 even compared to the PBE functional on the test set. On the other hand, the ML meta-GGAs are more transferable outside the training data than the ML GGAs, so they match PBE0 more closely than PBE and r2SCAN on the test set. The NL-MGGA is more accurate than the SL-MGGA, achieving an MoM of less than 1 kcal/mol on the test set. The fact that the NL-MGGA outperforms both the NL-GGA and SL-MGGA suggests that the kinetic energy and nonlocal features provide non-redundant information, meaning that nonlocal density features cannot replace the kinetic energy density nor vice versa. As shown in Fig. 2, the same general trends apply for SOL62; the NL-MGGA is the most accurate and transferable ML functional for reproducing PBE0, followed by the SL-MGGA and then the NL-GGA.

Because GMTKN55 contains a variety of properties and systems, the performance of the functionals is somewhat dependent on the choice of error metric. However, the general trends are the same as for the MoM metric. See Supplemental Material Section S2.1 for a discussion of the impact of error metric on the results.

III.2 Impact of Nonlocality and Derivative Discontinuity on Model Accuracy

Overall, the SL-MGGA model is more accurate than the NL-GGA model across the databases tested here. However, the fact that NL-GGA improves over SL-GGA and that NL-MGGA improves over SL-MGGA suggests that the nonlocal features contribute useful, non-redundant information about the density for exchange energy prediction. This raises the question: What types of systems require the nonlocal features, and which require the semilocal orbital dependence of the kinetic energy density? We hypothesize that the derivative discontinuity of the exchange functional [71] plays a key role. Because only the kinetic energy density—and not the nonlocal features of the NL-GGA—includes a derivative discontinuity passing through integer electron number, the NL-GGA (as well as any smooth, pure density functional) cannot be expected to accurately capture energy differences involving a change in electron number, such as ionization energies and electron affinities. On the other hand, existing semilocal meta-GGAs are fundamentally limited by their locality and cannot be expected to improve over GGAs for energy differences between systems of like electron number. This is especially true of one-electron systems, for which the kinetic energy is simply determined by the density gradient.

To explore this hypothesis, we set up two simple sets of systems. The first set is one-electron systems for different arrangements of protons: the hydrogen atom, the \cfH2+ molecular ion, the linear \cfH3^2+ molecular ion, and the triangular \cfH3^2+ molecular ion, all with bond lengths of 0.7 Å, as illustrated in Fig. 3. Because each system contains only one electron with different degrees of delocalization, we expect that only the functionals with nonlocal features can accurately predict energy differences between these systems. Table 1 shows the MAD between the ML exchange functionals and Hartree-Fock for the energy differences between the hydrogen atom and the molecular ions. As expected, the SL-GGA and SL-MGGA are both less accurate than NL-GGA and NL-MGGA. This confirms that the nonlocal features successfully characterize single-electron delocalization in a way that is not feasible with semilocal functionals.

The second set of systems consists of the \cfHe+ ion, the \cfHe singlet (closed-shell ground state), and the \cfHe triplet (open-shell excited spin state with two same-spin electrons). Because the exchange energy has a discontinuous derivative as the number of electrons of a given spin passes through an integer [72, 71], the energy differences between \cfHe+ and \cfHe triplet and between \cfHe singlet and \cfHe triplet can only be described accurately by functionals that capture this derivative discontinuity. As shown in Table 1, the absolute deviations from Hartree Fock for the energy differences between these three systems are larger for SL-GGA and NL-GGA than for SL-MGGA and NL-MGGA. This indicates that the derivative discontinuity included in the kinetic energy density is essential for capturing energy differences related to electron addition/removal. Notably, the NL-MGGA is the most accurate of the four functional types for both datasets.

These observations regarding the role of the derivative discontinuity explain why a previous attempt at machine learning a GGA resulted in more accurate atomization energies but less accurate ionization potentials than PBE [31]. Likewise, the role of nonlocality explains why Kovács et al. [13] found that existing meta-GGAs, as well as 25 meta-GGAs they empirically trained, suffer from a trade-off between cohesive energy accuracy and band gap accuracy. Meta-GGAs are limited in their ability to describe both these properties at once due their semi-locality. In order to learn a nonlocal, orbital-dependent quantity like the exchange energy, an ML functional must contain both orbital dependence and nonlocality in its feature set.

III.3 Fine-tuning for Numerical Stability

Because XC functionals are used within SCF calculations, they must be numerically stable and consistently achieve electronic convergence on real systems to be useful for practical applications. For the GMTKN55 dataset, we found that some systems do not quite converge to our default convergence thresholds, which were $10^{-8}$ Hartree atomic unit (Eh) for the total energy and $10^{-4}$ Eh for the orbital gradient of the energy. (See Appendix A.9 for more details on how we remedied convergence problems.) We found that these calculations could only be converged with a $10^{-7}$ Eh energy threshold and $2\times 10^{-3}$ Eh gradient threshold. Interestingly, this problem is least severe for the NL-MGGA, as illustrated in Table 2, which shows the number of systems for each functional type which did not converge to the $10^{-8}$ Eh energy and $10^{-4}$ Eh gradient thresholds. We believe this is because the increased model dimensionality of the NL-MGGA allows it to fit the exchange energy with a smoother function of the input features.

Because the NL-MGGA still sees some convergence problems, we decided to explore two modifications of the NL-MGGA for improved numerical stability. The first is the NL-MGGA-PBE, which has the same hyperparameters and feature shape as the NL-MGGA but uses the PBE functional as a baseline rather than the Chachiyo functional [73]. NL-MGGA-PBE was screened during the validation procedure (Appendix A.8) but was found to be slightly less accurate than the NL-MGGA with the Chachiyo baseline. However, only 1 system sees a convergence issue in Table 2 with NL-MGGA-PBE, compared to 3 with NL-MGGA. We also tried training an NL-MGGA model with the PBE baseline and a more chemically diverse subset of GMTKN55, as described in Appendix A.5. We refer to this new train/validation/test partition as DTR (for “diverse training”), and it yields the NL-MGGA-DTR, for which all GMTKN55 systems achieve an orbital gradient convergence of $10^{-4}$ Eh or lower. In fact, all but 10 of the 2462 GMTKN55 systems converge to within $10^{-9}$ Eh for energy and $10^{-4.5}$ Eh for orbital gradient with NL-MGGA-DTR, and the rest to within $10^{-8}$ and $10^{-4}$ Eh, respectively.

As shown in Fig. 4, which displays MoM errors of PBE0/CIDER versus PBE0, the accuracy of these three NL-MGGA variants is similar across the GMTKN55 dataset. The three functional variants also have nearly identical performance for the SOL62 cohesive energies, with MADs compared to PBE0 of 0.13 eV/atom for all three. Therefore, for the remainder of this work, we adopt the NL-MGGA-DTR model as our primary functional of choice for applications, and we name it CIDER23X-NL-MGGA-DTR.

III.4 Obtaining Hybrid DFT Accuracy

In this section, we show that by substituting NL-MGGA-DTR for exact exchange in hybrid functionals, accurate predictions of the XC energy can be obtained for molecular systems without additional parameter tuning. In particular, we look at PBE0/CIDER (Eq. 97) and PW6B95/CIDER. The PW6B95 functional [74] is

[TABLE]

where $E_{\text{xc,sl}}^{\text{PW6B95}}$ is a meta-GGA term. Therefore, PW6B95/CIDER is

[TABLE]

For PBE0/NL-MGGA-DTR, PW6B95/NL-MGGA-DTR, and several semilocal [23, 75, 70, 76] and hybrid [27, 68, 74, 28, 77] functionals, Fig. 5 shows the MoM error on all of GMTKN55 as well as the mean absolute error (MAE) on the test partition used for NL-MGGA-DTR. The errors are relative to the high-accuracy benchmark data collected in the GMTKN55 database [10]. All functionals include D4 dispersion corrections [69] except for B97M-V [76], $\omega$ B97X-V [28], and $\omega$ B97M-V [77], which all use VV10 dispersion [53]. PBE0/CIDER calculations use PBE0 dispersion parameters, and PW6B95/CIDER calculations use PW6B95 dispersion parameters. We show conventional semilocal functionals in shades of blue, conventional hybrid functionals in orange, and CIDER functionals in green.

As Fig. 5 shows, the PBE0/NL-MGGA-DTR functional is comparable in accuracy to r2SCAN, but the combination of systematic errors in the NL-MGGA-DTR functional and in PBE0 itself prevent it from outperforming the most accurate meta-GGA (B97M-V) or any of the hybrid functionals. On the other hand, because PW6B95 is much more accurate than PBE0 on the GMTKN55 dataset, PW6B95/NL-MGGA-DTR is more accurate than all semilocal functionals explored here. It is also more accurate than the PBE0 and B3LYP [27] hybrid functionals and equivalent in accuracy to the range-separated hybrid $\omega$ B97X-V [28]. This shows that the NL-MGGA-DTR exchange functional can be used to calculate molecular reaction energies with an accuracy that was previously not feasible without exact exchange mixing.

As with Section III.1, the analysis of the errors on GMTKN55 depends somewhat on the error metric used. These nuances are discussed in further detail in Supplemental Material Section S2.2, which also provides a comparison between the nonlocal and semilocal meta-GGA models.

III.5 Using CIDER for Improved Band Gap Prediction

The band gap problem [78, 79], in which semilocal DFT tends to drastically underestimate the band gaps of solids, stands in the way of a variety of important applications like high-throughput screening for optical and electronic properties and semiconductor point defect physics [5, 6, 80]. By introducing a derivative discontinuity into the XC functional, meta-GGAs partly mitigate this problem [81], but they still tend to significantly underestimate band gaps [82]. Exceptions include mBJ [83] and TASK [12], which each have important drawbacks. The mBJ functional provides the exchange potential only, not the energy [83], and TASK does not predict atomization energies and lattice constants as well as SCAN [12, 13, 14]. This reinforces the discussion in Section III.2 regarding the accuracy trade-offs meta-GGAs must make due to their semi-locality.

By introducing nonlocality into the exchange energy, PBE0/NL-MGGA-DTR (Eq. 97 or Eq. 98 with $\alpha=0.25$ ) improves band gap predictions over PBE and r2SCAN, as shown in Table 3. This improvement occurs even though the CIDER functionals are not fit to solid-state band gaps. The PBE0/SL-MGGA also improves band gap predictions over r2SCAN, but significantly less so than PBE0/NL-MGGA-DTR. The database used for testing covers 255 of the 472 band gaps in the database published by Borlido et al. [42], with La, Yb, and Th-containing systems being excluded as well as systems with a greater than $0.05$ eV difference between the spin-orbit-corrected and non-spin-orbit-corrected band gaps (as computed with PBE by Borlido et al. [42]). See Appendix A.11 for details, and see Supplemental Material Section S3 for error metrics for all 453 solids in the database that do not contain La, Yb, or Th.

To provide a clearer picture of the improved band gap prediction provided by the NL-MGGA-DTR functional, Table 4 shows the band gap predictions for a selected subset of systems that are commonly studied [11]. In addition, Figs. 7 and 7 contain band structures for silicon (Si) and boron phosphide (BP), respectively, which show that PBE0/NL-MGGA-DTR provides similar band structures to PBE and r2SCAN but with a wider band gap.

There is room for improvement in these band gaps, since PBE0/NL-MGGA-DTR still systematically underestimates band gaps compared to the state-of-the-art HSE06 [84, 85] range-separated hybrid functional, as well as the PBE0 functional on which PBE0/CIDER is built. One possible solution is to tune the fraction of CIDER exchange, which is already done with exact exchange within hybrid DFT to optimize functional accuracy for different applications [86]. As shown in Table 3, PBE0( $\alpha$ )/NL-MGGA-DTR with $\alpha=0.35$ has increased band gaps over $\alpha=0.25$ and similar error statistics to HSE06. However, one drawback of this approach is that because NL-MGGA-DTR increases band gaps less than exact exchange for a given mixing fraction, one might need a large fraction of NL-MGGA-DTR to fit band gaps. This large fraction might not provide accurate predictions of other material properties. Alternative approaches to improve CIDER band gaps include using physical intuition and exact constraints to tune the derivative discontinuity contribution from the kinetic energy density (as is done in the TASK functional [12]), introducing more nonlocal features to improve the accuracy of the functional in general, and training on fundamental band gaps (i.e. differences between ionization potentials and electron affinities). However, considering the drastic increase in computational cost associated with using HSE06 or PBE0 compared to semilocal DFT, the ability to mitigate the underestimation of band gaps without explicitly fitting them—and while accurately predicting other properties like bond energies—already serves as a major step toward solving the band gap problem.

III.6 Benchmarking Performance on Large Condensed Matter Systems

The key benefit of the nonlocal features used to learn the exchange functional is that the quasi-linear-scaling algorithm used to compute them within plane-wave DFT (Sections II.4 and II.6) is over an order of magnitude faster than the evaluation of exact exchange for large systems and does not depend on the number of k-points. To illustrate this efficiency, Fig. 8a shows the wall time to perform a ground-state SCF calculation for a 216-atom diamond supercell with semilocal and PBE0/CIDER functionals in GPAW [60, 61]. These wall times are compared to those of the hybrid DFT implementation in JDFTx [87], which uses the ACE algorithm [17]. The CIDER functionals, including those with nonlocal features, all take less than 170 seconds (compared to 60 seconds for PBE and 79 seconds for r2SCAN). When run on the same number of processors, PBE0 is roughly 700 times slower than the NL-MGGA-DTR functional. Even the GPU-accelerated evaluation of PBE0 is about 40 times slower than NL-MGGA-DTR.

The efficiency of CIDER functionals for large systems makes it practical to perform more complicated calculations like defect formation energies. Figure 8b shows the wall time for computing the neutral vacancy formation energies in diamond and silicon using 216-atom supercells for semilocal and PBE0/CIDER functionals. All of the functionals are of comparable computational cost, with the NL-MGGA-DTR model being only 25% slower than r2SCAN for diamond and 50% slower for silicon. This small decrease in efficiency comes with a significant benefit: As shown in Table 5, of the functionals tested, the NL-MGGA model and its variants give the closest match to PBE0 for predicting these vacancy formation energies, with PBE0/NL-MGGA-DTR giving deviations of 0.18 eV and 0.23 eV for diamond for silicon, respectively. The HSE06 and PBE0 formation energies were computed using VASP [88, 89, 90]. See Appendix A.12 for further calculation details.

III.7 Charged Defect Transition Levels in Silicon

One of the consequences of the band gap problem is that transition levels in charged defects are not accurately described by semilocal DFT, so hybrid DFT is typically used when accurate transition level predictions are needed. In this section, we show that by improving band gap prediction, CIDER is able to accurately predict defect transition levels in silicon. The transition levels associated with the single vacancy [92], as well as the boron [93], phosphorus [94], copper [95], and sulfur [96, 97] substitutionals, were computed as described in Appendix A.13, using a 512-atom supercell, $2\times 2\times 2$ $\Gamma$ -centered k-point mesh, and full structural relaxation for each defect and functional.

Figure 9 shows the computed transition levels compared to experiment with two different finite-size correction schemes: the Freysoldt finite-size corrections [91] (Fig. 9a) and the potential alignment correction (Fig. 9b). We computed the transition levels with PBE and PBE0(0.3)/NL-MGGA-DTR (Eq. 98 with $\alpha=0.3$ ). The latter functional was chosen because its band gap matches the 1.17 eV experimental gap; however, PBE0/NL-MGGA-DTR yields similar results (see Supplemental Material Section S7). Supplemental Table S13 provides a more detailed list of all computed transition levels. In the rest of this section, PBE0(0.3)/NL-MGGA-DTR is referred to simply as CIDER for brevity.

PBE and CIDER both give reasonable descriptions of the ++/0 transition level in the silicon vacancy and the boron acceptor level. However, the phosphorus donor level is not well-described by PBE due to its severe underestimation of the band gap; CIDER remedies the band gap underestimation and therefore predicts a transition level in good agreement with experiment. To correct for band gap underestimation, one could predict defect levels from PBE by scaling the gap (and with it, the transition levels) to the experimental gap, but this makes it difficult to distinguish true shallow levels from deep levels that happen to sit near the (too-low) PBE conduction band edge [5]. For example, PBE predicts a transition level of 0.66 eV for the -/– transition of $\text{Cu}_{\text{Si}}$ (Fig. 9a), 0.05 eV above the conduction band minimum. Without further information, it is unclear whether this state is resonant in the conduction band or localized in the gap. CIDER, on the other hand, places the -/– level inside the gap at 0.97 eV, in good agreement with the 1.00 eV level from experiment.

More generally, Fig. 9a shows that paired with the Freysoldt correction, CIDER matches each level in the figure with an error of 0.10 eV or less compared to experiment, with the exception of the +/0 level in $\text{Cu}_{\text{Si}}$ . For this transition, the experimental level is 0.20 eV [95], while the CIDER level is 0.02 eV with the Freysoldt correction and 0.07 eV with the potential alignment correction only. The HSE06 level was previously reported to be 0.21 eV by Sharan et al. [98], in good agreement with experiment. However, Sharan et al. used a 64-atom supercell with a $2\times 2\times 2$ Monkhorst-Pack k-point mesh. This supercell size is relatively small, and the k-point mesh is both small and fails to sample the band edges, which can interfere with the description of relatively delocalized defect states. When we recalculated the CIDER transition level using Sharan et al.’s supercell size and k-point mesh, we obtained 0.18 eV with Freysoldt corrections, in good agreement with their result and with experiment. In the Supplemental Material (Section S7.1), we provide further evidence that the HSE06 level itself is not converged with respect to k-point mesh. This finding illustrates the importance of using well-converged supercell sizes and k-point meshes in point defect calculations, a requirement that is made much easier by the efficiency of the CIDER functionals compared to hybrid DFT. Why this particular transition level is underestimated by both hybrid DFT and CIDER is an interesting question warranting future investigation.

In addition to yielding more accurate transition levels, CIDER predicts somewhat different charge distributions than PBE as well. Figure 10 shows the charge density differences between charge states for several of the transition levels in Fig. 9. There are a few notable takeaways from these plots. First, as expected, the charge density differences are larger for transitions that result in large ionic displacements (as indicated by the MaxDisp in the bottom of each plot). Second, these ionic displacements and resulting charge distortions are noticeable more than 5 Å away from the defect site for some transitions, indicating the importance of large supercells. Lastly, a slightly larger fraction of the charge associated with the transition is contained within a 5 Å sphere around the defect site for CIDER than for PBE (as indicated by the Loc. metric in the bottom of each plot), suggesting that CIDER localizes the defect charge more effectively than PBE. This is reassuring because unphysical charge delocalization is one of the key driving factors of erroneous charge transition level predictions in semiconductor point defects [5].

In summary, our results indicate that the CIDER approach could be a route toward high-accuracy defect calculations at similar cost to semilocal DFT, which would significantly broaden the potential scope of point defects research. In particular, it could be used to improve the accuracy of high-throughput point defects studies [99], which are currently only computationally feasible with semilocal functionals like PBE.

IV Conclusion

We have demonstrated that by combining semi-local orbital-dependent and non-local density-dependent features, it is possible to design functionals with comparable accuracy to hybrid DFT at a drastically reduced computational cost for plane-wave DFT calculations. We can do this by learning exact exchange energies, without the use of any high-accuracy reference data from experiments or wave function theory. We have shown that our model has higher accuracy than semi-local functionals on both molecular benchmarks and solid-state band gap prediction, and that these improvements can assist with applications such as semiconductor point defect calculations.

The developments presented here serve a variety of purposes. As is, the exchange functional we designed can be used as an exact exchange surrogate in scenarios where hybrid DFT is impractical, such as large extended systems. This expands the scope of calculations that can be performed at hybrid DFT accuracy. In addition, the techniques we have introduced to learn smooth, physically constrained functionals and implement efficient nonlocal features could be used in other contexts, such as orbital-free DFT [100] and van der Waals functional design [52, 63, 101].

Importantly, the ability to systematically vary the descriptor complexity in our machine learning approach allows us to extract physically nontrivial information about the nature of the exchange interactions. Specifically, we find that both nonlocality and derivative discontinuity are independently important for the structure of the exchange energy density functional. This clarifies a key limitation of existing semilocal functionals.

Lastly, our approach can be generalized to learn the full XC functional, providing a path to highly transferable and universal functionals capable of describing challenging and heterogeneous chemistry. We anticipate that these developments will contribute to significantly improving the cost-accuracy trade-off of DFT, accelerating the pace and efficacy of materials and chemical research by providing a stronger theoretical backing in various sub-fields.

Appendix A Computational Details

A.1 Overview

To explore the effectiveness of different features for learning the exchange energy, four types of functional were trained, with the model and feature vector details provided in Section II.3:

•

SL-GGA: A semilocal GGA model based on the reduced squared gradient $s^{2}$ (Eq. 7).

•

NL-GGA: A nonlocal model based on $s^{2}$ and three nonlocal features with Eqs. 18 and 19 for the exponents.

•

SL-MGGA: A semilocal meta-GGA model based on $s^{2}$ and the iso-orbital indicator $\alpha$ (Eq. 9).

•

NL-MGGA: A nonlocal model based on $s^{2}$ , $\alpha$ , and three nonlocal features with Eqs. 20 and 21 for the exponents.

The purpose of training semilocal GGA and meta-GGA models is to test the expressive power of the nonlocal descriptors. In our previous work (see Table S1 of Ref. [35]), we found that a standard, semilocal meta-GGA could not effectively learn the exchange energy density, but in this work, we determined that a semilocal meta-GGA can learn the total exchange energy much more effectively than it can learn the exchange energy density.

The two primary datasets used in this work were the GMTKN55 [10] and SOL62 [67, 33] databases. The GMTKN55 database contains 1505 molecular reaction energies based on 2462 single-point calculations, covering five categories: small-molecule properties, larger molecule reactions and isomerization, barrier heights, intermolecular noncovalent interactions, and intramolecular noncovalent interactions. The SOL62 dataset consists of 42 non-metals and 20 metals as categorized by Zhang et al. [67]. We follow Trepte and Voss [33] in excluding Pb and Th from Zhang et al.’s initial set of 64 solids. Twelve subsets of GMTKN55 and half of SOL62, along with a few other systems, were used for training and validation, as detailed in Section A.4.

Throughout this work, we used pymatgen [102] and ase [103] for structure manipulation and calculation setup, fireworks [104] for automated workflow management, and Scikit-learn [105] for the Gaussian process models. The computational details for the PySCF and GPAW calculations are provided in Appendices A.2 and A.3, respectively. Appendices A.4 and A.5 describe the training and validation set selection, and Appendix A.6 covers the selection of control points for the sparse Gaussian processes. Appendix A.7 describes the selection of noise hyperparameters and training/validation loss functions, and Appendix A.8 describes the list of hyperparameters tested for each model and the model selection procedure. Appendix A.9 describes how we handled convergence issues for CIDER models on GMTKN55, and Appendix A.10 describes the PySCF and GPAW settings that are specific to the CIDER functionals. Appendix A.11 describes the band gap benchmark methods. Appendix A.12 describes the computational cost benchmarking of CIDER for plane-wave DFT calculations. Finally, Appendix A.13 describes our methodology for computing charged defect transition levels.

A.2 Molecular Calculations and Reference Data Generation

Molecular systems, including the nanoclusters, were computed using the PySCF code [54, 55]. All calculations were performed using the def2-QZVPPD basis set (with effective core potentials for atoms larger than Kr) [106, 107] for the atomic orbitals and the def2-universal-jkfit auxiliary basis set [108] for the classical Coulomb energy. The seminumerical exchange (SGX) module in PySCF, which is similar to the COSX algorithm [109], was used to compute the exchange energy for hybrid functionals. For SGX, level 1 PySCF grids were used with P-junction screening turned off. For a given sub-database in GMTKN55, restricted DFT was used if all systems in that sub-database had $S_{z}=0$ ; otherwise, unrestricted DFT was used. The exception was the \cfPCl_3 transition state in the INV24 database, for which spin symmetry breaking occurs for some functionals. This system was calculated with unrestricted DFT as well. For SCF calculations, the nonlocal features and XC energy were computed on level 3 PySCF grids.

For benchmarking comparison, the GMTKN55 database was evaluated with the above settings using the PBE [23], revPBE [75], r2SCAN [70], PBE0 [68], B3LYP [27], and PW6B95 [74] functionals, all with D4 dispersion corrections [69], as well as with the B97M-V [76], $\omega$ B97X-V [28], and $\omega$ B97M-V [77] functionals.

To collect the exact exchange energy training data, the ground-state orbitals from the PBE calculations were used to compute the exact exchange energy, and the feature vectors were computed from the corresponding PBE densities. The features were computed on level 1 PySCF grids. For training set feature generation only (not the SCF calculations), the integral in Eq. 16 was computed numerically with the $\mathbf{r}_{2}$ coordinate evaluated on level 3 PySCF grids without pruning. This dense grid is necessary for numerical stability and precision in the core region.

A.3 Solid-State Reference Data Generation

The solid-state densities and ground-state energies of the SOL62 dataset were computed in GPAW [60, 61] using the PBE geometries obtained by Trepte and Voss [33], with an energy cutoff of 1000 eV and the same k-point meshes as Trepte and Voss. The $h$ parameter in GPAW was set to the minimum value required to avoid aliasing of the density, i.e. 0.096 Å. Fermi-Dirac smearing with a width of 0.01 eV was used for periodic systems. For isolated atoms, fixed occupations were used unless numerical instabilities occurred as a result, which was the case for six atoms (Pb-2, Pd-2, Pt-2, Sn-2, Ta-5, and V-5, where the number is the ground-state magnetic moment $2S$ ). In these cases, Fermi-Dirac smearing with a width of 0.002 eV was used. The default Davidson solver was used for all solids except the alkaline earth metals, which were solved with the conjugate gradient (CG) method. The CG method was used for all isolated atomic systems. To compute cohesive energies for training, the experimental ground state magnetic moments were used for the isolated atom references. The convergence threshold was set to $10^{-6}$ eV per valence electron for both isolated atoms and periodic systems. The density convergence threshold was set to $10^{-4}$ per electron for periodic systems and $10^{-3}$ per electron for isolated atoms. The eigenstates convergence threshold was set to $4\times 10^{-8}$ eV2/electron for periodic systems and $10^{-5}$ eV2/electron for isolated atoms. The cell size for the isolated atoms was chosen to be about 11.7 Å across, with perturbations included to break symmetry. Compared to a larger box size of 17.5 Å, the PBE atomic energies had a root mean square deviation of 0.012 eV and maximum deviation of 0.080 eV (for Zr). Because of the small average deviation, and the fact that the train and test reference values were computed on the atom-in-a-box systems rather than determined by experiment, the smaller boxes were used to lower the computational cost and memory requirements of training and testing. Symmetry constraints were also turned off for the isolated atom calculations.

Training the CIDER models to the solid-state data required computing the exact exchange energy and the nonlocal features for each system. To obtain exact exchange energies, the hybrid module of GPAW was used to compute the non-self-consistent exact exchange energy (EXX) from the PBE orbitals obtained in the above calculations. However, because the k-point meshes above are too dense for evaluation of EXX to be computationally feasible, the periodic systems were first recomputed with a coarser k-point mesh (for each system, the smallest even, $\Gamma$ -centered mesh with a linear k-point density of at least 4.5 k-points per Å*-1*), and the EXX was computed on this mesh. All nonlocal density features were evaluated on the density and orbitals obtained from the denser k-point mesh. They were evaluated with the internal CIDER settings $q_{\text{max}}=300$ and $\lambda=1.8$ (see Section A.10 for details).

The PBE0 cohesive energies were computed using the same procedure above, with the PBE0 energies being defined as

[TABLE]

with dense denoting the calculation done with the dense k-point mesh and coarse denoting the coarser k-point mesh discussed above.

A.4 Construction of Training and Validation Sets

The total exchange energy of a molecule is highly basis-set and code-dependent, and it is dominated by the core contributions for large atoms. In practice, only relative exchange energies between systems is relevant for most chemistry and materials science applications. Therefore, rather than training the model to match total exchange energies, we trained to match the exchange energy differences between systems. For example, the target and predicted values for the atomization energy of water would be

[TABLE]

where $E_{\text{x}}^{\text{ex}}[n]$ and $\hat{E}_{\text{x}}[n]$ are the exact and model exchange energies for density distribution $n(\mathbf{r})$ , respectively.

The training and validation sets were constructed from four sources:

Molecules: The W4-11, BH76, BH76RC, MB16-43, S22, G21IP, PA26, RG18, ACONF, ALKBDE10, HEAVYSB11, and SIE4x4 subsets from the GMTKN55 database were used for training and validation. Each set was randomly split into two-thirds training and one-third validation, except for the BH76 and BH76RC subsets. Because the data points in these subsets are based on the same systems, and also because they include the forward and backward reactions for the same chemical equation, a different splitting procedure was chosen to maintain train-validation independence. First, the 34 unique transition states in BH76 were identified and split half-and-half into the training and validation sets. All 40 reactions with the training transition states were used for training, and all 36 reactions with the validation transition states were used for validation. All 17 BH76RC reactions with systems in the BH76 training partition were used for training, and the other 13 reactions were used for validation. All data not in the training or validation set was classified as the GMTKN55 test set.

Bulk solids: Half of the SOL62 cohesive energy database (11 randomly selected elemental metals and 20 randomly selected non-metals) was used for training and validation. Nine systems (C-diamond, Li, GaAs, MgO, LiF, ZrC, TiC, Pd, and Pt) were hand-selected for the training set to cover a range of chemical space, while the other 22 systems were used for validation. All data not in the training or validation set was classified as the SOL62 test set.

Nanoclusters: Six nanocluster atomization energies were added to the training set as a “bridge” between the isolated and extended systems in the study. These were \cfCa13 and \cfZn26, with the initial structure obtained based on the Materials Project [110] lattice constant and structure optimization at the PBE/def2-TZVP level in Orca [111, 112, 113]; \cfC35H36 (hydrogen-capped diamond), obtained from Refs. [114, 115] and also structurally optimized in Orca at the PBE/def2-TZVP level; and \cfNa4Cl4, \cf[Na13Cl14]-, and \cf[Na14Cl13]+, all using the Materials Project lattice constant without further structure optimization.

Atoms: Absolute and relative energies of the isolated atoms H-Kr and a few excited spin states were included in the training set. Specifically, the training points consisted of the absolute energies of the noble gases He, Ne, Ar, and Kr; the relative energies of the elements with nuclear charge $Z$ and $Z+1$ for $Z=\{1,...,35\}$ , and the relative energies of the following ground state/excited state spin pairs: Sc 1/3, Ti 2/4, V 3/5, Cr 6/4, and Ni 2/0. The preceding numbers correspond to the magnetic moment $2S$ .

In addition to fitting to the above training data, the uniform electron gas constraint was enforced as described in Section II.3.3, and the uniform scaling constraint was enforced by the scale-invariance of the input features. The ground-state total energies of the SOL62 solids and isolated atoms were computed with GPAW as described in Section A.3. The ground-state energies for all other training data were evaluated with PySCF as described in Section A.2. The selection of train and validation databases from GMTKN55 includes a wide variety of chemistries and properties. However, because training data is sampled disproportionately from small-molecule properties, there is a large distribution shift between the train/validation and test sets, with the test sets containing more large molecules, isomerization reactions, and barrier heights. The use of this train/test partition therefore serves as a strong test of transferability. The accuracy on SOL62 also strongly reflects the transferability of a model because only 9 solid-state systems were included in the training set, and SOL62 covers a wide variety of chemistries.

A.5 A More Diverse Training Set for Numerical Stability

To design a more diverse training set for the NL-MGGA-DTR functional, a kernel function was defined for the “covariance” between two training points in GMTKN55:

[TABLE]

where $\mathbf{x}$ is the vector of errors for 82 functionals benchmarked on the GMTKN55 database by Goerigk et al. [10] with D3(BJ) corrections [116, 117, 118], normalized so that the mean square error across all functionals for that individual data point is 1. This kernel essentially provides a larger similarity score between data points for which the (normalized) errors tend to be similar for a given functional. The kernel matrix was constructed for the entire GMTKN55 database and then factorized using pivoted Cholesky decomposition (Eq. 105), and then the first 225 pivots were used for the training set, the next 75 pivots for the validation set, and the remaining 1205 pivots for the test set. This procedure constructs a training set for which the typical errors of conventional functionals are representative of the entire dataset. With this dataset partitioning, more than 99% of the GMTKN55 test set data points contain at least one system not present in the train or validation sets, so the test set is still sufficiently independent to assess the transferability of the functionals to new systems. All training points not contained in GMTKN55 were kept the same as in Section A.4.

A.6 Control Point Selection for Sparse Gaussian processes

In Eqs. 33 and 34, a set of control points $\{\mathbf{\tilde{x}}_{a}\}$ is required to evaluate the Gaussian process covariance kernel in the sparse approximation. These control points were initially selected by randomly sampling feature vectors from the PySCF grids for the training set (Appendix A.2). All training data was sampled except the SOL62 data, since the features for these systems were evaluated in GPAW.

Due to the size of the numerical grids from which the features were sampled, there were a large number of potential control points for the sparse Gaussian process, many of which were nearly linearly dependent in the implicit feature space of the covariance kernel function. To reduce the size of the control point set, a Cholesky decomposition with pivoting [119] was performed on an initial sample size of roughly $10^{4}$ feature vector samples:

[TABLE]

where $\mathbf{P}$ is a permutation matrix that orders the sampled feature vectors in decreasing order of their variance conditioned on the preceding feature vectors. We performed the decomposition with a tolerance of $10^{-5}$ and only kept control points corresponding to significant pivot indices, thereby reducing the number of control points from roughly $10^{4}$ to $10^{2}$ .

A.7 Selection of Noise Hyperparameters, Training Loss, and Validation Loss

Chemical and material properties have a wide range of magnitudes, from less than 1 kcal/mol for some van der Waals binding energies to greater than 100 kcal/mol for many covalent bond dissociation energies. Correspondingly, different magnitudes of XC errors are expected and acceptable for these properties; an error of 1 kcal/mol might make a van der Waals binding energy prediction useless but be acceptable for an atomization energy. To train the exchange functionals, it was therefore necessary to weight the noise hyperparameter of each training point to reflect the feasible and desired precision for the property of interest. To do so, we took the MAD of four hybrid functionals (PBE0, B3LYP, PW6B95, and $\omega$ B97X-V) on each sub-database (denoted $\text{MAD}^{\text{hybrid}}_{\text{DB}}$ , where DB is the sub-database of interest), and then created characteristic uncertainties for each sub-database as follows (in Hartree atomic units):

[TABLE]

where W4-11 is a small-molecule atomization energy sub-database in GMTKN55 and $\sigma_{0}=0.03$ Eh. One drawback of the above approach is that it favors high accuracy on systems that are already described well by conventional, dispersion-corrected DFT (like simple van der Waals dimers) and assumes a large uncertainty on systems like self-interaction-dominated bond energies, for which even hybrid functionals with small exchange mixing perform poorly. To remedy this, the noise parameter was augmented with a constant weighting for every system:

[TABLE]

With this definition, the matrix $\Sigma_{\text{noise}}$ in Eq. 39 is a diagonal matrix with $\left(\Sigma_{\text{noise}}\right)_{ii}$ being the $\sigma_{\text{DB}}$ for the database to which training point $i$ belongs. Constructing the Gaussian process for this noise parameter is equivalent to regression with the loss function

[TABLE]

where MSD stands for mean squared deviation and $N_{\text{DB}}$ is the number of points in the subdatabase. This weighting ensures that training points with small expected errors are appropriately weighted while those with larger expected errors for conventional functionals do not become under-weighted. For systems outside the GMTKN55 database that were in the training set, $\tilde{\sigma}_{\text{DB}}$ was set by hand.

A.8 Hyperparameter Selection and Validation Procedure

The models tested in this work have several discrete and continuous hyperparameters to be optimized, which are listed below.

•

The covariance prefactor hyperparameter $\Sigma_{\text{Cov}}$ in eqs 52–55.

•

The kernel length-scale ( $l_{i}$ , where $i$ is the feature index) in eq 56.

•

For the nonlocal functionals only, the feature exponent parameters $B_{i}$ and $C_{i}$ in Eqs. 18–21.

•

The baseline exchange functional on top of which the CIDER model is trained via $\Delta$ -learning. We tested PBE [23] and the Chachiyo GGA [73].

It would be quite challenging and expensive to implement differentiation of the marginal likelihood with respect to the first two parameters, and nearly impossible for the third since backpropagation would need to be performed through the feature evaluation step. Therefore, we used various combinations of hyperparameters to train the models, and then for each functional type, we selected the best-performing model on the validation set as the final candidate functional.

To obtain a discrete set of exponent hyperparameters to test for the nonlocal functionals, we developed a reduced number of parameters to specify the feature exponents. Similarly to the previous CIDER work [35], a control parameter $A$ tunes $B_{2}$ and $C_{2}$ simultaneously, in two different schemes. The first scheme (S1) is

[TABLE]

The second scheme (S2) provides a different mixture of density-based and gradient/kinetic energy-based terms:

[TABLE]

In both schemes, the other constants are determined by $B_{2}$ and $C_{2}$ :

[TABLE]

where $D$ is an additional parameter to tune the $B_{0}$ and $C_{0}$ coefficients separately from the other coefficients. With these schemes, all 8 parameters $\{B_{i}\},\{C_{i}\}$ that define the three-feature CIDER model were reduced to two parameters per scheme. For each scheme, we performed a hyperparameter search over the grid $A=\{1/2,1,2,4,8\}$ and $D=\{1/2,1,2,4,8\}$ , resulting in a total of 50 feature constructions to test.

Because there were already 50 features from which to select the nonlocal functionals, we set the covariance and feature length-scales for these models heuristically:

[TABLE]

where $\expectationvalue{}_{\text{train}}$ is an expectation value over all the grid points in the exchange energy training set, $\Delta F_{\text{x}}$ is the deviation of the exact exchange enhancement factor at a given point from the baseline functional enhancement factor, and $R_{1}$ and $R_{2}$ are constants. $R_{1}$ and $R_{2}$ were set based on the criteria that the resulting functionals converge consistently in both GTO and plane-wave calculations and exhibit a small total energy deviation between the PAW and all-electron GTO atomization energies for a few small molecules. $R_{2}=1$ worked best for both the nonlocal GGA and meta-GGA, while the most practical values of $R_{1}$ depended on the baseline functional and model type, as shown in Table 6. The meta-GGA models required a smaller covariance prefactor to achieve good stability than the GGA models. We set $R_{1}$ to be larger by a factor of 20 for functionals with the Chachiyo baseline compared to the PBE baseline because $\expectationvalue{\Delta F_{\text{x}}^{2}}_{\text{train}}$ was roughly 20 times larger for PBE than Chachiyo.

For the semilocal functionals, there was no need to validate over the possible feature constructions, leaving more flexibility to test $R_{1}$ and $R_{2}$ systematically. For the functionals with Chachiyo baseline, we tested $R_{1}=\left\{1,2,4,8,16,32,64,128\right\}$ , while for the PBE baseline each $R_{1}$ was divided by 20 to give a roughly equivalent $\Sigma_{\text{Cov}}$ . For the semilocal functionals, $R_{2}=\left\{1,\frac{1}{2},\frac{1}{4}\right\}$ were tested, resulting in 24 functionals to validate. Note that for NL-MGGA-DTR, we determined from validation set SCF calculations that $R_{1}=0.1$ with the PBE baseline resulted in sufficiently stable models, so this larger value was used. In addition, we did not search over feature length-scale for NL-MGGA-DTR, and instead used the features determined to be most accurate for NL-MGGA and NL-MGGA-PBE.

After the functionals were trained, we selected the top 10 functionals of each type and baseline functional, as classified by non-self-consistent performance on the validation set via Eq. 108. To validate these functionals, the GMTKN55 train and validation set reaction energies were predicted using self-consistent calculations with a convergence threshold of $10^{-6}$ Eh and a gradient threshold of $2\times 10^{-3}$ Eh. For a given functional, if any calculation failed to converge when starting from both the default initial guess of PySCF and from the PBE orbitals, the functional was deemed insufficiently numerically stable and disqualified from further consideration. The non-self-consistent cohesive energies of the SOL62 validation set were also computed. Of the functionals for which all calculations converged, the validation loss consisted of

[TABLE]

where $\text{MSD}_{\text{DB/VAL}}^{\text{SCF}}$ is the self-consistent mean square deviation on the validation partition of GMTKN55 subset DB, and where $\text{MSD}_{\text{SOL62/VAL22}}^{\text{NSCF}}$ is the non-self-consistent mean square deviation of the 22 validation-set cohesive energies per atom. The final functional of each type was selected based on which candidate had the lowest loss for Eq. 115. The validation results and selected hyperparameters for the functionals are provided in the Supporting Information (Section S4).

A.9 Difficult Convergence Cases

The default convergence tolerance for the calculations performed with CIDER functionals was $10^{-8}$ Eh, except for the NL-MGGA-PBE and NL-MGGA-DTR calculations, which were converged to $10^{-9}$ Eh when possible. Some CIDER calculations did not converge to this tolerance within the default PySCF settings. There were two different classes of convergence problems. First, for a few systems, some CIDER calculations did not converge from the PySCF starting guess, but the convergence issues were resolved by starting from PBE ground-state orbitals. Second, for some systems (especially isolated atoms and ions, as well as alkali and alkaline earth-containing systems), a numerical stability issue was observed for some CIDER calculations, in which the energy converged to a reasonable tolerance (less than $10^{-7}$ Eh), but the orbital gradients oscillated around $10^{-3}$ Eh without converging fully. In these cases, the energy tolerance was set to $10^{-7}$ Eh, and the gradient tolerance was set to the (rather large) $2\times 10^{-3}$ Eh. All CIDER calculations converged to these thresholds. PBE starting orbitals were also used for these loose-threshold calculations. Even with these looser thresholds, the achieved energy and orbital gradient convergence is still sufficient to compare thermochemical data in the GMTKN55 database.

A.10 CIDER Functional Settings

The CIDER functionals require a few additional settings to specify the precision of the accelerated feature calculations. For the molecular version, these settings are $l_{\text{max}}$ , the maximum spherical harmonic for auxiliary basis expansion; $\lambda$ , in Eq. 58; $\beta^{\text{aux}}$ , the even-tempered Gaussian factor for the $\chi$ basis in Eq. 67; and $q_{\text{max}}$ , the maximum value of $q_{\alpha}$ . The even-tempered Gaussian factor for the $\xi$ basis in Eq. 71 was also set to $\lambda$ . We chose $l_{\text{max}}=10$ , $\lambda=\beta^{\text{aux}}=1.6$ , and $q_{\text{max}}=\frac{1000}{36}Z_{\text{max}}^{2}$ , where $Z_{\text{max}}$ is the minimum of 36 and the charge of the largest nucleus in the given GMTKN55 sub-database. These settings were found to be in good agreement with numerical integration of the features. The minimum control point $q_{0}$ was set automatically based on the minimum allowed value of the kernel exponent $a(\mathbf{r})$ .

For the plane-wave implementation, only $\lambda=1.8$ and $q_{\text{max}}=300$ needed to be specified. A larger $q_{\text{max,atom}}=10Z^{2}$ was used for each atom on the radial support grids, with $Z$ being the atomic number.

A.11 Band Gap Calculations

Band gaps for the database of Borlido et al. [42] were computed in GPAW [60, 61] as the difference between the conduction band minimum and valence band maximum for a ground-state SCF calculation. The energy cutoff for these calculations was 520 eV, and the grid spacing was set to $h=0.134$ Å to avoid any aliasing of the density. Fermi-Dirac k-point smearing was used with a width of 0.01 eV, except for the noble gas systems, for which no k-point smearing was used. The k-point mesh for each system was selected to be the smallest even, $\Gamma$ -centered mesh with a linear k-point density of at least 6 k-points per Å*-1*. For each system, if the PBE band gap we obtained was more than 0.1 eV larger than the PBE band gap obtained by Borlido et al. [42], we inferred that the k-point mesh was too coarse and reran the band gap calculation with a linear k-point density of 10 k-points per Å*-1*. Using this methodology, we reproduced the PBE band gaps of Borlido et al. with a mean, mean absolute, and maximum error of -0.01, 0.03, amd 0.17 eV, respectively, and the SCAN results with a mean, mean absolute, and maximum error of 0.00, 0.03, and 0.15 eV, respectively.

For PBE0/NL-MGGA-DTR and PBE0(0.35)/NL-MGGA-DTR, a few systems did not converge with the default CIDER PAW corrections, and a denser core integration grid and smaller set of feature projectors (see Appendix B) were used. Even with these settings, one system (\cfReSe2) did not converge with PBE0(0.35)/NL-MGGA-DTR, and it had to be run with SG15 norm-conserving pseudopotentials (NCPP) [120]. For the 255 calculations in the main text, the mean, mean absolute, and maximum deviations between PBE0/NL-MGGA-DTR with NCPP and PAW were 0.03, 0.12, and 1.12 eV, respectively. In addition, the band gap for \cfReSe2 with PBE0/NL-MGGA-DTR was 1.29 eV with PAW and 1.30 eV with NCPP, so the use of pseudopotentials appears reasonable for this particular case. We think that the convergence issues are due to numerical problems with the PAW corrections for nonlocal features, and improving the numerical stability of these corrections will be a subject of future work.

For the band gaps in Table 4, the same procedure was used, except with denser k-point meshes containing 12 k-points per Å*-1*. The HSE06 band gaps were computed using VASP [88], with the k-point mesh set to contain 4000 k-points divided by the number of atoms in the unit cell. This is a coarser mesh than was used for the semilocal and CIDER calculations, but we found that the band gap difference compared to larger k-point densities was less than 0.01 eV.

A.12 Supercell and Neutral Defect Performance Benchmark

The atomic structure of the 215-atom single vacancy cells of diamond and silicon were obtained using VASP [88, 89, 90] with PAW datasets [66, 62], an energy cutoff of 520 eV, the “Accurate” precision setting, and a $2\times 2\times 2$ Monkhorst-Pack k-point mesh [121], starting with the PBE lattice constant obtained in VASP using the same settings with a $14\times 14\times 14$ $\Gamma$ -centered k-point mesh. Using these structures, SCF ground-state calculations with static atomic positions were performed in GPAW [60, 61] for the 216-atom bulk and 215-atom vacancy cells for each functional. The formation energy of the neutral defect was computed as

[TABLE]

where the bulk and defect energies were computed with 216 and 215-atom supercells, respectively, with a $2\times 2\times 2$ Monkhorst-Pack k-point mesh, 520 eV plane-wave cutoff, 0.134 Å grid spacing, and $10^{-5}$ eV convergence threshold. The time to evaluate the formation energy for each functional was computed as the sum of the time taken for the bulk and vacancy cell calculations, as obtained from the GPAW internal calculation timer. The HSE06 and PBE0 formation energies were obtained using VASP with the same structures, k-point grid, and energy cutoff.

For the $\Gamma$ -point calculations used in fig 8a, GPAW was also used for the semilocal and ML functionals. For the hybrid DFT calculations, JDFTx [87] was used with the ACE [17] algorithm for exchange, a 19.2 Eh plane-wave cutoff, a $10^{-6}$ Eh convergence threshold, and SG15 norm-conserving pseudopotentials [120].

A.13 Charged Defect Calculations

The charged defect levels were computed with PBE, PBE0/NL-MGGA-DTR, and PBE0(0.3)/NL-MGGA-DTR. The mixing parameter $\alpha=0.3$ of the last functional was selected to match the zero-temperature experimental band gap of silicon, 1.17 eV. For each functional, the lattice constant, band edges, and band gap were obtained from a unit cell relaxation with 520 eV cutoff, $h=0.134$ Å, and a $24\times 24\times 24$ k-point mesh in GPAW [60, 61]. Then, a 512-atom supercell was created for each defect, with perturbed structures used for the vacancy to break symmetry. For each charge state and defect, the geometry was relaxed in GPAW using the BFGS with Line Search algorithm to a tolerance of 0.05 eV/Å with a $\Gamma$ -point only calculation. This geometry was then used as the starting point for the final geometry relaxation, which was performed with a $2\times 2\times 2$ $\Gamma$ -centered k-point mesh and a force convergence tolerance of 0.01 eV/Å. Because PBE0(0.3)/NL-MGGA-DTR is so similar to PBE0/NL-MGGA-DTR, and because the lattice constants are similar, we saved time for this functional by starting the final relaxation from the fractional coordinates of the PBE0/NL-MGGA-DTR calculations.

The formation energies of the defects were calculated as [6]

[TABLE]

where $E_{\text{tot}}$ is the total ground-state energy of the supercell DFT calculation, $X^{q}$ is defect $X$ in charge state $q$ , $E_{F}$ is the Fermi level, and $E_{\text{corr}}[X^{q}]$ is a finite-size energy correction for the defect. The stoichiometry changes and chemical potentials $n_{i}$ and $\mu_{i}$ were neglected because they do not impact electronic transition levels. We used the Freysoldt finite-size correction-scheme for $E_{\text{corr}}[X^{q}]$ [91], which combines an electrostatic and potential alignment correction. This correction requires an estimate of the dielectric constant $\epsilon$ , for which we used the experimental value of 11.9 [122].

Because some of the defects studied in this work are shallow donors and acceptors, and because electrostatic corrections like those developed by Freysoldt et al. [91] and Kumagai et al. [123] assume localized charges, these corrections might be inaccurate. Therefore, for comparison, we also computed the transition levels using the potential alignment correction only, i.e.

[TABLE]

where $\bar{V}_{\text{esp}}^{\infty}$ is the average electrostatic potential in a region far from the defect site. This is equivalent to the full Freysoldt correction with $\epsilon\rightarrow\infty$ . We used pymatgen [102] to perform the corrections.

Appendix B Construction of Feature Projector Functions for CIDER within PAW

B.1 Construction of Feature Smoothing Projectors

We solve for the coefficients $D_{j\beta}$ and $C_{i\alpha}$ in Eqs. 82 and 83 by minimizing the loss function

[TABLE]

For the current implementation, $w_{\beta}=1$ , but this could be changed if needed to encourage better smoothness or numerical precision. We start with the following set of functions:

[TABLE]

where $\mathbb{O}$ and $\mathbb{E}$ are the sets of positive odd and even integers, respectively. Then we introduce the matrix $\mathbf{P}$

[TABLE]

In the above equation, $\mathbf{P}^{11}$ is an $N_{h}N_{a}\times N_{h}N_{a}$ matrix,

[TABLE]

where $N_{h}$ is the number of $h_{j}$ functions and $N_{a}$ is the number of control points for kernel interpolation. Similarly, $\mathbf{P}^{22}$ is an $N_{g}N_{a}\times N_{g}N_{a}$ matrix,

[TABLE]

where $N_{g}$ is the number of $g_{i}$ functions. The constants $\sigma_{11}$ , $\sigma_{22}$ , and $\sigma_{22}^{\prime}$ are regularization constants to reduce the magnitude of the feature augmentation correction in the core region, which helps with numerical stability by preventing the pseudo-feature from being unrealistically large in the core.

Lastly, $\mathbf{P}^{12}$ is an $N_{h}N_{a}\times N_{g}N_{a}$ matrix:

[TABLE]

The coefficients can be solved by minimizing the loss Eq. 121, which results in the linear system

[TABLE]

where

[TABLE]

The matrix $\mathbf{P}$ is block-diagonal in the angular momentum channels due to the orthogonality of spherical harmonics, so each angular momentum channel can be solved separately. This makes the whole routine fairly efficient, and the number of operations per SCF cycle is linear-scaling in the number of atoms. The Fourier transforms of the $g_{i}$ and $h_{j}$ functions needed for Eqs. 133 and 134 are obtained using the NUMSBT algorithm [124], and the $\Delta F_{A,\beta}(\mathbf{k})$ is decomposed into angular momentum channels via the $\Delta F_{A,L,\beta}(k)$ term obtained in Eq. 80. Also, note that there are a larger number of $\alpha$ and $\beta$ indices in the atomic augmentation regions than on the FFT grid. The above fitting procedure is only used for indices going up to the maximum $\alpha$ index on the FFT grid. The higher indices (corresponding to larger values of $q_{\alpha}$ that are only relevant in the core regions) are solved by simply projecting onto the $h_{j}(\mathbf{r})$ basis, i.e. setting $\mathbf{D}=\left(\mathbf{P}^{11}\right)^{-1}\mathbf{b}^{1}$ in the subspace of core region-only $\alpha$ indices. This procedure is equivalent to assuming that the larger $q_{\alpha}$ contributions to $\theta_{\alpha}$ and $F_{\beta}$ are localized in the core region, which is reasonable because these contributions correspond to high densities and short length-scales for $\Phi_{\alpha\beta}(r)$ .

B.2 Construction of the Feature Augmentation Projectors

The feature augmentation projectors used in Eq. 87 are constructed as

[TABLE]

The on-site function sets $\tilde{f}_{i}^{A}(\mathbf{r})$ and $\tilde{p}_{j}^{A}(\mathbf{r})$ are then computed as

[TABLE]

The construction localizes $\tilde{p}_{j}^{A}(\mathbf{r})$ inside the core region and satisfies the condition that $\innerproduct{\tilde{f}_{i}^{A}}{\tilde{p}_{j}^{A}}=\delta_{ij}$ .

Acknowledgements.

The authors thank Jennifer Coulter for helpful guidance on running JDFTx calculations. This work was supported by the National Defense Science and Engineering Graduate (NDSEG) Fellowship Program under contract FA9550-21-F-0003, the Camille and Henry Dreyfus Foundation Grant No. ML-22-075, and the Department of Navy award N00014-20-1-2418 issued by the Office of Naval Research. Computational resources were provided by the Harvard University FAS Division of Science Research Computing Group.

Appendix S1 Table of all GMTKN55 Subset Error Metrics

This section consists of a large table (shown below) with detailed GMTKN55 error metrics for each functional presented in this work. All entries contain the mean absolute deviation (MAD) relative to GMTKN55 reference values, except for the aggregate errors WT2 (WTMAD-2 as defined by Goerigk et al. [10]) and MOM (mean of means). The “DTR tr,” “DTR val,” and “DTR test” labels denote the train/test partition used to train NL-MGGA-DTR on GMTKN55, while “tr/val” and “test” refer to the train/test partition used for all other functionals. All units are in kcal/mol.

[TABLE]

Appendix S2 Analysis of GMTKN55 Results with Different Error Averaging Metrics

S2.1 Results Relative to PBE0

As mentioned in Section III.1, the analysis of functional accuracy can be sensitive to the averaging scheme for aggregate errors across the entire GMTKN55 database. In Fig. S1, we plot the average deviations relative to PBE0 of semilocal and ML functionals using two altnerative metrics to the mean of means (MoM) used in Section III.1: mean absolute deviation (MAD, top) and weighted MAD (specifically, the WTMAD-2 metric used by Goerigk et al. [10]). The general trends for these metrics are similar to those for the MoM; the SL-MGGA is more accurate than the SL-GGA and NL-GGA, and the NL-MGGA is more accurate than the SL-MGGA. However, the transferability issues suffered by the SL-GGA and NL-GGA appear even more severe when using the WTMAD-2 metric, and the NL-GGA is actually less accurate than the SL-GGA overall. The WTMAD-2 metric gives larger weight to subdatabases with smaller reaction energy magnitudes (e.g. noncovalent interactions are weighted more heavily than atomization energies), so this worse performance for WTMAD-2 suggests that the SL-GGA and NL-GGA cannot accurately describe the smaller density variations associated with these reactions. Notably, the SL-MGGA suffers a similar problem on the intermolecular NCI database; as a result, the NL-MGGA is the only functional that is closer to PBE0 than r2SCAN for every sub-database, a testament to its robustness.

S2.2 Results Relative to Reference Values

Figure S2 shows the performance of the semilocal, hybrid, and CIDER functionals on the GMTKN55 database with the three different averaging metrics MoM, MAD, and WTMAD-2 discussed in the previous section. The same dispersion corrections are used as in Section III.4. The results with MoM and MAD are very similar, while the performance of CIDER functionals relative to hybrid DFT is notably worse for the WTMAD-2. This again speaks to the difficulty of capturing small relative energies accurately for CIDER functionals, which will need to be a topic of future research. However, it is notable that the SL-MGGA models suffer from this problem more severely than the NL-MGGA-DTR; PW6B95/SL-MGGA has a worse WTMAD-2 than r2SCAN, while PW6B95/NL-MGGA-DTR has a WTMAD-2 that is better than r2SCAN and comparable to B3LYP and PBE0. As shown in Table S1, the NL-MGGA and NL-MGGA-PBE models also significantly outperform the SL-MGGA models, so the worse performance of SL-MGGA is not an artefact of the different training set used for NL-MGGA-DTR. From this analysis, we conclude that introducing a more flexible feature set like that used for the NL-MGGA and its variants results in models that perform better for a variety of systems and therefore are less sensitive to a particular error metric.

Appendix S3 More data on band gaps

Table S2 contains the error metrics for semilocal, PBE0/CIDER, and hybrid functionals for all 453 systems in the database of Borlido et al. [42] that do not contain La, Th, or Yb. Note that for several systems, we used SG15 norm-conserving pseudopotentials [120] as described in Appendix A.11.

Appendix S4 Validation Results and Selected Hyperparameters

Tables S3-S6 show the final validation results for each functional type, ordered by validation loss (Eq. 115) and labeled by internal project code name. Table S7 shows the final hyperparameters for the CIDER functionals presented in the main text. For all of these figures, see Appendix A.8 for hyperparameter definitions.

Appendix S5 Density Accuracy

The density error for a molecule $i$ is measured using numerical integrals as

[TABLE]

where $w_{g}^{i}$ and $\mathbf{r}_{g}^{i}$ are the real-space integration weights and positions for molecule $i$ . The aggregate percentage error for a sub-database is

[TABLE]

Table S8 shows the density deviations from PBE0 for PBE, r2SCAN, and the ML functionals over several subsets of GMTKN55. The NL-MGGA achieves the best match to PBE0 of the ML functionals, and this improvement is transferable to the test set. r2SCAN, however, matches PBE0 more closely than the ML functionals for most subsets. This suggests there might still be some room for improvement in designing ML functionals with accurate densities. However, NL-MGGA yields smaller deviations from PBE0 densities than PBE does for every subset, so the density distributions get closer to those of PBE0 when the ML exchange is added.

Appendix S6 Bond Lengths and Lattice Constants

To assess the accuracy of CIDER functionals for structure prediction, we performed two benchmarks. First, we computed the bond lengths of six diatomic molecules in PySCF (Table S9) and compared the PBE0/CIDER bond length predictions to those of PBE0. For these calculations, the def2-QZVPPD basis set was used without density fitting, level 4 PySCF grids were used, and the internal nonlocal feature evaluation settings were $q_{\text{max}}=3000$ , $\beta^{\text{aux}}=1.6$ , $\lambda=1.6$ , and $l_{\text{max}}=10$ . The results are shown in Table S9 and suggest that with increasing model complexity, the bond lengths more closely match PBE0. For reference, the atomization energy predictions for these molecules are shown in Table S10 and demonstrate a similar trend. Note that the CIDER models are trained on these systems, so this benchmark serves only as a basic sanity test for CIDER forces, and not as a test of generalizability.

For the second benchmark, we computed the lattice constants of the materials in the SOL62 database [67, 33] in GPAW (with the same settings as were used for the cohesive energy calculations, c.f. Sections A.3 and A.10) and compared the lattice constants to their experimental values. We separated the results into the train, validation, and test partitions to evaluate model generalizability. Also, we only computed the lattice constants for the PBE0/NL-MGGA-DTR CIDER functional since it was the most numerically stable. The results are shown in Table S11. No zero-point vibrational corrections were performed, but the mean absolute lattice constant deviation with and without these corrections for PBE is 0.01 Å, significantly smaller than the typical deviations between the DFT predictions and experiment.

The MAE of PBE0/NL-MGGA-DTR compared to experiment is slightly smaller than that of PBE on the test set, but larger than that of r2SCAN and HSE06. Due to unusually large errors on a few alkali and alkaline earth metals (i.e. K, Rb, and Ba), PBE0/NL-MGGA-DTR also has a slightly larger RMSE than PBE. The results suggest that overall, the ML functional can predict reasonable, GGA-accuracy structures for materials outside the training set, but more work is needed to achieve the accuracy of more advanced functionals like r2SCAN and HSE06 for structure optimization. In the future, we will also need to investigate the cause of NL-MGGA-DTR severely underbinding the larger alkali/alkaline earth metals (even compared to HSE06).

Appendix S7 More Detailed Defect Transition Level Data

This section provides more details on the defect transition level data. The bulk lattice constant and band gap, determined by the methods in Section A.5, are tabulated in Table S12. The defect transition levels themselves are shown in Table S13.

S7.1 The Cu +/0 Transition Level

As discussed in the main text, the copper +1/0 transition level computed with PBE0(0.3)/NL-MGGA-DTR is not converged with the supercell size and k-point mesh used with HSE06 [85] by Sharan et al. [98]. To further illustrate this point, Table S14 shows the +1/0 energy transition level for copper $E(+/0)$ for HSE06 and PBE0(0.3)/NL-MGGA-DTR for several k-point meshes. For this comparison, finite-size effects are ignored, so

[TABLE]

(with $X$ being the defect supercell). For consistency, all calculations were performed with a geometry relaxed with HSE06. The lattice constant was optimized with HSE06 with an $8\times 8\times 8$ $\Gamma$ -centered k-point mesh, and the defect geometries were optimized with HSE06 with the $2\times 2\times 2$ $\Gamma$ -centered k-point mesh. No restrictions were placed on the magnetic moment for structural relaxation, but for consistency, all static calculations for computing energy level differences were computed using fixed total magnetic moments of $S=1/2$ for the neutral state and $S=0$ for the +1 state. These are the ground-state spins as determined by Sharan et al. [98], though it is worth noting that for the 512-atom supercell and $2\times 2\times 2$ $\Gamma$ -centered k-point mesh, the $S=3/2$ spin state is favored over $S=1/2$ . The HSE06 calculations were performed with VASP [88], while the CIDER calculations were performed with GPAW [60].

The primary takeaway from Table S14 is that neither the CIDER nor the HSE06 calculations are converged with the $2\times 2\times 2$ k-point meshes. The most significant deviation is between the $2\times 2\times 2$ MP and $4\times 4\times 4$ $\Gamma$ -centered mesh for HSE06, for which the difference is 0.15 eV. In addition, PBE0(0.3)/NL-MGGA-DTR and HSE06 are in good agreement for all calculations for a given k-point mesh, with deviations of 0.02-0.04 eV.

References

All reference numbers refer to the references in the main text.

Bibliography124

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Mardirossian and Head-Gordon [2017] N. Mardirossian and M. Head-Gordon, Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals, Mol. Phys. 115 , 2315 (2017) . · doi ↗
2Kohn and Sham [1965] W. Kohn and L. J. Sham, Self-Consistent Equations Including Exchange and Correlation Effects, Phys. Rev. 140 , A 1133 (1965) . · doi ↗
3Wellendorff et al. [2015] J. Wellendorff, T. L. Silbaugh, D. Garcia-Pintos, J. K. Nørskov, T. Bligaard, F. Studt, and C. T. Campbell, A benchmark database for adsorption bond energies to transition metal surfaces and comparison to selected DFT functionals, Surf. Sci. 640 , 36 (2015) . · doi ↗
4He et al. [2019] Q. He, B. Yu, Z. Li, and Y. Zhao, Density Functional Theory for Battery Materials, ENERGY Environ. Mater. 2 , 264 (2019) . · doi ↗
5Lany and Zunger [2008] S. Lany and A. Zunger, Assessment of correction methods for the band-gap problem and for finite-size effects in supercell defect calculations: Case studies for Zn O and Ga As, Phys. Rev. B 78 , 235104 (2008) . · doi ↗
6Freysoldt et al. [2014] C. Freysoldt, B. Grabowski, T. Hickel, J. Neugebauer, G. Kresse, A. Janotti, and C. G. Van de Walle, First-principles calculations for point defects in solids, Rev. Mod. Phys. 86 , 253 (2014) . · doi ↗
7Perdew [2001] J. P. Perdew, Jacob’s ladder of density functional approximations for the exchange-correlation energy, in AIP Conf. Proc. , Vol. 577 (AIP, 2001) pp. 1–20. · doi ↗
8Perdew and Levy [1983] J. P. Perdew and M. Levy, Physical Content of the Exact Kohn-Sham Orbital Energies: Band Gaps and Derivative Discontinuities, Phys. Rev. Lett. 51 , 1884 (1983) . · doi ↗