Design and Analysis of Machine Learning Exchange-Correlation Functionals   via Rotationally Invariant Convolutional Descriptors

Xiangyun Lei; Andrew J. Medford

arXiv:1901.10822·physics.chem-ph·June 19, 2019

Design and Analysis of Machine Learning Exchange-Correlation Functionals via Rotationally Invariant Convolutional Descriptors

Xiangyun Lei, Andrew J. Medford

PDF

TL;DR

This paper introduces a novel data-driven method for designing exchange-correlation functionals in density functional theory using convolutional descriptors inspired by computer vision, demonstrating high accuracy in reproducing B3LYP energies.

Contribution

It proposes a new orbital-free, rotationally invariant convolutional descriptor framework combined with neural networks for exchange-correlation functional design, advancing beyond traditional physical approximations.

Findings

01

Reproduces B3LYP formation energies within chemical accuracy

02

Uses orbital-free descriptors with a 0.2 Å spatial extent

03

Suggests convolutional descriptors are promising for XC functional development

Abstract

In this work we explore the potential of a new data-driven approach to the design of exchange-correlation (XC) functionals. The approach, inspired by convolutional filters in computer vision and surrogate functions from optimization, utilizes convolutions of the electron density to form a feature space to represent local electronic environments and neural networks to map the features to the exchange-correlation energy density. These features are orbital free, and provide a systematic route to including information at various length scales. This work shows that convolutional descriptors are theoretically capable of an exact representation of the electron density, and proposes Maxwell-Cartesian spherical harmonic kernels as a class of rotationally invariant descriptors for the construction of machine-learned functionals. The approach is demonstrated using data from the B3LYP functional on…

Tables3

Table 1. Table 1 : The analytical expressions of first 4 orders of MCSH denoted by S i j k ( n ) subscript superscript 𝑆 𝑛 𝑖 𝑗 𝑘 S^{(n)}_{ijk} .

n	{ijk}	$S_{i j k}^{(n)}$	n	{ijk}	$S_{i j k}^{(n)}$
0	000	1	4	400	$105 {\hat{x}}^{4} - 90 {\hat{x}}^{2} + 9$
1	100	$\hat{x}$		040	$105 {\hat{y}}^{4} - 90 {\hat{y}}^{2} + 9$
	010	$\hat{y}$		004	$105 {\hat{z}}^{4} - 90 {\hat{z}}^{2} + 9$
	001	$\hat{z}$		310	$105 {\hat{x}}^{3} \hat{y} - 45 \hat{x} \hat{y}$
2	200	$3 {\hat{x}}^{2} - 1$		301	$105 {\hat{x}}^{3} \hat{z} - 45 \hat{x} \hat{z}$
	020	$3 {\hat{y}}^{2} - 1$		031	$105 {\hat{y}}^{3} \hat{z} - 45 \hat{y} \hat{z}$
	002	$3 {\hat{z}}^{2} - 1$		130	$105 \hat{x} {\hat{y}}^{3} - 45 \hat{x} \hat{y}$
	110	$3 \hat{x} \hat{y}$		103	$105 \hat{x} {\hat{z}}^{3} - 45 \hat{x} \hat{z}$
	101	$3 \hat{x} \hat{z}$		013	$105 \hat{y} {\hat{z}}^{3} - 45 \hat{y} \hat{z}$
	011	$3 \hat{y} \hat{z}$		220	$105 {\hat{x}}^{2} {\hat{y}}^{2} - 15 {\hat{x}}^{2} - 15 {\hat{y}}^{2} + 3$
3	300	$15 {\hat{x}}^{3} - 9 \hat{x}$		202	$105 {\hat{x}}^{2} {\hat{z}}^{2} - 15 {\hat{x}}^{2} - 15 {\hat{z}}^{2} + 3$
	030	$15 {\hat{y}}^{3} - 9 \hat{y}$		022	$105 {\hat{y}}^{2} {\hat{z}}^{2} - 15 {\hat{y}}^{2} - 15 {\hat{z}}^{2} + 3$
	003	$15 {\hat{z}}^{3} - 9 \hat{z}$		211	$105 {\hat{x}}^{2} \hat{y} \hat{z} - 3 \hat{y} \hat{z}$
	210	$15 {\hat{x}}^{2} \hat{y} - 3 \hat{y}$		121	$105 \hat{x} {\hat{y}}^{2} \hat{z} - 3 \hat{x} \hat{z}$
	201	$15 {\hat{x}}^{2} \hat{z} - 3 \hat{z}$		112	$105 \hat{x} \hat{y} {\hat{z}}^{2} - 3 \hat{x} \hat{y}$
	021	$15 {\hat{y}}^{2} \hat{z} - 3 \hat{z}$
	120	$15 \hat{x} {\hat{y}}^{2} - 3 \hat{x}$
	102	$15 \hat{x} {\hat{z}}^{2} - 3 \hat{x}$
	012	$15 \hat{y} {\hat{z}}^{2} - 3 \hat{y}$
	111	$15 \hat{x} \hat{y} \hat{z}$

Table 2. Table 2 : List of number of features for each of the descriptor sets

Descriptor Set	Number of Descriptors	Descriptor Set	Number of Descriptors
${\bar{λ}}_{(0.00)}^{(0)}$	1	${\bar{λ}}_{(0.08)}^{(1)}$	9
${\bar{λ}}_{(0.02)}^{(0)}$	2	${\bar{λ}}_{(0.2)}^{(1)}$	21
${\bar{λ}}_{(0.04)}^{(0)}$	3	${\bar{λ}}_{(0.02)}^{(2)}$	5
${\bar{λ}}_{(0.08)}^{(0)}$	5	${\bar{λ}}_{(0.04)}^{(2)}$	9
${\bar{λ}}_{(0.2)}^{(0)}$	11	${\bar{λ}}_{(0.08)}^{(2)}$	17
${\bar{λ}}_{(0.02)}^{(1)}$	3	${\bar{λ}}_{(0.2)}^{(2)}$	41
${\bar{λ}}_{(0.04)}^{(1)}$	5

Table 3. Table 3 : Errors of the outlier systems in the test set. The unit is eV. MAE is the mean absolute error of the corresponding error metric for the 15 training molecular systems plus 7 test systems

Model	Error	$C H_{3} N O_{2}$	$g l y c i n e$	$N C C N$	MAE
$N N [{\bar{λ}}_{0.04}^{(1)}]$	$ε_{a b s .}$	2.01	0.95	0.94	0.4
	$ε_{p r e d .}$	1.38	0.28	-0.21	0.12
	$ε_{f o r m .}$	1.27	0.19	-0.22	0.09

$N N [{\bar{λ}}_{0.08}^{(1)}]$	$ε_{a b s .}$	1.24	0.76	0.49	0.15
	$ε_{p r e d .}$	0.18	0.11	0.24	0.04
	$ε_{f o r m .}$	0.24	0.16	0.22	0.02

$N N [{\bar{λ}}_{0.2}^{(1)}]$	$ε_{a b s .}$	0.24	0.55	0.13	0.06
	$ε_{p r e d .}$	0.08	0.37	0.03	0.01
	$ε_{f o r m .}$	0.08	0.38	0.02	0.01

$N N [{\bar{λ}}_{0.04}^{(2)}]$	$ε_{a b s .}$	3.66	0.56	0.49	0.18
	$ε_{p r e d .}$	-3.14	0.12	0.12	0.03
	$ε_{f o r m .}$	-3.19	0.06	0.12	0.02

$N N [{\bar{λ}}_{0.08}^{(2)}]$	$ε_{a b s .}$	1.42	0.61	0.14	0.09
	$ε_{p r e d .}$	1.08	0.3	0.02	0.01
	$ε_{f o r m .}$	1.05	0.26	-0.02	0.01

$N N [{\bar{λ}}_{0.2}^{(2)}]$	$ε_{a b s .}$	0.35	0.7	0.12	0.06
	$ε_{p r e d .}$	0.08	0.19	-0.01	0.01
	$ε_{f o r m .}$	0.09	0.19	-0.01	0.01

Equations52

E_{GS} [ρ (x)] = T [ρ (x)] + J [ρ (x)] + E_{e x t} [ρ (x)] + E_{x c} [ρ (x)]

E_{GS} [ρ (x)] = T [ρ (x)] + J [ρ (x)] + E_{e x t} [ρ (x)] + E_{x c} [ρ (x)]

\tilde{ϵ}_{x c} (x) = E_{r - V W N} (ρ (x), C_{1}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4}) + E_{N N} (λ_{i} (ρ (x)), W_{j k})

\tilde{ϵ}_{x c} (x) = E_{r - V W N} (ρ (x), C_{1}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4}) + E_{N N} (λ_{i} (ρ (x)), W_{j k})

ε_{ab so l u t e} = i \sum ∣ ϵ_{x c} (x_{i}) - \tilde{ϵ}_{x c} (x_{i}) ∣ \times h^{3}

ε_{ab so l u t e} = i \sum ∣ ϵ_{x c} (x_{i}) - \tilde{ϵ}_{x c} (x_{i}) ∣ \times h^{3}

ε_{p r e d i c t} = E_{x c} - i \sum \tilde{ϵ}_{x c} (x_{i}) \times h^{3} = i \sum (ϵ_{x c} (x_{i}) - \tilde{ϵ}_{x c} (x_{i})) \times h^{3}

ε_{p r e d i c t} = E_{x c} - i \sum \tilde{ϵ}_{x c} (x_{i}) \times h^{3} = i \sum (ϵ_{x c} (x_{i}) - \tilde{ϵ}_{x c} (x_{i})) \times h^{3}

E_{x c} [ρ (x)] = \int_{R^{3}} ϵ_{x c} [ρ (x)] (x) d^{3} x

E_{x c} [ρ (x)] = \int_{R^{3}} ϵ_{x c} [ρ (x)] (x) d^{3} x

ρ (x) = ρ^{x y z}

ρ (x) = ρ^{x y z}

ϵ_{x c} [ρ (x)] (x) = ϵ^{l mn} (ρ^{x y z})

ϵ_{x c} [ρ (x)] (x) = ϵ^{l mn} (ρ^{x y z})

λ_{q}^{x y z} = (C_{q} ⊛ ρ)^{x y z}

λ_{q}^{x y z} = (C_{q} ⊛ ρ)^{x y z}

\tilde{λ}_{x y z}^{l mn} = (δ_{x y z} ⊛ ρ)^{l mn} = ρ^{x y z}

\tilde{λ}_{x y z}^{l mn} = (δ_{x y z} ⊛ ρ)^{l mn} = ρ^{x y z}

ϵ^{l mn} (ρ^{x y z}) = ϵ^{l mn} (\tilde{λ}_{x y z}^{l mn})

ϵ^{l mn} (ρ^{x y z}) = ϵ^{l mn} (\tilde{λ}_{x y z}^{l mn})

ϵ^{l mn} (ρ^{x y z}) = ϵ^{l mn} (\tilde{λ}_{x y z}^{l mn}) \approx ϵ^{l mn} (λ_{q}^{l mn})

ϵ^{l mn} (ρ^{x y z}) = ϵ^{l mn} (\tilde{λ}_{x y z}^{l mn}) \approx ϵ^{l mn} (λ_{q}^{l mn})

ϵ_{L D A} [ρ (x)] (x) = ϵ_{H E G} (ρ (x))

ϵ_{L D A} [ρ (x)] (x) = ϵ_{H E G} (ρ (x))

ϵ_{L D A}^{l mn} (ρ^{x y z}) = ϵ_{H E G}^{l mn} (ρ^{l mn}) = ϵ_{H E G}^{l mn} (λ_{0}^{l mn})

ϵ_{L D A}^{l mn} (ρ^{x y z}) = ϵ_{H E G}^{l mn} (ρ^{l mn}) = ϵ_{H E G}^{l mn} (λ_{0}^{l mn})

ϵ_{GG A} [ρ (x)] (x) = ϵ_{GG A} (ρ (x), \nabla ρ (x))

ϵ_{GG A} [ρ (x)] (x) = ϵ_{GG A} (ρ (x), \nabla ρ (x))

ϵ_{GG A}^{l mn} (ρ^{x y z}) = ϵ_{GG A}^{l mn} (ρ^{l mn}, \nabla ρ^{l mn}) = ϵ_{GG A}^{l mn} (λ_{0}^{l mn}, λ_{1}^{l mn})

ϵ_{GG A}^{l mn} (ρ^{x y z}) = ϵ_{GG A}^{l mn} (ρ^{l mn}, \nabla ρ^{l mn}) = ϵ_{GG A}^{l mn} (λ_{0}^{l mn}, λ_{1}^{l mn})

ϵ_{\nabla N}^{l mn} (ρ^{x y z}) \approx ϵ^{l mn} (\nabla_{0}^{l mn}, \nabla_{1}^{l mn}, \nabla_{2}^{l mn}, ... \nabla_{N}^{l mn}) = ϵ^{l mn} (\nabla_{q}^{l mn})

ϵ_{\nabla N}^{l mn} (ρ^{x y z}) \approx ϵ^{l mn} (\nabla_{0}^{l mn}, \nabla_{1}^{l mn}, \nabla_{2}^{l mn}, ... \nabla_{N}^{l mn}) = ϵ^{l mn} (\nabla_{q}^{l mn})

{M_{r, ij k}^{(n)} = \sideset_{P (i, j, k)} \sum μ_{r, ij k}^{2} ∣ i, j, k \in N, r \in R^{+}}

{M_{r, ij k}^{(n)} = \sideset_{P (i, j, k)} \sum μ_{r, ij k}^{2} ∣ i, j, k \in N, r \in R^{+}}

μ_{r, ij k} = in p u t ⊛ [f_{r} (x, y, z) \times S_{ij k}^{(n)} (x, y, z)] stencil

μ_{r, ij k} = in p u t ⊛ [f_{r} (x, y, z) \times S_{ij k}^{(n)} (x, y, z)] stencil

f_{r} (x, y, z) = {10 if x^{2} + y^{2} + z^{2} \leq r if x < 0

f_{r} (x, y, z) = {10 if x^{2} + y^{2} + z^{2} \leq r if x < 0

\begin{split}S_{ijk}^{(n)}(x,y,z)=\sum_{m_{1}=0}^{i/2}\sum_{m_{2}=0}^{j/2}\sum_{m_{3}=0}^{k/2}(-1)^{m}(2n-2m-1)!!\\ \times\left[\begin{array}[]{l}i\\ m_{1}\end{array}\right]\left[\begin{array}[]{l}j\\ m_{2}\end{array}\right]\left[\begin{array}[]{l}k\\ m_{3}\end{array}\right]r^{2m}x^{i-2m_{1}}y^{j-2m_{2}}z^{k-2m_{3}}\end{split}

\begin{split}S_{ijk}^{(n)}(x,y,z)=\sum_{m_{1}=0}^{i/2}\sum_{m_{2}=0}^{j/2}\sum_{m_{3}=0}^{k/2}(-1)^{m}(2n-2m-1)!!\\ \times\left[\begin{array}[]{l}i\\ m_{1}\end{array}\right]\left[\begin{array}[]{l}j\\ m_{2}\end{array}\right]\left[\begin{array}[]{l}k\\ m_{3}\end{array}\right]r^{2m}x^{i-2m_{1}}y^{j-2m_{2}}z^{k-2m_{3}}\end{split}

\left[\begin{array}[]{l}a\\ b\end{array}\right]=\frac{a!}{2^{b}b!(a-2b)!}

\left[\begin{array}[]{l}a\\ b\end{array}\right]=\frac{a!}{2^{b}b!(a-2b)!}

\left\{\begin{array}[]{l}\sqrt{\mu_{r_{1},000}^{2}},\sqrt{\mu_{r_{1},100}^{2}+\mu_{r_{1},010}^{2}+\mu_{r_{1},001}^{2}},\sqrt{\mu_{r_{1},200}^{2}+\mu_{r_{1},020}^{2}+\mu_{r_{1},002}^{2}},\sqrt{\mu_{r_{1},110}^{2}+\mu_{r_{1},101}^{2}+\mu_{r_{1},011}^{2}},\dots\\ \sqrt{\mu_{r_{2},000}^{2}},\sqrt{\mu_{r_{2},100}^{2}+\mu_{r_{2},010}^{2}+\mu_{r_{2},001}^{2}},\sqrt{\mu_{r_{2},200}^{2}+\mu_{r_{2},020}^{2}+\mu_{r_{2},002}^{2}},\sqrt{\mu_{r_{2},110}^{2}+\mu_{r_{2},101}^{2}+\mu_{r_{2},011}^{2}},\dots\\ \dots\end{array}\right\}

\left\{\begin{array}[]{l}\sqrt{\mu_{r_{1},000}^{2}},\sqrt{\mu_{r_{1},100}^{2}+\mu_{r_{1},010}^{2}+\mu_{r_{1},001}^{2}},\sqrt{\mu_{r_{1},200}^{2}+\mu_{r_{1},020}^{2}+\mu_{r_{1},002}^{2}},\sqrt{\mu_{r_{1},110}^{2}+\mu_{r_{1},101}^{2}+\mu_{r_{1},011}^{2}},\dots\\ \sqrt{\mu_{r_{2},000}^{2}},\sqrt{\mu_{r_{2},100}^{2}+\mu_{r_{2},010}^{2}+\mu_{r_{2},001}^{2}},\sqrt{\mu_{r_{2},200}^{2}+\mu_{r_{2},020}^{2}+\mu_{r_{2},002}^{2}},\sqrt{\mu_{r_{2},110}^{2}+\mu_{r_{2},101}^{2}+\mu_{r_{2},011}^{2}},\dots\\ \dots\end{array}\right\}

\overset{ˉ}{λ}_{(0.0)}^{(0)} = {M_{(0.0, 000)}^{(0)}} = {ρ} \overset{ˉ}{λ}_{(0.02)}^{(0)} = {ρ, M_{(0.02, 000)}^{(0)}} \overset{ˉ}{λ}_{(0.02)}^{(1)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}} \overset{ˉ}{λ}_{(0.02)}^{(2)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}, M_{(0.02, 200)}^{(2)}, M_{(0.02, 110)}^{(2)}} \overset{ˉ}{λ}_{(0.04)}^{(0)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.04, 000)}^{(0)}} \overset{ˉ}{λ}_{(0.04)}^{(1)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}, M_{(0.04, 000)}^{(0)}, M_{(0.04, 100)}^{(1)}}

\overset{ˉ}{λ}_{(0.0)}^{(0)} = {M_{(0.0, 000)}^{(0)}} = {ρ} \overset{ˉ}{λ}_{(0.02)}^{(0)} = {ρ, M_{(0.02, 000)}^{(0)}} \overset{ˉ}{λ}_{(0.02)}^{(1)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}} \overset{ˉ}{λ}_{(0.02)}^{(2)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}, M_{(0.02, 200)}^{(2)}, M_{(0.02, 110)}^{(2)}} \overset{ˉ}{λ}_{(0.04)}^{(0)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.04, 000)}^{(0)}} \overset{ˉ}{λ}_{(0.04)}^{(1)} = {ρ, M_{(0.02, 000)}^{(0)}, M_{(0.02, 100)}^{(1)}, M_{(0.04, 000)}^{(0)}, M_{(0.04, 100)}^{(1)}}

e_{X C, V W N} = e_{X, V W N} + e_{C, V W N} = ρ \frac{C _{1}}{r _{s}} + ρG (r_{s}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4})

e_{X C, V W N} = e_{X, V W N} + e_{C, V W N} = ρ \frac{C _{1}}{r _{s}} + ρG (r_{s}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4})

r_{s} = (3/4 π ρ)^{1/3}

r_{s} = (3/4 π ρ)^{1/3}

G (r_{s}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4}) = - 2 γ (1 + α_{1} r_{s}) \times ln {1 + \frac{1}{2 γ r _{s}^{1/2} ( β _{1} + r _{s}^{1/2} ( β _{2} + r _{s}^{1/2} ( β _{3} + r _{s}^{1/2} β _{4} )))}}

G (r_{s}, γ, α_{1}, β_{1}, β_{2}, β_{3}, β_{4}) = - 2 γ (1 + α_{1} r_{s}) \times ln {1 + \frac{1}{2 γ r _{s}^{1/2} ( β _{1} + r _{s}^{1/2} ( β _{2} + r _{s}^{1/2} ( β _{3} + r _{s}^{1/2} β _{4} )))}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Design and Analysis of Machine Learning Exchange-Correlation Functionals via Rotationally Invariant Convolutional Descriptors

Xiangyun Lei

Andrew J. Medford

[email protected]

School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, 30318 USA

(March 7, 2024)

Abstract

In this work we explore the potential of a new data-driven approach to the design of exchange-correlation (XC) functionals. The approach, inspired by convolutional filters in computer vision and surrogate functions from optimization, utilizes convolutions of the electron density to form a feature space to represent local electronic environments and neural networks to map the features to the exchange-correlation energy density. These features are orbital free, and provide a systematic route to including information at various length scales. This work shows that convolutional descriptors are theoretically capable of an exact representation of the electron density, and proposes Maxwell-Cartesian spherical harmonic kernels as a class of rotationally invariant descriptors for the construction of machine-learned functionals. The approach is demonstrated using data from the B3LYP functional on a number of small-molecules containing C, H, O, and N along with a neural network regression model. The machine-learned functionals are compared to standard physical approximations and the accuracy is assessed for the absolute energy of each molecular system as well as formation energies. The results indicate that it is possible to reproduce B3LYP formation energies to within chemical accuracy using orbital-free descriptors with a spatial extent of 0.2 Å. The findings provide empirical insight into the spatial range of electron exchange, and suggest that the combination of convolutional descriptors and machine-learning regression models is a promising new framework for XC functional design, although challenges remain in obtaining training data and generating models consistent with pseudopotentials.

I Introduction

Since its introduction in the mid-1960s [1; 2], density functional theory (DFT) has become a much-used tool in the fields of chemistry, material science, biology and others. The popularity of DFT arises mainly from its simple formalism and low computational cost compared to wavefunction theories (WFTs). The basic formalism of DFT establishes that the exact ground-state energy can be written as functional of the electron density:

[TABLE]

where $\rho(\vec{x})$ is the ground-state electron density, $E_{GS}$ is the ground-state energy, $T$ is the kinetic energy functional for the non-interacting system, $J$ is the classical Coulomb self-energy (or Hartree energy) functional, $E_{ext}$ is the energy due to external potential (e.g. atomic nuclei), and $E_{xc}$ is the exchange-correlation functional that accounts for the difference between classical and quantum-mechanical electronic repulsion as well as the difference in kinetic energy between the non-interacting and interacting systems [2]. Of these, $T$ , $J$ , and $E_{xc}$ are independent of the external potential and are hence considered “universal” functionals, while $E_{ext}$ depends on the atomic coordinates of a chemical system. Furthermore, $T$ , $J$ and $E_{ext}$ are known exactly, so the challenge is to determine $E_{xc}$ . Although a universal and exact ground-state $E_{xc}$ functional does exist as proved by Hohenberg and Kohn [1], the form of this functional remains unknown. Numerous strategies have been employed to construct density functional approximations (DFAs) for $E_{xc}$ ; however, despite over five decades of research and hundreds of trials, no existing functionals are universally comparable to the accuracy of wavefunction theories.

Construction of DFA’s relies on two main components: a model space that describes the electronic environment and a functional that connects the model space to the energy density. The concept of improving approximations by increasing the complexity of the model space is captured by Perdew’s popular analogy to “Jacob’s ladder” [3], which reveals an important trend: the more non-local information that is used to describe the electronic environment, the better the quality of the approximation. In the early days, Kohn and Sham approximated $E_{xc}$ by assuming a uniform electron gas with an electronic density equal to the (spin) density at a local point, referred to as the local density approximation (LDA)[2]. This approximation is surprisingly accurate for delocalized systems such as metals, but its accuracy is considerably lower for molecules. The next major improvement came decades later when several authors proposed using the gradient of electron density as an additional input to the exchange-correlation functional [4; 5; 6]. This led to the development of a family of DFA’s known as generalized gradient approximation (GGA) functionals that provided an order of magnitude improvement in accuracy and started a rapid increase in the application of DFT. The next logical step would be inclusion of the second derivative in the spirit of a Taylor expansion; however, the kinetic energy density is more commonly used [3]. This “meta-GGA” (mGGA) functional family includes a diverse range of physical and empirical approximations that have generally improved accuracy, although the improvements are not always systematic [7]. The next class of functionals deviates from the strategy of adding more semi-local information by including a component of fully non-local exact exchange. These “hybrid functionals”, introduced by Becke and coworkers in the B3PW91 functional [8; 9], exhibit another general (though not necessarily systematic) improvement in accuracy. Hybrid functionals such as the ubiquitous B3LYP functional [8; 10] are very popular, particularly in the chemistry community, due to their high accuracy and relatively low cost for molecular systems. However, hybrid functionals have considerably higher computational cost and are difficult to implement for extended systems (e.g. solids and surfaces), leading to screened hybrid functionals such as HSE06 [11; 12]. Numerous other strategies have also been employed to capture long-range interactions, including fully non-local approaches such as 100% exchange functionals [13; 14], approaches that combine multiple approximations [15; 16], double-hybrid functionals including wavefunction based correlation [17; 18; 19], and functionals including dispersion [20; 21; 22; 23; 24; 25; 26]. These diverse options for model spaces indicate that inclusion of additional and increasingly non-local information increase the accuracy of DFA’s; however, the improvement of model spaces has been based primarily on chemical intuition. This is advantageous in the derivation of functionals from chemical and physical principles, but also leads to difficulties in systematically improving models or deconvoluting multiple physical effects to avoid double counting.

An alternative approach to model space development is to construct model spaces that can be systematically expanded into a theoretically complete description of the system. This is similar to a common approach in the fields of image processing and computer vision where convolutions are used to extract features/information of varying length scales [27; 28]. A noteworthy recent triumph in the application of convolutions in image processing are convolutional neural networks (convNets) [29; 30] where convolutional kernels are determined through deep learning. ConvNets have achieved unprecedented accuracy in handwriting, object, and facial recognition, and have revolutionized the field of image analysis [30; 31; 32; 33; 34; 35]. This approach can be translated to the field of functional development since electron density data can be projected onto a finite-difference grid and treated as a 3D image. With 3D convolutions, any local and semi-local feature of the electronic environment can be extracted, analogous to 2D images. In this work we explore this approach to model space construction, and show that the convolutional descriptor model space is theoretically complete in the limit that the kernel has the same size as the system.

In addition to the model space, a functional requires a mathematical connection between the model space and the exchange-correlation energy. This challenge is at the core of most functional development, and there are two distinct philosophies. The reductionist approach applies physical principles and theoretical constraints to derive “parameter-free” functionals. The PBE functional is a well-known example of this philosophy [36; 37]. These derived functionals tend to have less systematic bias toward specific molecular systems, and have a predictable accuracy across all systems. The alternative approach is empiricism, with a more practical focus on maximizing the accuracy of DFT in specific applications. Most empirical functionals are based on derived functional forms where some parameters are optimized based on molecular data of the systems of interest. The data is usually obtained from experiments or higher-level calculations. These functionals are usually accurate for systems similar to those used in training, but the accuracy is typically lower for other systems or properties [38]. The B3LYP functional and the Minnesota functionals are well-known examples of empirical functionals [8; 10; 39; 40; 41; 42]. Recently, approaches from machine learning (ML) have taken the empirical approach to functional development to its logical extreme. In a seminal paper by Snyder et al. the idea of using ML to connect density and kinetic energy density of a 1D model system is introduced [43]. The success of this approach inspired substantial interest and subsequent development of employing ML in many different ways related to DFT. An extensive review is beyond the scope of this work, but examples include the use of ML to develop molecular dynamics force-fields [44; 45; 46], application of ML models to reproduce DFT results without the use of expensive QM calculations [47; 48; 49; 50; 51; 52; 53; 54], application of ML to improve the accuracy or speed of DFT [55; 56; 57; 58] and direct inclusion of ML models in the construction of density functionals [59; 38; 60; 61; 62]. These numerous strategies have illustrated the substantial promise of ML techniques in the field of functional development.

The key concept of ML-based density functionals is that highly-flexible “universal” regression models with thousands or millions of parameters are applied to connect a model space to the exchange-correlation energy. The parameters are optimized using a large amount of known data from experiment or calculations. This strategy does not require any knowledge of the complex underlying physics, but instead the challenge arises from obtaining a sufficient amount of high-quality data and utilizing verification and validation approaches to avoid over-parameterization. Machine learning models also have the advantage of being systematically improvable by addition of training data and/or increase of the regression model complexity (i.e. increase of the number of fitting parameters). Some common choices of regression models are support vector regressors (SVR) [63], kernel ridge regression (KRR) [64], Gaussian process regressors (GPR) [65], and artificial neural networks (NN) [66; 67]. Neural networks are a particularly interesting and popular class of non-linear regression models due to their property of being theoretically capable of approximating any function to arbitrary accuracy, as proved by the universal approximation theorem [68; 69]. The complexity of a NN can be easily tuned by adding/removing neurons and layers, and prediction is very fast once training is complete. The quality of the approximation will ultimately depend on the amount of training data available and the heuristics applied during the training process, but in principle NN’s provide a route to a systematically-improvable regression model to connect a given model space to the exchange-correlation energy.

In this work, we combine the ideas of convolutional fingerprinting of electronic environments and neural networks to propose a functional design framework with systematically improvable model spaces and regression-based functionals. We show that 3D convolutions can be used to re-formulate finite-difference DFT and are theoretically complete in the trivial limit that the convolutions are equivalent to the input density. We also developed a specific class of convolutional kernels to extract features (or “descriptors”) and form model spaces that are complete and rotationally invariant, inspired by the work of Worral et al. [70] and Applequist [71]. This is combined with NN regression models and exchange-correlation (XC) data from the B3LYP hybrid functional for a range of small molecule systems to construct “surrogate” functionals. These surrogate functionals are inspected based on their accuracy as compared to the grid-projected B3LYP XC functional, which is chosen to be the ground truth in this study since it is widely used and there is no semi-local closed form for the exact exchange energy. The resulting convolutional surrogate functionals are orbital-free and have systematically increasing non-locality, providing a route to test the systematic improvement of the resulting functionals and gain insight into the locality of electron exchange. The results indicate that accuracy increases significantly with the size of the model space, and that it is possible to reproduce B3LYP energies to within chemical accuracy using a semi-local orbital-free functional with a range of $>$ 0.1 Å. However, practical challenges remain in the finite-difference representation of all-electron systems, and access to spatially-resolved exchange-correlation data is currently limited. These obstacles are non-trivial, but addressing them represents an alternative strategy for XC functional development.

II Methods

The DFT data are generated with the Psi4 package [72]. Single-point spin-paired calculations are performed at the B3LYP/aug-cc-pvtz level with both density and energy convergence set to $10^{-12}$ Ha. The geometries of molecules are taken from computational chemistry comparison and benchmark database (CCCBDB) maintained by NIST [73] and are static for all calculations. The training set consists of 15 small-molecule systems: \ceC_2H_2, \ceC_2H_4, \ceC_2H_6, \ceCH_3OH, \ceCH_4, \ceCO, \ceCO_2, \ceH_2, \ceH_2O, \ceHCN, \ceHNC, \ceN_2, \ceN_2O, \ceNH_3, \ceO_3. These molecules contain 4 common atom types (C, H, O, N) and a diverse range of single, double, and triple bonds between them. An additional 7 molecular systems that have similar chemistry are used as an independent test set: \ceCH_3CN, \ceCH_3CHO, \ceH_2CCO, \ceH_2CO, \ceH_2O_2, \ceHCOOH, \ceN_2H_4. In addition, 3 extra molecular systems with different chemistry, \ceCH_3NO_2, \ceNH_2CH_2COOH (glycine), \ceNCCN, are used to test the models’ ability to extrapolate. This “extrapolation set” is not included in the accuracy analysis, but is used to probe the generality of the model in Sec. III.3.3. The converged electronic density ( $\rho$ ) and exchange-correlation energy density ( $\epsilon_{xc}$ ) of the systems are projected onto a uniform 3D finite-difference grid and stored as 3D arrays. The overall size of each grid is 10 Å $\times$ 10 Å $\times$ 10 Å with the molecule centered in the cell, and the grid-point spacing is 0.02 Å. This results in a total of $500^{3}=125,000,000$ data points per system. Due to memory limitations, domain decomposition with sub-grids of 2Å $\times$ 2Å $\times$ 2Å are used for data manipulations including descriptor extraction, sub-sampling, and prediction. The scipy [74] implementation of fast Fourier transform convolution (FFT convolution) is used to extract electronic environment descriptors. To ensure correct padding of the convolutions, each sub-grid is combined with the 26 neighboring sub-grids prior to FFT convolutions.

The resulting data for each system is sub-sampled to reduce the computational burden of training. A “near-uniform” sub-sampling algorithm is developed to produce roughly uniform sampling density across the high-dimensional space. This improves sampling efficiency by ensuring that rare data points from the tails of the distribution are included in the training sample. An illustration of the procedure is shown in Figure 1 and detailed explanations of the procedure can be found in the Supplementary Information.

The near-uniform sub-sample is supplemented with a random sub-sample to provide information about the relative frequency of points in the original distribution. A fixed random sub-sample size of 10,000,000 in total for all training systems is selected, resulting in total sub-sample size of roughly 680,000 training points per molecule. The remaining 124,320,000 points in each molecular system are used to test the resulting models, corresponding to roughly 0.55% training data and 99.45% testing data, where $<$ 0.05% of the training data is selected non-randomly to ensure that the tails of the distribution are represented. This is supplemented with fully independent validation sets from 7 additional molecular systems that are not used in model training; these systems are considered “test” systems in the remainder of the paper, while molecular systems used in model development are referred to as “training” systems although only 0.55% of their data is actually used to train the NN models. The 3 additional molecular systems in the “extrapolation set” are also used to test the model and no points from these systems are used in training.

The machine-learning models are constructed using a framework similar to $\Delta$ machine learning [75] based on residuals from a re-fitted Vosko-Wilk-Nusair (VWN) [76] LDA model. The model is formulated as:

[TABLE]

where $\tilde{\epsilon}_{xc}(\vec{x})$ is the predicted XC energy density, $E_{r-VWN}$ is the energy of a VWN LDA model with the numerical values of its parameters ( $C_{1}$ , $\gamma$ , $\alpha_{1}$ , $\beta_{1}$ , $\beta_{2}$ , $\beta_{3}$ , $\beta_{4}$ ) re-fitted to 1,000,000 randomly-selected data points from the B3LYP training systems. This re-parameterization is achieved with the Nelder-Mead algorithm [77] as implemented in scipy [74] and the same r-VWN model is used for all surrogate functionals. The $E_{NN}$ term corresponds to a NN with a set of input descriptors, $\lambda_{i}$ and weights $W_{jk}$ . The weights are optimized using the Adam algorithm for stochastic gradient descent [78] as implemented in the Keras package [79]. A standard NN architecture of 2 hidden layers with 100 neurons and the ReLU activation function [80] is used for consistency. The training data is divided into separate steps with different learning rates and loss functions; details are available in the Supplementary Information.

The accuracy of the resulting models is assessed with three different error metrics at the chemical system level: sum of local absolute error ( $\varepsilon_{absolute}$ ), energy prediction error ( $\varepsilon_{predict}$ ) and formation energy prediction error ( $\varepsilon_{formation}$ ), each probing different aspects of the models. The sum of local absolute error corresponds to the integral of absolute difference between the predicted and actual XC energy density:

[TABLE]

This is a straightforward definition of the error from the perspective of model training, and is proportional to the mean absolute error. It directly probes the absolute accuracy of the model for a system and gives an upper bound for energy prediction error. The energy prediction error corresponds to the error of the integral of the XC energy density over the system, as approximated by the sum of the predicted energy at all grid points:

[TABLE]

This is proportional to the mean signed error, and cancellation of error will result in errors lower than the sum of local absolute error. The final metric of formation energy prediction error is the most practical since these relative quantities are most relevant in chemistry, and cancellation of systematic errors is a common feature of DFT functionals. The predicted formation energy error is obtained by computing the formation energy of each species relative to the following atomic reference states that are commonly employed in DFT studies: C: \ceCH_4, N: \ceNH_3, O: \ceH_2O, and H: \ceH_2. Formation energy errors enable the most cancellation of error, but anti-cancellation is also possible in the case of over-fitted models. Hence, formation energy errors compared to local and system level errors provide a convenient and practical measure of model accuracy and over-fitting.

The code for constructing descriptors, training, and evaluating models is available via the supporting information.

III Results and Discussion

The results are presented in three parts. In Sec. III.1 the theoretical motivation for using convolutions to construct model spaces is presented by re-formulating the XC energy in terms of convolution kernels, and some advantages and limitations are discussed. In Sec. III.2 a specific class of “convolutional descriptors” based on spherical harmonics are applied to the dataset of small molecules. In Sec. III.3 the machine learning framework for XC energy is introduced through discussion of both the re-parameterized VWN model and the NN models that are used to fit the residuals. The accuracy of models based on various model spaces are presented and discussed.

III.1 Convolutional reformulation of exchange-correlation functional

The theoretical motivation for using convolutions to form a basis set for XC model spaces is based on two properties: systematic increase of spatial range, and theoretical completeness in the limit that the range is equal to the system size. The spatial range can be increased by increasing the maximum size of the convolution kernels. To show completeness we re-formulate the XC functional, $E_{xc}[\rho(\vec{x})]$ , as a function of convolutions between a set of kernels and the electron density. We show that the functional is equivalent to the function with the proper choice of kernels in the case where (i) the density is discretized onto a finite-difference (FD) grid, (ii) the maximum kernel size is equal to the system size, and (iii) the number of convolutions is equal to the number of points in the finite difference grid. This equivalence between functionals and functions for numerical representations in DFT has been noted before [81]; here we briefly examine the specific case of finite difference representations and convolutions. We restrict the discussion to the spin-paired case for simplicity, but the same arguments hold in the case of spin-polarized systems. Similar arguments are expected to hold for any energy functional including the kinetic energy or full system energy, though we focus only on XC energy here. First, we consider spatially-resolved XC energy densities:

[TABLE]

where $\epsilon_{xc}[\rho(\vec{x})](\vec{x})$ is the exchange-correlation energy density defined at each point $\vec{x}$ in space. The existence of a locally-resolved XC energy density and methods to extract it have been examined previously [82], although extracting this quantity presents a practical challenge for wave-function theories. Next, consider a finite-difference representation of the electron density:

[TABLE]

where the density is represented on a 3D grid and $x$ , $y$ , $z$ are indices of each voxel along the (x, y, z) Cartesian axes. We note that these indices can be “unraveled” such that $\rho^{xyz}$ can be considered as a 1-dimensional vector of $N^{3}$ points with a single index, although it is conceptually simpler to consider $\rho^{xyz}$ as a 3-dimensional array of voxels. Each voxel has dimensions of $h_{x}$ , $h_{y}$ , $h_{z}$ Å and a corresponding volume of $v=h_{x}h_{y}h_{z}$ Å3; for simplicity we consider the isotropic case where $h_{x}=h_{y}=h_{z}=h$ . The finite difference representation is chosen because it is intuitive, convenient for convolutions, commonly used in solid-state codes [83; 84; 85], and systematically converges to the exact density in the limit of $h_{i}\rightarrow 0$ . The XC energy can also be written in terms of a finite difference basis:

[TABLE]

where we have exploited the fact that a functional becomes a function when its argument is represented in a numerical basis (i.e. the XC energy density at each grid point is a function of the value of the electron density at every grid point). Finally, we introduce the concept of density convolutions:

[TABLE]

where $\lambda_{q}^{xyz}$ represents a vector of “convolutional descriptors” (indexed by $q$ ) spatially-resolved at each grid point (indexed by $xyz$ ), and $C_{q}$ is an arbitrary convolution kernel of size $n_{x}\times n_{y}\times n_{z}$ . For convenience we restrict discussion to the case where $n_{x}=n_{y}=n_{z}=n$ , and $n$ is odd, such that we can define the range $r_{q}$ of a convolution kernel $C_{q}$ as $r_{q}=h(n-1)/2$ . Furthermore, we consider only the case of periodic boundary conditions to avoid issues of padding. In this case $\lambda_{q}^{xyz}$ is always the same dimension as $\rho^{xyz}$ , corresponding to a feature vector indexed by $q$ at each grid point $xyz$ . The restriction to periodic boundary conditions is a minor limitation, considering that any finite system can be represented as a periodic system with sufficient vacuum padding; this is commonly exploited in plane-wave codes. Considering a set of convolution kernels produces a set of $N_{d}$ ( $q\leq N_{d}$ ) local descriptors for a point $xyz$ in space. These descriptors capture information out to a distance of a total range $R=max(r_{q})$ . If the largest dimension of the unit cell of the system is given by $L_{max}$ then in the limit of $R\rightarrow L_{max}/2$ and $N_{d}\rightarrow n^{3}$ the full non-local density can be recovered by using $n^{3}$ delta-function kernels:

[TABLE]

where $\delta_{xyz}=1$ if $xyz=lmn$ , 0 otherwise and $\tilde{\lambda}_{xyz}^{lmn}$ is the fully non-local descriptor set, equivalent to “unraveling” the entire density grid as a vector (indexed by $xyz$ ) at each spatially-resolved grid point (indexed by $lmn$ ). Substitution into Eq. 7 yields:

[TABLE]

This expression is a trivial re-statement of Eq. 7, but it has the advantage of being a system-independent mapping between a locally-centered electronic environment (as characterized by its convolutional descriptors) and a corresponding local XC energy density. However, in practice Eq. 10 is no more efficient or practical than Equation 7 since both ultimately require a fully non-local 6-dimensional evaluation of the energy functional. However, Eq. 10 provides a natural starting point for establishing controlled orbital-free approximations to the XC energy density based on sets of descriptors $\lambda_{q}^{lmn}$ where $R<<L_{max}/2$ and $N_{d}<<n^{3}$ .

[TABLE]

The two most common classes of orbital-free XC functionals, LDA and GGA, are easily reformulated in terms of convolutional descriptors. For example, the LDA functional approximates the exchange-correlation energy at a point $\vec{x}$ with the XC energy of the homogeneous electron gas with density equivalent to that point:

[TABLE]

or, in convolutional notation:

[TABLE]

where $\lambda_{0}^{lmn}=(\delta_{lmn}\circledast\rho)^{lmn}=\rho^{lmn}$ . In the case of GGA functionals the XC energy density depends on the density and its gradient:

[TABLE]

or, in convolutional notation:

[TABLE]

where $\lambda_{0}^{lmn}=\rho^{lmn}$ as before, and $\lambda_{1}^{lmn}=(\nabla_{1}\circledast\rho)^{lmn}$ where $\nabla_{1}$ is a finite difference stencil corresponding to the gradient. This pattern can be generalized to higher-order derivatives to produce a class of convolutional XC functionals based on differential stencils:

[TABLE]

where $\nabla_{q}^{xyz}=(\nabla_{q}\circledast\rho)^{xyz}$ and $\nabla_{q}$ is the $q^{\mathrm{th}}$ unmixed partial derivative stencil, with $\nabla_{0}\equiv\delta_{xyz}$ . Thus, $\epsilon_{\nabla 0}^{lmn}$ corresponds to any fully local functional (e.g. LDA) and $\epsilon_{\nabla 1}^{lmn}$ corresponds to GGA functionals, while $\epsilon_{\nabla 2}^{lmn}$ corresponds to functionals that include the Laplacian [86], etc. The idea is analogous to that of Taylor series expansion in that it uses linear combinations of different orders of derivatives to approximate a function. Hence the functional should become more accurate as higher order derivatives are included and longer-range information is taken into account. However, the approach suffers from two issues, in theory and in practice: it’s hard, if not impossible, to construct isotropic or rotation-invariant stencils for higher order derivatives, and higher order derivatives tend to become numerically unstable with practical grid spacing. It is clear that the descriptors must stay constant as the system rotates and translates. In other words, the stencils need to be rotation- and translation invariant. The magnitude of gradient operator and the Laplacian operator (trace of the Hessian) that are effectively adapted in GGA and mGGA, respectively, are known to be isotropic. However, the invariant norms of higher-order derivative tensors are not well known. Furthermore, issues with numerical stability are encountered when computing the Laplacian and higher order derivatives, even with analytical basis sets. For these reasons most functionals based on $\nabla_{2}^{xyz}$ have been abandoned for the “meta-GGA” approach, which substitutes the kinetic energy density, $\tau^{xyz}$ , for $\nabla_{2}^{xyz}$ [87]. The kinetic energy density is not orbital-free, and this transition deviates from the formalism of the Taylor expansion, making it unclear how to systematically improve model spaces beyond mGGA.

III.2 Maxwell-Cartesian spherical harmonic descriptors

Prior to selecting a convolutional descriptor set to fingerprint electronic environments, it is important to first define the necessary properties of the descriptor set. First the descriptor set needs to be complete. This means, at the limit of taking all the descriptors from the set, they should form a complete basis that can describe all possible variations in the electronic environment, or that of any 3D function in general. This complete set would be infinite, therefore the set should have a clear entry point and a systematic route toward convergence. Furthermore, the descriptors need to be invariant under translation and rotation, consistent with the symmetries of the Hamiltonian.

To find such descriptor set, the variation of 3D functions is decomposed into two parts: variations in angular coordinate (rotational variation) and variations in the radial coordinate (radial variation). The Maxwell-Cartesian spherical harmonics (MCSH) capture the rotational variations, and the radial variation is captured by varying a cutoff distance. This approach is inspired by the circular-harmonic-based 2D rotationally equivariant features developed by Worral et al. [70] that has been generalized to 3D by the work of Thomas et al. [88], as well as Applequist’s work [71] on MCSHs. Specifically, the descriptor set is defined as follows:

[TABLE]

where $M$ denotes the descriptors, $P(i,j,k)$ denotes the permutation group of $i,j,k$ , $n=i+j+k$ is the order of the descriptor, and $\mu_{r,ijk}$ is the convolution result using spherical harmonic $S_{ijk}$ with cutoff distance $r$ as the stencil:

[TABLE]

$f_{r}$ is a step function that controls the cutoff distance, and $S^{(n)}_{ijk}$ is the Maxwell-Cartesian spherical harmonic:

[TABLE]

where $m=m_{1}+m_{2}+m_{3}$ , and

[TABLE]

Thus, each descriptor is the Euclidean norm of $\mu_{r,ijk}$ with all possible combination of $i,j,k$ , and the whole set can be written as:

[TABLE]

The first four orders of the MCSHs are listed in Table 1 and illustrated in Figure 2. The detailed properties of MCSH are introduced in the work of Applequist [71]. The MCSHs are used to construct the descriptors because it is known that spherical harmonics form a complete basis for functions defined on the 3D unit sphere. This idea is analogous to that of multi-pole expansion, where the original 3D function is expressed as a linear combination of terms with progressively finer angular features [89]. Examining the MCSHs in Figure 2 reveals that the order 0 MCSH corresponds to the monopole, and captures features that are constant and independent of angle (order 0 angular feature); the order 1 MCSH corresponds to the dipole, and captures features that vary once, from positive to negative, with angle (order 1 angular features); order 2 MCSH corresponds to the quadrupole, and captures features that varies more quickly with angle (order 2 angular features), and so on. Any rotational variations can be approximated by this linear combinations of angular features of different order, and will be exact in the limit of the entire series. The descriptors are empirically verified as rotation-invariant (see Supporting Information) and the mathematical proof is in progress but is beyond the scope of this work. In addition to the rotational variations it is necessary to capture radial variations. This is achieved by taking the rotation-invariant descriptors with different cutoff distances through the cutoff function $f_{r}$ in Eq. 19, where cutoff radii are discretized based on the underlying finite-difference grid. We conjecture that this descriptor set provides a complete basis on the rotation-invariant subspace of the 3D finite difference grid. Moreover, $\mu_{0,000}^{2}$ , which is equivalent to fully local information (i.e. $\rho$ ), is the clear entry point of this descriptor set. There are 3 directions for systematic expansion: higher orders of MCSH, longer cutoff distances, and a finer grid for discretization. In this work we fix the grid spacing and explore the impact of higher orders of MCSH and longer cutoff distances.

The MCSH descriptors provide a route to include semi-local and non-local information about a local electronic environment. MCSH descriptor sets are defined by their maximum range ( $R$ ) and the maximum order of spherical harmonics ( $n$ ) that is the same as the maximum order of angular features captured, and are denoted as $\bar{\lambda}_{R}^{(n)}$ . Here, we consider 13 MCSH descriptor sets with ranges of 0 Å , 0.02 Å, 0.04 Å , 0.08 Å and 0.2 Å, and orders of 0, 1 and 2. The descriptor sets are designed such that information of longer range and higher order of angular feature are gradually added:

[TABLE]

The total number of descriptors for each of the descriptor sets are listed in Table 2.

III.3 Regression models for exchange-correlation energy

The functional form linking the MCSH descriptors and the exchange correlation (XC) energy density is not known. While reductionist approaches may be feasible, the empirical approach is more pragmatic since the physical meaning of the descriptors is not obvious. In this work we employ a machine-learning strategy based on a function with two terms: a local-density term based on a re-parameterization of the VWN functional form of LDA (r-VWN), and a descriptor-based term using a NN with ReLU activation functions. The r-VWN term is static for all regression models, and the NN is trained using the residuals of the r-VWN model (Eq. 2); this is similar to the $\Delta$ machine learning strategy proposed previously [75]. This section first discusses the results of the r-VWN model and a NN based solely on the local density, and subsequently addresses the performance of NNs based on the convolutional descriptors.

III.3.1 r-VWN and NN LDA model

The domain and range of the electron density and corresponding XC energy density span over 12 orders of magnitude, as seen in Fig. 3. This creates a substantial numerical challenge for machine-learning models since most implementations rely on double precision floats with a machine epsilon of $\sim 10^{-16}$ . Practically, the situation is somewhat better, since the vast majority of the distribution (99.3%) falls between $10^{-6.5}-10^{0}$ (Fig. 3). However, even relatively small errors in the high-density region can have a substantial impact on the system-level energy, and training a machine-learning model that is accurate across this span is challenging. Nonetheless, physical models are known to approximate the energy density across this large domain/range. The VWN parameterization of the LDA model achieves this by using an analytical function that reproduces the behavior of the homogeneous electron gas (HEG)[76]:

[TABLE]

where $C_{1},\gamma,\alpha_{1},\beta_{1},\beta_{2},\beta_{3},\beta_{4}$ are the parameters, $r_{s}$ is the Wigner–Seitz radius defined as:

[TABLE]

$G$ is defined as:

[TABLE]

The behavior of the HEG is qualitatively similar to the B3LYP training data, but there are quantitative discrepancies. These residuals were minimized using non-linear optimization with the VWN parameters as initial guesses (see Sec. II) providing an improved LDA approximation (r-VWN) to B3LYP for molecular systems of interest. The refitted model is illustrated in Fig. 3b, and the magnitude of the residuals is reduced by an order of magnitude as compared to the original energy density (Fig. 3a,c).

The r-VWN model is applied to the test and training sets, and the results are compared to the VWN and common PBE GGA functionals in Fig. 4. The results show that the energy prediction error stays approximately constant ( $MAE_{VWN}$ = 67.44 eV, $MAE_{r-VWN}$ = 69.18 eV). Interestingly, both VWN and r-VWN have lower energy errors than PBE, and the “test” set has higher errors on average for both VWN and PBE despite the fact that they are not trained on the training set (i.e. the test set has inherently larger system-level errors). The trend between VWN and r-VWN is similar for the formation energy errors, where both systematically underestimate the B3LYP formation energy, and the r-VWN model has less systematic error ( $MAE_{VWN}$ = 1.69 eV, $MAE_{r-VWN}$ = 1.23 eV). The magnitude of formation energy errors are also $\sim$ 2 orders of magnitude smaller than the system-level errors, due to cancellation of error. The formation energy errors for PBE are somewhat lower than even the r-VWN model, opposite of the trend for system-level energies, indicating that PBE relies more on cancellation of error between systems than the LDA models.

The residual learning framework is also applied to the LDA model space ( $NN[\bar{\lambda}_{(0.00)}^{(0)}]$ ) to provide a control for the convolutional descriptor models. Although the $r-VWN$ model doesn’t show significant improvement in accuracy, it provides a good approximation to B3LYP XC energy density, and learning the residual is easier than learning the energy density directly due to a reduction in the range of the dependent variable. The results, also shown in Fig. 4, show a significant improvement in the system-level energy and a less dramatic but still significant improvement in the formation energy. Interestingly the formation energies from the NN model are more accurate than the GGA results, despite the fact that the GGA model space contains more information. The cancellation of error indicates that the NN model is not over-fit, and the improved performance indicates that there is room for improvement of LDA models by increasing flexibility, consistent with prior studies [59]. However, as shown in Figure 3(c), the local energy residuals for the $r-VWN$ model shows three tails, which are attributed to the core regions of C, O, and N. In both cases the main error arises from the fact that the local energy is not a single-valued function of the local density, especially in the core regions. This suggests that the limiting factor to further improve accuracy is not the flexibility of the XC model, but the information contained in the model space. This motivates the inclusion of additional descriptors (Sec. III.3.2).

To get more insight into the origins of the errors for the $r-VWN$ and $NN[\bar{\lambda}_{(0.00)}^{(0)}]$ functionals we examine the contribution to the system-level error as a function of density. From Figure 3 it is clear that the domain and range of both the electron and energy density span many orders of magnitude, and that there is a wide range in the number of points that occur at different electron densities, with the vast majority falling between $10^{-6.5}-10^{0}$ . The system-level energy is ultimately computed by integrating (approximated by summation) the local energy density, hence the system-level error will depend on a trade-off between the size of the error at a given density and the number of points with that density. This is illustrated in Fig. 5, where the contribution to the system-level error is plotted as a function of the electron density. The results indicate that for the r-VWN model nearly all of the system-level error occurs in the density region of $10^{-3}-10^{1}$ eV/Å3, corresponding to the valence/bonding regions of the molecular system. This is intuitive, since this is where chemical bonding occurs, and where there are an appreciable number of data points (29.5%), the energy density is relatively large ( $10^{-4}-10^{1}eV/$ Å3), and is multi-valued (see Fig. 3(c)). In comparison, the error from the NN[LDA] model is much smaller, and concentrated in the region of $10^{-1}-10^{1}eV/$ Å3. This is attributed to the near-core regions where the energy density is relatively large and multi-valued, making it impossible for the neural network to capture the behavior.

This trade-off between the error and the number of sample points highlights the importance of the inclusion of randomly sampled data in the sub-sampling routine, and the selection of a proper choice of objective function during training. The randomly sampled data effectively weights the error at each density by the number of points similar to the weight that will be used to compute the system-level energy. Enough randomly sampled data must be included to ensure that regions with low electron/energy density contribute to the objective function, but if too much randomly-sampled data is included it will overwhelm the contribution from the tails of the distribution. The ratio of random to uniform sub-samples is chosen heuristically to minimize the system level error in this work. A multi-step training process is also employed, with multiple types of objective function. The details of the training procedure can be found in the Supplementary Information. Ultimately, the results of the $r-VWN$ and $NN[\bar{\lambda}_{(0.00)}^{(0)}]$ models indicate that information beyond the local electron density must be included in the model space to significantly improve accuracy.

III.3.2 NN models with MCSH descriptors

To capture more non-local information about the electronic environments MCSH descriptors are used, and NN models are applied to connect the descriptors to the energy density. For each set a NN model with 2 hidden layers of 100 nodes each and ReLU activation functions is used as the non-linear model for the XC energy density, denoted as NN[ $\bar{\lambda}_{R}^{(n)}$ ] where $R$ denotes the cutoff radius and $n$ denotes the order of the MCSHs used (see Eq. 17). Each NN model is trained using a consistent training procedure, as described in the Supplementary Information. The hypothesis that including more semi-local information in the model space will systematically improve the model accuracy is tested by comparing system-level sum of local absolute error ( $\varepsilon_{absolute}$ ), energy prediction error ( $\varepsilon_{predict}$ ) and formation energy prediction error ( $\varepsilon_{formation}$ ) as defined in Section II with a consistent NN architecture. Systematic improvement is defined as improvement for each individual system without exception, while general improvement is defined as a decrease in the mean absolute error.

The results for general improvement are shown in Figure 6, indicating that the general accuracy always improves as more descriptors are added. The detailed result for each model and system are given in the Supporting Information. In this section we focus on the models’ performances for the 15 training and 7 test systems; the 3 extrapolation systems are described in Sec. III.3.3. Based on Figure 6(d), general improvement in the model accuracy is observed as angular features from zero-th order to first order are included and the range is increased from 0 to 0.2 Å. The inclusion of the first-order angular features has a drastic impact, where the accuracy of the first-order model with a range of 0.02 Å is comparable to the zeroth-order model with a range of 0.08 Å. The first-order angular feature is needed to express the reduced gradient, and the grid spacing is 0.02 Å, so the NN[ $\bar{\lambda}^{(1)}_{0.02}$ ] model is analogous to the GGA model space. A further and substantial improvement is observed as the range is increased, with a minimum MAE of 0.061 eV achieved at a range of 0.2 Å. The inclusion of descriptors of second-order angular features further improves the model, particularly at short ranges, but the improvement is less drastic.

The results for the systematic improvability test for the sum of absolute error are shown in Figure 7, where the number represents the maximum increase in error for any given system when the order of the angular feature or spatial range is increased; a value of 0 represents a systematic improvement since the error of every system is decreased without exception. The results show that systematic improvement is often, but not always, observed when additional descriptors are added. In particular, systematic improvability is not observed for zeroth-order angular features as range is increased from 0 to 0.08 Å. This is hypothesized to occur because the added descriptors contain relatively little additional information, causing statistical noise to play a larger role in training. The randomly-selected training data represents only 0.55% of the total data, and neural networks are initialized with random weights, causing the resulting models to favor some electronic environments over others due to randomness. This could possibly be overcome by using a static training set and a systematic strategy for initializing the neural networks, or by adding descriptors with more information. The latter strategy is shown to work here, as systematic improvability is achieved when higher-order angular descriptors and/or longer-range information are included. One exception to this trend is observed when moving from descriptors of first-order angular features to that of second-order angular features at a range of 0.2 Å. This is attributed to the fact that the dimensionality of the descriptor space increases substantially from 21-41, but the flexibility of the NN model is not increased. The points will be more separated in a higher-dimensional space, as supported by the fact that the number of uniformly-sampled data from the $\bar{\lambda}_{(0.2)}^{(n)}$ sets are an order of magnitude higher than for the $\bar{\lambda}_{(\leq 0.08)}^{(n)}$ descriptors (see Supplementary Information). This forces the model to interpolate over larger distances, and will generally require a more complex NN model. These results provide evidence supporting the hypothesis that systematic XC model improvement can be achieved by systematically increasing the range and rotational order of convolutional descriptors, though optimization of the regression model architecture and training procedure is an important consideration.

Systematic and general improvement of the absolute error is promising, but the physical quantities of energy prediction error and formation energy prediction error are of practical interest. The absolute error provides an upper bound for these quantities, which can take advantage of cancellation of error within a single system (energy prediction error) and across systems (formation energy prediction error). Indeed, the general accuracy of these quantities is greatly improved as compared to the absolute error as shown in Figure 6. The MAE of the prediction error reaches “chemical accuracy” (0.043 eV) at a spatial range of 0.08 Å for the first-order rotational descriptors, and 0.04 Å with second-order rotational descriptors. A longer range is needed to reduce the maximum error to below chemical accuracy, but this can be achieved with both first- and second-order rotational descriptors at a range of 0.2 Å. While this general decrease in error is promising, it should be noted that the systematic improvement is not observed at the prediction energy level. This arises due to the fact that cancellation of error plays a varying role in different chemical systems depending on the frequency with which different electronic environments occur. This variation in cancellation of error will also occur with other types of XC functionals, and explains the tremendous difficulty of achieving systematic improvement in the field of XC functional design. The counter-intuitive nature of cancellation of error is even more apparent when comparing prediction energy errors and formation energy errors. Formation energies generally benefit from cancellation of error across different systems, particularly in the core regions since the atomic composition of a molecule is utilized to compute the formation energy. This is apparent in the general improvement of between the prediction energy and the formation energy. However, when examining systematic improvement it is clear that some systems exhibit drastically larger formation-energy errors than prediction errors. This arises due to the anti-cancellation of error between a reference system and the system of interest, and highlights an additional consideration for the design of functionals with systematic improvements in formation energies. Nonetheless, several models ( $\bar{\lambda}_{(0.2)}^{(1)}$ , $\bar{\lambda}_{(0.08)}^{(2)}$ , $\bar{\lambda}_{(0.2)}^{(2)}$ ) are capable of reducing even the maximum formation-energy error of the convolutional descriptor models to within chemical accuracy.

III.3.3 Outliers and Extrapolation

The results discussed in Sec. III.3.2 are generally positive, although there is a noticeable outlier in the training set for some models, and the “universality” of the model is not clear since the test set contains similar chemistry to the training set. In this section we examine the outlier (\ceC_2H_6) to gain insight into where the approach fails, and attempt to extrapolate the model to three systems with different chemistry: $CH_{3}NO_{2}$ , $glycine$ and $NCCN$ . These compounds contain nitro groups, amine groups, and multiple cyano groups that are not present in the train or test data and hence probe the machine-learning model’s ability to generalize to new chemistries.

First, we address the \ceC_2H_6 system that appears as an outlier in the training set. This is generally surprising, since machine-learning models tend to perform well on data they are trained to. Notably, the $C_{2}H_{6}$ system is not a clear outlier for $\varepsilon_{absolute}$ (Figure 7(a)), which is directly related to the objective functions used in training (see SI), confirming that this is not a failure of the NN training procedure. However, \ceC_2H_6 becomes an outlier in $\varepsilon_{predict}$ (Figure 7(b)), and even more substantially in $\varepsilon_{formation}$ (Figure 6(c)). This indicates that the issue arises due to a lack of cancellation of error in the prediction energy, and/or anti-cancellation of error in the formation energy. This is attributed to a combination of two factors: electronic environments that are not sufficiently distinguished by descriptors and under-representation of these electronic environments. These factors are illustrated using the $NN[\bar{\lambda}_{(0.04)}^{(1)}]$ model. Since domain decomposition method is used in model training (see Sec. II), it is possible to determine that most of the prediction error (1.51 eV out of 1.67 eV for $\varepsilon_{absolute}$ and -1.32 eV out of -1.34 eV for $\varepsilon_{predict}$ ) can be attributed to the $C-C$ bonding region of the system. The electronic environments in this region were projected onto a low-dimensional space using principal component analysis (PCA) and compared to the environments in the entire training set. Figure 8 shows the model error as a function of two principal components, and illustrates that several points have substantially smaller errors for the training set as compared to C-C bonding region of \ceC_2H_6. This indicates that the objective function is multi-valued at these locations in descriptor space, forcing the model to make a tradeoff in accuracy between the two possible outcomes. This tradeoff will depend on the relative frequency of the two types of environments that are present in the training set. In this study \ceC_2H_6 is the only molecule in the training set with a C-C single bond, causing these environments to be under-represented and not favored by the model. This could be remedied to some extent by including more examples of C-C bonds in the training set, though this would simply balance the error between systems rather than reducing it. Alternatively, the inclusion of more descriptors enables the model to distinguish between these environments and reduce the error for both; this is evident from the fact that both the prediction and formation energy errors for the \ceC_2H_6 molecule reduce substantially as higher-order and longer-range descriptors are included (Fig. 6).

The results for the extrapolation set ( $CH_{3}NO_{2}$ , $glycine$ and $NCCN$ ) are shown in Table 3, and it is clear that the errors are generally larger by as much as an order of magnitude. This situation is common in machine learning, and generally arises when the training data is not representative of the test data. This occurs because the NN model can only interpolate between training examples, and will become unreliable if used for extrapolation [90]. In this case the extrapolation system contains several chemical environments that are not observed in the test systems, so it is not surprising that unique electronic environments arise. This is quantitatively illustrated in Fig. 9 in the case of the $NN[\bar{\lambda}_{(0.04)}^{(1)}]$ model for \ceCH_3NO_2, where it is clear that there are a large number of electronic environments that fall outside the domain of the training data. This is found to be consistent across the other extrapolation systems and in higher dimensions, as provided in the Supplementary Information. One straightforward solution to this issue is to add additional training systems that capture the chemistries of interest. This highlights the general limitation of machine-learning models that they are only as good as the data they are trained on. However, the problem can also be mitigated by increasing the dimensionality of the descriptor space. This is evident from the model performance, where the prediction error for these outlier systems is reduced to 0.08 eV, 0.19 eV and 0.12 eV, respectively, for the $NN[\bar{\lambda}_{(0.2)}^{(2)}]$ model. This occurs because as more information about the rotational and radial variations of the electron density is added these outlier systems become more similar to environments that exist in the training data. In these higher-dimensional spaces the new systems appear more like interpolations between existing environments as opposed to extrapolations beyond the domain of all training data. This phenomenon suggests that sufficiently large convolutional descriptor spaces, combined with diverse training data sets and comprehensive testing, may enable the construction of universal machine-learning XC functionals.

IV Conclusions

This work introduces convolutional descriptors as a promising new paradigm for the construction of model spaces for XC functionals. Convolutional descriptors provide a systematically expandable and theoretically complete feature space for constructing XC functionals in a finite difference representation. They are orbital-free and can be computed with $N\log(N)$ computational complexity. Furthermore, convolutional descriptors can be combined with non-linear regression models to construct machine-learning functionals. Using neural networks is particularly promising, since the universal approximation theorem ensures that NNs can represent an arbitrarily complex function. The resulting models are conceptually similar to convNets, suggesting that deep learning approaches are a promising route for functional development. A sub-class of convolutional descriptors, Maxwell-Cartesian spherical harmonics (MCSH) descriptors, were employed to construct and test a range of machine-learned orbital-free approximations to the hybrid B3LYP functional based on data from a total of 25 small-molecule systems containing C, H, O, and N. These descriptors provide a numerically stable and rotationally invariant route to capturing rotational and radial variations in the electron density. The machine-learning models are constructed from model spaces based on the descriptors with increasing range from 0.02 Å - 0.2 Å and progressively finer angular features from zero-order to second-order. The results show that the average accuracy of the models improves as either the range or rotation symmetry is increased. A systematic improvement in the absolute error is typically observed for both training and test sets, but the improvement in system-level energy and formation energy are not systematic due to cancellation of error.

In addition to these promising initial results, this work also identifies several challenges that must be addressed in the construction of XC functionals based on convolutional descriptors and/or machine learning. One challenge that is general to any approach that utilizes localized XC energy density is the ability to generate training data. Machine-learning approaches are most powerful when they are based on data from high-level methods for which no analytical form exists; however, these approaches are typically based on non-local integrals, so projecting the XC energy density to a finite difference grid is challenging. Approaches for this have been reported [91; 82; 92], but implementations are not openly available. Another related challenge is the fact that these high-level methods are typically all-electron, resulting in rapidly varying electron/energy density near the core regions. Accurately representing this with an finite difference grid requires very fine grid spacings (0.02 Å in this work). In this case although the theoretical scaling of convolutions is $N\log(N)$ , the size of $N$ is so large that the approach is much slower than the underlying B3LPY calculation. Similar concerns will be faced for other models seeking to reproduce the results of wavefunction-based theories, and routes to extract the XC contribution of valence electrons will be critical to training models that are consistent with pseudopotentials commonly used in practical DFT calculations. Finally, the optimization and numerical performance of machine-learning models must be considered. In this work NN models were used, leading to challenges in deconvoluting the error due to an insufficient model space and error due to sub-optimal hyperparameters or training procedures. The machine-learning models must also achieve a high accuracy over a large numerical range due to the large number of points with relatively low energy/electron density, and the quantities of interest (e.g. formation energy) rely on cancellation of error and may require specialized objective functions. This may present challenges in the application of out-of-the-box machine-learning models to the problem of XC functionals.

This work indicates that the combination of convolutional descriptors and machine learning models is a theoretically appealing framework for XC functional design. Despite practical challenges, the framework provides a route to empirically investigate fundamental questions about the nature of the XC energy. For example, this work provides empirical evidence that the exact exchange contribution of the B3LYP functional can be represented to within chemical accuracy (of system-level energies) for an orbital-free functional with a spatial range of $<$ 0.2 Å for small C, H, O, N molecules. Further examination of these numerical approximations may provide inspiration for new physical or empirical XC models with improved accuracy and practicality. In addition, there are many possible routes to improvement of the accuracy of these machine learning models. The choice of convolutional descriptors could be improved by inclusion of higher-order spherical harmonics, longer radial distances, decreasing grid-point space, integration with pseudopotentials, or the use of deep-learning convNets to automatically extract the optimal convolutional descriptors from the data. Training data can be extracted from high-level wavefunction theories, and convolutional models can be easily implemented in solid-state codes, providing a route to data-driven wavefunction embedding [93; 94; 95; 96]. Moreover, the MCSH descriptor sets introduced in this study are not specific to electron density, but could be applied generally to any 3D functions that are inherently rotationally invariant. This includes many problems in physics since rotational and translational invariance are common. These exciting possibilities suggest that further research into convolutional-based machine-learning functionals is a worthwhile addition to the already numerous strategies for density functional design.

V Acknowledgements

We acknowledge Daniel G. A. Smith and David Sherrill for assistance in extracting grid-resolved B3LYP XC densities, and Polo Chau and Fred Hohman for assistance in visualizing electron densities and descriptors. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences Computational Chemical Sciences program under Award Number DE-SC0019410. The authors are also grateful for a GPU generously provided by the NVIDIA GPU Grant program that was used to train the neural networks.

Bibliography96

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Hohenberg and Kohn [1964] P. Hohenberg and W. Kohn. Inhomogeneous electron gas. Phys. Rev. , 136:B 864–B 871, Nov 1964. doi: 10.1103/Phys Rev.136.B 864 .
2Kohn and Sham [1965] W. Kohn and L. J. Sham. Self-consistent equations including exchange and correlation effects. Phys. Rev. , 140:A 1133–A 1138, Nov 1965. doi: 10.1103/Phys Rev.140.A 1133 .
3Perdew [2001] John P. Perdew. Jacob’s ladder of density functional approximations for the exchange-correlation energy. In AIP Conference Proceedings . AIP, 2001. doi: 10.1063/1.1390175 .
4Becke [1988 a] A. D. Becke. Correlation energy of an inhomogeneous electron gas: A coordinate space model. The Journal of Chemical Physics , 88(2):1053–1062, January 1988 a. ISSN 0021-9606.
5Perdew and Yue [1986] John P. Perdew and Wang Yue. Accurate and simple density functional for the electronic exchange energy: Generalized gradient approximation. Physical Review B , 33(12):8800–8802, June 1986. ISSN 0163-1829.
6Perdew and Yue [1989] Perdew and Yue. Erratum: Accurate and simple density functional for the electronic exchange energy: Generalized gradient approximation. Physical review. B, Condensed matter , 40(5), August 1989. ISSN 0163-1829.
7Medvedev et al. [2017] Michael G. Medvedev, Ivan S. Bushmarinov, Jianwei Sun, John P. Perdew, and Konstantin A. Lyssenko. Density functional theory is straying from the path toward the exact functional. Science , 355(6320):49–52, jan 2017. doi: 10.1126/science.aah 5975 .
8Becke [1988 b] A.D. Becke. Density-functional exchange-energy approximation with correct asymptotic behavior. Physical Review A , 38(6):3098–3100, 1988 b. ISSN 10502947.