Meta-learning Pseudo-differential Operators with Deep Neural Networks

Jordi Feliu-Faba; Yuwei Fan; Lexing Ying

arXiv:1906.06782·math.NA·February 26, 2020

TL;DR

This paper presents a meta-learning method using deep neural networks to efficiently approximate parameterized pseudo-differential operators, enabling accurate solutions for complex PDEs with limited computations.

Contribution

It introduces a novel meta-learning framework that combines wavelet transforms and neural networks to approximate pseudo-differential operators from minimal data.

Findings

01. Efficient approximation of Green's functions for elliptic PDEs.

02. Accurate modeling of radiative transfer equations.

03. Reduced computational cost for operator evaluation.

Abstract

This paper introduces a meta-learning approach for parameterized pseudo-differential operators with deep neural networks. With the help of the nonstandard wavelet form, the pseudo-differential operators can be approximated in a compressed form with a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned together from a small number of matrix-vector multiplications of the pseudo-differential operator. Numerical results for Green's functions of elliptic partial differential equations and the radiative transfer equations demonstrate the efficiency and accuracy of the proposed approach.

Tables (5)

Table 1: Relative error in approximating the solution to the Schrödinger form in the 1D case.

$\alpha$	$n_{\mathrm{cnn}}$	$N_{\mathrm{params}}$	$\epsilon_{\mathrm{train}}$	$\epsilon_{\mathrm{test}}$	$\epsilon_{\mathrm{op}}$
5	5	30201	4.43e-3	4.74e-3	2.49e-3
5	7	38061	4.83e-3	5.18e-3	4.28e-3
7	5	58717	4.09e-3	4.35e-3	3.28e-3
7	7	74089	4.11e-3	4.42e-3	2.18e-3

Table 2: Relative error in approximating the solution to the Schrödinger form in the 2D case.

$\alpha$	$n_{\mathrm{cnn}}$	$N_{\mathrm{params}}$	$\epsilon_{\mathrm{train}}$	$\epsilon_{\mathrm{test}}$	$\epsilon_{\mathrm{op}}$
11	5	930447	2.21e-2	2.18e-2	4.17e-3
15	5	1226071	2.12e-2	2.10e-2	2.04e-3

Table 3: Relative error in approximating the solution of the divergence form in the 1D case.

$\alpha$	$n_{\mathrm{cnn}}$	$N_{\mathrm{params}}$	$\epsilon_{\mathrm{train}}$	$\epsilon_{\mathrm{test}}$
7	5	58717	7.27e-3	7.76e-3
7	7	74089	7.46e-3	8.29e-3
9	5	96625	6.05e-3	6.88e-3
9	7	122005	6.83e-3	8.21e-3

Table 4: Relative error in approximating the solution to the 1D RTE.

$\alpha$	$K$	$N_{\mathrm{params}}$	$\epsilon_{\mathrm{train}}$	$\epsilon_{\mathrm{test}}$
5	5	34131	2.48e-3	2.93e-3
5	7	41991	2.46e-3	3.01e-3
7	5	66403	1.92e-3	2.45e-3
7	7	81775	2.05e-3	2.36e-3

Table 5: Relative error in approximating the solution to the 2D RTE.

$\alpha$	$K$	$N_{\mathrm{params}}$	$\epsilon_{\mathrm{train}}$	$\epsilon_{\mathrm{test}}$
11	5	1287903	4.39e-3	4.39e-3
15	5	1663831	3.55e-3	3.55e-3

Equations (127)

$$L_{\eta}u(x)=f(x),\quad x\in\Omega\subset\mathbb{R}^{d}$$

$$\mathcal{M}:\eta\to G_{\eta}=L_{\eta}^{-1},$$

$$G_{\eta}\approx W\mathcal{S}[C_{\eta}]W^{\mathsf{T}},$$

$$(\eta_{i},\{f_{ij},u_{ij}\}),$$

$$\varphi_{k}^{(\ell)}(x)=2^{\ell/2}\varphi(2^{\ell}x-k),\quad\ell=0,1,2,\dots,\quad k\in\mathbb{Z}.$$

$$\varphi(x)=\sqrt{2}\sum_{i\in\mathbb{Z}}h_{i}\varphi(2x-i).$$

$$\int_{\mathbb{R}}\varphi(x-a)\varphi(x-b)\,dx=\delta_{a,b},\quad\forall a,b\in\mathbb{Z},$$

$$\sum_{i\in\mathbb{Z}}h_{i}^{2}=1,\quad\sum_{i\in\mathbb{Z}}h_{i}h_{i+2m}=0,\quad m\in\mathbb{Z}\setminus\{0\}.$$

$$\psi(x)=\sqrt{2}\sum_{i\in\mathbb{Z}}g_{i}\varphi(2x-i),$$

$$\psi_{k}^{(\ell)}(x)=2^{\ell/2}\psi(2^{\ell}x-k),\quad\ell=0,1,2,\dots,\quad k\in\mathbb{Z}.$$

$$v_{k}^{(\ell)}:=\int v(x)\varphi_{k}^{(\ell)}(x)\,dx,\quad d_{k}^{(\ell)}:=\int v(x)\psi_{k}^{(\ell)}(x)\,dx.$$

$$v_{k}^{(\ell)}=\sum_{i\in\mathbb{Z}}h_{i}v_{2k+i}^{(\ell+1)},\quad d_{k}^{(\ell)}=\sum_{i\in\mathbb{Z}}g_{i}v_{2k+i}^{(\ell+1)}.$$

$$v^{(\ell)}=(W_{s}^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad d^{(\ell)}=(W_{w}^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},$$

$$\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad v^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix}.$$

$$\begin{array}{ccccccccccccccc}\cdots&\longrightarrow&v^{(\ell)}&\longrightarrow&v^{(\ell-1)}&\longrightarrow&v^{(\ell-2)}&\longrightarrow&\cdots&\longrightarrow&v^{(2)}&\longrightarrow&v^{(1)}&\longrightarrow&v^{(0)}\\ &\searrow&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow\\ &&d^{(\ell)}&&d^{(\ell-1)}&&d^{(\ell-2)}&&\cdots&&d^{(2)}&&d^{(1)}&&d^{(0)}\end{array}.$$

$$\begin{array}{ccccccccccccc}&&\cdots&\longrightarrow&v^{(\ell+1)}&\longrightarrow&v^{(\ell)}&\longrightarrow&v^{(\ell-1)}&\longrightarrow&\cdots&\longrightarrow&v^{(L_{0})}\\ &&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow\\ &&&&d^{(\ell+1)}&&d^{(\ell)}&&d^{(\ell-1)}&&\cdots&&d^{(L_{0})}\end{array}.$$

$$u=Av,\quad\text{equivalently}\quad u(x)=\int a(x,y)v(y)\,dy.$$

$$A_{k_{1},k_{2}}^{(L)}=\int\int\varphi_{k_{1}}^{(L)}(x)\,a(x,y)\,\varphi_{k_{2}}^{(L)}(y)\,dx\,dy.$$

$$D_{1,k_{1},k_{2}}^{(\ell)}$$

$$D_{3,k_{1},k_{2}}^{(\ell)}$$

$$A^{(\ell)}=\left(A_{k_{1},k_{2}}^{(\ell)}\right)_{k_{1},k_{2}=0,\dots,2^{\ell}-1},\quad D_{j}^{(\ell)}=\left(D_{j,k_{1},k_{2}}^{(\ell)}\right)_{k_{1},k_{2}=0,\dots,2^{\ell}-1},\quad j=1,2,3.$$

$$\begin{pmatrix}D_{1}^{(\ell)}&D_{3}^{(\ell)}\\ D_{2}^{(\ell)}&A^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}A^{(\ell+1)}W^{(\ell)},\quad\ell=L_{0},\dots,L-1.$$

$$S^{(L_{0})}=\begin{pmatrix}D_{1}^{(L_{0})}&D_{3}^{(L_{0})}\\ D_{2}^{(L_{0})}&A^{(L_{0})}\end{pmatrix},\quad S^{(\ell+1)}=\begin{pmatrix}D_{1}^{(\ell+1)}&D_{3}^{(\ell+1)}&0\\ D_{2}^{(\ell+1)}&0&0\\ 0&0&S^{(\ell)}\end{pmatrix},\quad\ell=L_{0},\dots,L-1.$$

$$A=WSW^{\mathsf{T}}.$$

$$T^{(L_{0})}=W^{(L_{0})},\quad T^{(\ell+1)}=\begin{pmatrix}W^{(\ell+1)}&W_{s}^{(\ell+1)}T^{(L_{0})}\end{pmatrix},\quad\ell=L_{0},\dots,L-1,\quad W:=T^{(L)}.$$

$$u^{(L)}=A^{(L)}v^{(L)}.$$

$$u^{(L)}=WSW^{\mathsf{T}}v^{(L)}$$

$$\begin{pmatrix}w^{(\ell)}\\ s^{(\ell)}\end{pmatrix}=\begin{pmatrix}D_{1}^{(\ell)}&D_{3}^{(\ell)}\\ D_{2}^{(\ell)}&D_{4}^{(\ell)}\end{pmatrix}\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix},$$

$$u^{(L_{0})}=0,\quad u^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}w^{(\ell)}\\ s^{(\ell)}+u^{(\ell)}\end{pmatrix},\quad\ell=L_{0},\dots,L-1.$$

$$\psi_{1,k}^{(\ell)}(x,y)=\varphi_{k_{1}}^{(\ell)}(x)\psi_{k_{2}}^{(\ell)}(y),\quad\psi_{2,k}^{(\ell)}(x,y)=\psi_{k_{1}}^{(\ell)}(x)\varphi_{k_{2}}^{(\ell)}(y),\quad\psi_{3,k}^{(\ell)}(x,y)=\psi_{k_{1}}^{(\ell)}(x)\psi_{k_{2}}^{(\ell)}(y),$$

$$\begin{pmatrix}d_{1}^{(\ell)}\\ d_{2}^{(\ell)}\\ d_{3}^{(\ell)}\\ v^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad v^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}d_{1}^{(\ell)}\\ d_{2}^{(\ell)}\\ d_{3}^{(\ell)}\\ v^{(\ell)}\end{pmatrix}.$$

Full text

Meta-learning Pseudo-differential Operators with Deep Neural Networks

Jordi Feliu-Fabà (ICME, Stanford University, Stanford, CA 94305), Yuwei Fan (Department of Mathematics, Stanford University, Stanford, CA 94305), Lexing Ying (Department of Mathematics and ICME, Stanford University, Stanford, CA 94305)

Abstract

This paper introduces a meta-learning approach for parameterized pseudo-differential operators with deep neural networks. With the help of the nonstandard wavelet form, the pseudo-differential operators can be approximated in a compressed form with a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned together from a small number of matrix-vector multiplications of the pseudo-differential operator. Numerical results for Green’s functions of elliptic partial differential equations and the radiative transfer equations demonstrate the efficiency and accuracy of the proposed approach.

Keywords: Deep neural networks; Convolutional neural networks; Nonstandard wavelet form; Meta-learning; Green’s functions; Radiative transfer equation.

1 Introduction

Many physical models for scientific and engineering applications can be written in a general form

$$L_{\eta}u(x)=f(x),\quad x\in\Omega\subset\mathbb{R}^{d}, \tag{1.1}$$

for a domain $\Omega$ with appropriate boundary conditions, where $L_{\eta}$ is often a partial differential or integral operator parameterized by a parameter function $\eta(x)$. Solving for $u(x)$ for a given $f(x)$ amounts to representing the inverse operator (sometimes also known as the Green’s function) $G_{\eta}=L_{\eta}^{-1}$ either explicitly or implicitly via an efficient algorithm. Representing $G_{\eta}$, even implicitly, can be computationally challenging, especially for multidimensional problems. The past few decades have witnessed steady progress in developing efficient algorithms for this task.

Problem statement.

This paper is concerned with a more ambitious task: representing the nonlinear map from $\eta$ to $G_{\eta}$

$$\mathcal{M}:\eta\to G_{\eta}=L_{\eta}^{-1}, \tag{1.2}$$

when the operator $G_{\eta}$ is a pseudo-differential operator (PDO) [60]. Although $L_{\eta}$ and $G_{\eta}$ can be linear operators, this map $\mathcal{M}$ from $\eta$ to $G_{\eta}$ is highly nonlinear.

Background.

In recent years, deep learning has become the most versatile and effective tool in artificial intelligence and machine learning, as witnessed by impressive achievements in computer vision [36, 61, 28], speech and natural language processing [29, 56, 51, 57, 11], drug discovery [43], and game playing [54, 15, 58]. Recent reviews on deep learning and its impact on other fields can be found in, for example, [38, 53]. At the center of deep learning, the model of deep neural networks (NNs) provides a flexible framework for approximating high-dimensional functions, while allowing for efficient training and good generalization properties in practice [42, 47].

More recently, several groups have started applying NNs to partial differential equations (PDEs) and integral equations (IEs) arising from physical systems. In one direction, the NN model has been used to approximate solutions of high-dimensional PDEs [37, 55, 14, 49, 4, 6, 13, 25, 32, 41]. In a somewhat orthogonal direction, NNs have been used to approximate the high-dimensional parameter-to-solution map of various PDEs and IEs [31, 26, 18, 17, 19, 33, 25, 2, 39, 20].

Another topic from machine learning that is particularly relevant to this work is meta-learning or learning-to-learn [52, 3, 27, 21, 59]. A meta-learning system learns to produce learning models for new tasks and scenarios from their metadata with little or no new data, by leveraging the common structure among different tasks. Due to the low requirements on new data points, meta-learning has gained a lot of attention in recent years in applications such as vision and reinforcement learning.

Main idea.

Following these recent advances in applying NNs to physical models, this paper takes a deep learning approach for representing the map in Eq. 1.2. The most straightforward solution would be to take a supervised learning approach, i.e., trying to learn the map $\mathcal{M}:\eta\rightarrow G_{\eta}$ from a large set of training data $\{(\eta_{i},G_{\eta_{i}})\}_{i}$ . However, since it is often difficult or even impossible to compute and store $G_{\eta_{i}}$ due to the enormous discretization size, this straightforward supervised learning approach is not practical for Eq. 1.2.

Without explicit access to $G_{\eta}$ , we take a meta-learning approach, i.e., learning to produce, for each new $\eta$ , an NN approximation to $G_{\eta}$ . To do this, we are faced with two key difficulties.

•

How should we represent the output $G_{\eta}$ for an arbitrary input $\eta$ ?

•

How should we represent the training data?

To address the first question, $G_{\eta}$ should be represented in a compressed form. For pseudo-differential operators, several compressed representations exist, including hierarchical matrices [22, 23, 24], discrete symbol calculus [10], etc. In this paper, we choose to represent $G_{\eta}$ with the nonstandard wavelet form introduced in [5]. The main advantage of the nonstandard wavelet form is that the nonzero entries of this compressed representation are simply organized into a small number of vectors. More precisely,

$$G_{\eta}\approx W\mathcal{S}[C_{\eta}]W^{\mathsf{T}},$$

where $W$ is a redundant form of a wavelet transform, $C_{\eta}$ stands for the collection of vectors that contain the nonzero entries of the compressed form, and $\mathcal{S}$ is a certain operator that generates a sparse matrix from the vector collection $C_{\eta}$ . Compared to [5], a key difference is that the current approach allows for $W$ to be fine-tuned for the map $\mathcal{M}$ .

To address the second question, instead of explicitly representing $G_{\eta_{i}}$ , the training data consists of samples of the form

$$(\eta_{i},\{f_{ij},u_{ij}\}),$$

where $u_{ij}=G_{\eta_{i}}f_{ij}$ . For a fixed $\eta_{i}$ , such data can be obtained by solving the equation $L_{\eta_{i}}u_{ij}=f_{ij}$ for each $f_{ij}$ , possibly with a fast algorithm.

Putting these two pieces together, the meta-learning approach of this paper learns the following two key objects from training data of the form $\{(\eta_{i},\{f_{ij},u_{ij}\}_{j})\}_{i}$:

•

a map from $\eta$ to the vector collection $C_{\eta}$ ,

•

the $\mathcal{M}$ -dependent wavelet transform $W$ .

Once trained, for a given test input $\eta$ the architecture calculates $C_{\eta}$ and returns a linear NN that implements $W\mathcal{S}[C_{\eta}]W^{\mathsf{T}}\approx G_{\eta}$ .

Organization.

The rest of this paper is organized as follows. Section 2 briefly reviews the nonstandard wavelet form, used for representing $G_{\eta}$ . In Section 3, the NN architecture of the meta-learning approach is discussed in detail. Section 4 applies the proposed NN to the Green’s function of elliptic PDEs, in both the Schrödinger form and the divergence form. The application to the radiative transfer equation is presented in Section 5.

2 Nonstandard wavelet form

This section summarizes the nonstandard wavelet form proposed in [5]. To make things concrete, compactly supported orthonormal Daubechies wavelets [8] are used as the basis functions as an example.

2.1 Wavelet transform

In the one-dimensional multiresolution analysis, one starts by defining a scaling function $\varphi(x)$ that generates, through dyadic translations and dilations, a family of functions

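$$\varphi_{k}^{(\ell)}(x)=2^{\ell/2}\varphi(2^{\ell}x-k),\quad\ell=0,1,2,\dots,\quad k\in\mathbb{Z}. \tag{2.1}$$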

For each scale $\ell$, the functions $\{\varphi_{k}^{(\ell)}\}$ form a Riesz basis for a space $V_{\ell}$, which satisfies a nested relationship $V_{\ell}\subset V_{\ell+1}$. This nested property of $\{V_{\ell}\}$ implies the following dilation relation of the scaling function

$$\varphi(x)=\sqrt{2}\sum_{i\in\mathbb{Z}}h_{i}\varphi(2x-i). \tag{2.2}$$

For the Daubechies’ wavelets [8], the scaling function $\varphi(x)$ has a compact support $[0,2p-1]$ for a given positive integer $p$ and therefore the coefficients $\{h_{i}\}$ are only nonzero for $i=0,\dots,2p-1$ . The scaling function also satisfies the orthonormal condition

$$\int_{\mathbb{R}}\varphi(x-a)\varphi(x-b)\,dx=\delta_{a,b},\quad\forall a,b\in\mathbb{Z}, \tag{2.3}$$

which leads to an orthonormal condition for the coefficients $\{h_{i}\}$

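$$\sum_{i\in\mathbb{Z}}h_{i}^{2}=1,\quad\sum_{i\in\mathbb{Z}}h_{i}h_{i+2m}=0,\quad m\in\mathbb{Z}\setminus\{0\}. \tag{2.4}$$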
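As a quick numerical check of the orthonormality condition on $\{h_{i}\}$, the sketch below uses the standard four-coefficient Daubechies filter ($p=2$), normalized so that $\sum_{i}h_{i}^{2}=1$. The function names are illustrative only and do not come from the paper.

```python
import numpy as np

def daubechies_p2_filter():
    """Four-tap Daubechies low-pass filter (p = 2), normalized so sum(h**2) == 1."""
    s3 = np.sqrt(3.0)
    return np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))

def shifted_correlation(h, m):
    """Return sum_i h[i] * h[i + 2m], treating h as zero outside its support."""
    shift = 2 * abs(m)  # the sum is symmetric in m, so only |m| matters
    if shift >= len(h):
        return 0.0
    return float(np.dot(h[: len(h) - shift], h[shift:]))

h = daubechies_p2_filter()
unit_norm = shifted_correlation(h, 0)      # expected to be 1 up to round-off
orthogonality = shifted_correlation(h, 1)  # expected to be 0 up to round-off
```

Both quantities follow from the closed-form coefficients, so the check only guards against a mistyped filter or a different normalization convention.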

Given the scaling function $\varphi(x)$ , another important component of the multiresolution analysis is the wavelet function $\psi(x)$ , defined by

$$\psi(x)=\sqrt{2}\sum_{i\in\mathbb{Z}}g_{i}\varphi(2x-i), \tag{2.5}$$

where $g_{i}=(-1)^{1-i}h_{1-i}$ for $i\in\mathbb{Z}$ . A simple calculation shows that the support of $\psi(x)$ is $[-p+1,p]$ and $\{g_{i}\}$ is nonzero only for $i=-2p+2,\ldots,1$ , based on the support of the $\varphi$ and the nonzero entries pattern of $\{h_{i}\}$ . The Daubechies wavelets are then defined as

$$\psi_{k}^{(\ell)}(x)=2^{\ell/2}\psi(2^{\ell}x-k),\quad\ell=0,1,2,\dots,\quad k\in\mathbb{Z}. \tag{2.6}$$

For a function $v(x)\in L^{2}(\mathbb{R})$, its scaling and wavelet coefficients $v_{k}^{(\ell)}$ and $d_{k}^{(\ell)}$ are defined as the inner products with the scaling functions and the wavelets, respectively,

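$$v_{k}^{(\ell)}:=\int v(x)\varphi_{k}^{(\ell)}(x)\,dx,\quad d_{k}^{(\ell)}:=\int v(x)\psi_{k}^{(\ell)}(x)\,dx. \tag{2.7}$$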

Using the recursive relationships of the scaling function Eq. 2.2 and the wavelet function Eq. 2.5, one obtains a recursive relationship of the scaling and wavelet coefficients

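$$v_{k}^{(\ell)}=\sum_{i\in\mathbb{Z}}h_{i}v_{2k+i}^{(\ell+1)},\quad d_{k}^{(\ell)}=\sum_{i\in\mathbb{Z}}g_{i}v_{2k+i}^{(\ell+1)}. \tag{2.8}$$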

By defining $v^{(\ell)}=\left(v^{(\ell)}_{k}\right)_{k\in\mathbb{Z}}$ and $d^{(\ell)}=\left(d^{(\ell)}_{k}\right)_{k\in\mathbb{Z}}$ , Eq. 2.8 can be written in a matrix form

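$$v^{(\ell)}=(W_{s}^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad d^{(\ell)}=(W_{w}^{(\ell)})^{\mathsf{T}}v^{(\ell+1)}, \tag{2.9}$$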

where the operators $W_{s}^{(\ell)}$ and $W_{w}^{(\ell)}:\ell^{2}(\mathbb{Z})\to\ell^{2}(\mathbb{Z})$ are banded with a bandwidth $2p$ due to the support of $\{h_{i}\}$ and $\{g_{i}\}$ . By introducing the orthogonal operator $W^{(\ell)}=\left(W_{w}^{(\ell)}\quad W_{s}^{(\ell)}\right)$ , Eq. 2.9 can be rewritten as

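$$\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad v^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix}. \tag{2.10}$$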

The procedure for computing the wavelet and scaling coefficients can be illustrated in the following diagram

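$$\begin{array}{ccccccccccccccc}\cdots&\longrightarrow&v^{(\ell)}&\longrightarrow&v^{(\ell-1)}&\longrightarrow&v^{(\ell-2)}&\longrightarrow&\cdots&\longrightarrow&v^{(2)}&\longrightarrow&v^{(1)}&\longrightarrow&v^{(0)}\\ &\searrow&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow\\ &&d^{(\ell)}&&d^{(\ell-1)}&&d^{(\ell-2)}&&\cdots&&d^{(2)}&&d^{(1)}&&d^{(0)}\end{array}. \tag{2.11}$$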

The discussion so far concerns wavelets on $\mathbb{R}$. It is straightforward to extend it to functions defined on a finite domain with periodic boundary conditions. If the function $v(x)$ is periodic on a finite domain, for instance $[0,1]$, then the only modification is that all shifts and scalings in the $x$ variable are done modulo the integers. When working with periodic functions, the procedure in Eq. 2.11 usually stops at a coarse level $L_{0}=O(\log_{2}(p))$ before the wavelet and scaling functions start to overlap themselves.

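$$\begin{array}{ccccccccccccc}&&\cdots&\longrightarrow&v^{(\ell+1)}&\longrightarrow&v^{(\ell)}&\longrightarrow&v^{(\ell-1)}&\longrightarrow&\cdots&\longrightarrow&v^{(L_{0})}\\ &&&\searrow&&\searrow&&\searrow&&\searrow&&\searrow\\ &&&&d^{(\ell+1)}&&d^{(\ell)}&&d^{(\ell-1)}&&\cdots&&d^{(L_{0})}\end{array}. \tag{2.12}$$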
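The periodic transform can be sketched with dense matrices, which is convenient for checking orthogonality even though a practical implementation would exploit the banded structure. The sketch assumes the four-coefficient Daubechies filter ($p=2$) and the relation $g_{i}=(-1)^{1-i}h_{1-i}$ stated above; `wavelet_matrices`, `forward_transform`, and `inverse_transform` are hypothetical helper names.

```python
import numpy as np

S3 = np.sqrt(3.0)
H_DAUB4 = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))

def wavelet_matrices(h, n_fine):
    """Periodic W_s and W_w, each of size n_fine x (n_fine / 2), built from h."""
    n_coarse = n_fine // 2
    Ws = np.zeros((n_fine, n_coarse))
    Ww = np.zeros((n_fine, n_coarse))
    for k in range(n_coarse):
        for i, hi in enumerate(h):
            Ws[(2 * k + i) % n_fine, k] += hi                  # h_i multiplies v_{2k+i}
            Ww[(2 * k + 1 - i) % n_fine, k] += (-1) ** i * hi  # g_{1-i} = (-1)^i h_i
    return Ws, Ww

def forward_transform(v, h, coarse_size):
    """Split v level by level into wavelet parts and a final coarse scaling part."""
    details = []  # finest level first
    while len(v) > coarse_size:
        Ws, Ww = wavelet_matrices(h, len(v))
        details.append(Ww.T @ v)
        v = Ws.T @ v
    return details, v

def inverse_transform(details, v, h):
    """Undo forward_transform: v_fine = W_w d + W_s v_coarse at every level."""
    for d in reversed(details):
        Ws, Ww = wavelet_matrices(h, 2 * len(v))
        v = Ww @ d + Ws @ v
    return v
```

Because all indices are taken modulo the signal length, the matrix $(W_{w}\ W_{s})$ stays orthogonal for every even length, so the inverse is just the transpose.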

2.2 Nonstandard wavelet form for integral operator

Let $A$ be an integral operator with kernel $a(x,y)$ , applied to periodic functions defined on $[0,1]$ , i.e.,

$$u=Av,\quad\text{equivalently}\quad u(x)=\int a(x,y)v(y)\,dy. \tag{2.13}$$

Denote by $A^{(L)}=\left(A_{k_{1},k_{2}}^{(L)}\right)\in\mathbb{R}^{2^{L}\times 2^{L}}$ the Galerkin projection of $A$ to the space $V_{L}$ , for a sufficiently deep level $L$ , i.e.

$$A_{k_{1},k_{2}}^{(L)}=\int\int\varphi_{k_{1}}^{(L)}(x)\,a(x,y)\,\varphi_{k_{2}}^{(L)}(y)\,dx\,dy.$$

The nonstandard form described in [5] is a remarkably efficient way to compress the matrix $A^{(L)}$ .

The main step for the nonstandard form is to treat $A^{(L)}$ as an image and use the 2D multiresolution analysis

[TABLE]

for $\ell=L_{0},\dots,L-1$ , and $k_{1},k_{2}=0,\dots,2^{\ell}-1$ . For convenience, these coefficients are organized into the matrix form as

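$$A^{(\ell)}=\left(A_{k_{1},k_{2}}^{(\ell)}\right)_{k_{1},k_{2}=0,\dots,2^{\ell}-1},\quad D_{j}^{(\ell)}=\left(D_{j,k_{1},k_{2}}^{(\ell)}\right)_{k_{1},k_{2}=0,\dots,2^{\ell}-1},\quad j=1,2,3.$$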

In this setting, a similar recursive relation to Eq. 2.10 can be obtained

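$$\begin{pmatrix}D_{1}^{(\ell)}&D_{3}^{(\ell)}\\ D_{2}^{(\ell)}&A^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}A^{(\ell+1)}W^{(\ell)},\quad\ell=L_{0},\dots,L-1. \tag{2.16}$$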

If $A$ is a Calderón-Zygmund operator, the entries of the matrices $D_{j}^{(\ell)}$ with $j=1,2,3$ decay rapidly away from the diagonal. For a prescribed relative accuracy $\epsilon$, each matrix $D_{j}^{(\ell)}$ can be approximated by a band matrix by truncating at a band of width $O(\log(1/\epsilon))$. Since the bandwidth is independent of the specific choices of $\ell$, $j$, or the mesh size $N=2^{L}$, the nonstandard form of $A$ stores only $O(N)$ nonzero entries. Readers are referred to [5] for more details. With a slight abuse of notation, the matrices $D_{j}^{(\ell)}$ are assumed to be pre-truncated in what follows.
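The band truncation can be illustrated on a small example. The sketch below makes several simplifying assumptions that are not from the paper: it uses the Haar wavelet instead of a higher-order Daubechies wavelet, samples the periodic kernel $-\log|2\sin(\pi(x-y))|$ on a grid rather than forming a Galerkin projection, and performs a single level of the transform. All function names are illustrative.

```python
import numpy as np

def log_kernel_matrix(n):
    """Grid samples of the periodic kernel -log|2 sin(pi (x - y))|, zero on the diagonal."""
    x = np.arange(n) / n
    diff = x[:, None] - x[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    A = np.zeros((n, n))
    A[off_diag] = -np.log(np.abs(2.0 * np.sin(np.pi * diff[off_diag])))
    return A / n

def haar_wavelet_block(A):
    """D_1 block of one Haar step: wavelet in both the row and the column index."""
    return 0.5 * (A[1::2, 1::2] - A[1::2, 0::2] - A[0::2, 1::2] + A[0::2, 0::2])

def truncate_to_band(D, half_width):
    """Zero every entry whose periodic distance to the diagonal exceeds half_width."""
    n = D.shape[0]
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)
    return np.where(dist <= half_width, D, 0.0)

D1 = haar_wavelet_block(log_kernel_matrix(256))
half_widths = (1, 2, 4, 8)
errors = [
    np.linalg.norm(D1 - truncate_to_band(D1, b)) / np.linalg.norm(D1)
    for b in half_widths
]
```

A band of half-width $b$ keeps at most $(2b+1)$ entries per row, so the storage for each block grows linearly with its size, while `errors` records how much of the block (in the Frobenius norm) is discarded for each width.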

One can assemble all the matrices $D_{j}^{(\ell)}$ and $A^{(L_{0})}$ together by defining the matrix $S^{(L)}$ recursively as

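$$S^{(L_{0})}=\begin{pmatrix}D_{1}^{(L_{0})}&D_{3}^{(L_{0})}\\ D_{2}^{(L_{0})}&A^{(L_{0})}\end{pmatrix},\quad S^{(\ell+1)}=\begin{pmatrix}D_{1}^{(\ell+1)}&D_{3}^{(\ell+1)}&0\\ D_{2}^{(\ell+1)}&0&0\\ 0&0&S^{(\ell)}\end{pmatrix},\quad\ell=L_{0},\dots,L-1. \tag{2.17}$$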

The matrix $S:=S^{(L)}$ is the nonstandard form of the matrix $A=A^{(L)}$ satisfying

$$A=WSW^{\mathsf{T}}. \tag{2.18}$$

Here $W$ is the extended wavelet transform matrix, defined in the recursive form as

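$$T^{(L_{0})}=W^{(L_{0})},\quad T^{(\ell+1)}=\begin{pmatrix}W^{(\ell+1)}&W_{s}^{(\ell+1)}T^{(L_{0})}\end{pmatrix},\quad\ell=L_{0},\dots,L-1,\quad W:=T^{(L)}. \tag{2.19}$$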

Figure 1 illustrates the matrices $W$ and $S$ along with the formulation Eq. 2.18.

To clarify the notations, we denote by $W_{s}^{(\ell)}$ and $W_{w}^{(\ell)}$ the transform matrices defined in Eq. 2.9 for the scaling and wavelet parts on level $\ell$ , respectively. $W^{(\ell)}=(W_{w}^{(\ell)}\quad W_{s}^{(\ell)})$ is the wavelet transform matrix at the level $\ell$ .

2.3 Matrix-vector multiplication in the nonstandard form

With a Galerkin discretization of Eq. 2.13 at level $L$ , the matrix-vector multiplication takes the form

$$u^{(L)}=A^{(L)}v^{(L)}. \tag{2.20}$$

The nonstandard form allows for accelerating the evaluation of Eq. 2.20. Using the nonstandard form $A^{(L)}=WSW^{\mathsf{T}}$ obtained above, the matrix-vector multiplication

$$u^{(L)}=WSW^{\mathsf{T}}v^{(L)}$$

can be split into four steps:

1. $A^{(L)}\to S$: generate the nonstandard form $S$ from the matrix $A^{(L)}$ or the kernel $a(x,y)$;

2. $v^{(L)}\to\hat{v}:=W^{\mathsf{T}}v^{(L)}$: apply the (forward) wavelet transform to $v^{(L)}$ to get $\hat{v}$;

3. $\hat{u}:=S\hat{v}$: evaluate the matrix-vector multiplication in the nonstandard form;

4. $\hat{u}\to u^{(L)}:=W\hat{u}$: apply the inverse wavelet transform to $\hat{u}$ to obtain $u^{(L)}$.

The first step is computed using Eq. 2.16 if the matrix $A^{(L)}$ is given. The second step follows Eq. 2.10. The third step can be written as

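$$\begin{pmatrix}w^{(\ell)}\\ s^{(\ell)}\end{pmatrix}=\begin{pmatrix}D_{1}^{(\ell)}&D_{3}^{(\ell)}\\ D_{2}^{(\ell)}&D_{4}^{(\ell)}\end{pmatrix}\begin{pmatrix}d^{(\ell)}\\ v^{(\ell)}\end{pmatrix},$$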

where $D_{4}^{(\ell)}=0$ for $\ell=L_{0}+1,\dots,L-1$ and $D_{4}^{(L_{0})}=A^{(L_{0})}$ . The fourth step is essentially an inverse wavelet transform, implemented as

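$$u^{(L_{0})}=0,\quad u^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}w^{(\ell)}\\ s^{(\ell)}+u^{(\ell)}\end{pmatrix},\quad\ell=L_{0},\dots,L-1.$$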

A step-by-step description of these four steps is given in Algorithm 1.
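The four steps can be sketched with dense NumPy blocks as follows. This is only an illustration under stated assumptions, not a reproduction of Algorithm 1: the blocks $D_{j}^{(\ell)}$ are kept dense (no band truncation), so the result matches $A^{(L)}v^{(L)}$ up to round-off; the filter is the four-coefficient Daubechies filter ($p=2$); and the function names are hypothetical.

```python
import numpy as np

S3 = np.sqrt(3.0)
H = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))  # Daubechies, p = 2

def wavelet_matrix(n_fine, h=H):
    """Periodic one-level matrix W = (W_w  W_s) of size n_fine x n_fine."""
    n = n_fine // 2
    Ww, Ws = np.zeros((n_fine, n)), np.zeros((n_fine, n))
    for k in range(n):
        for i, hi in enumerate(h):
            Ws[(2 * k + i) % n_fine, k] += hi
            Ww[(2 * k + 1 - i) % n_fine, k] += (-1) ** i * hi  # g_{1-i} = (-1)^i h_i
    return np.hstack([Ww, Ws])

def build_nonstandard_form(A, coarse_size):
    """Step 1: peel off (D1, D2, D3) level by level, finest level first."""
    levels = []
    while A.shape[0] > coarse_size:
        n = A.shape[0] // 2
        W = wavelet_matrix(2 * n)
        B = W.T @ A @ W  # block layout [[D1, D3], [D2, A_coarse]]
        levels.append((W, B[:n, :n], B[n:, :n], B[:n, n:]))
        A = B[n:, n:]
    return levels, A  # A is now the coarsest matrix A^(L0)

def nonstandard_matvec(levels, A_coarse, v):
    """Steps 2 to 4: forward transform, blockwise products, inverse transform."""
    coeffs = []
    for W, _, _, _ in levels:  # step 2, fine to coarse
        c = W.T @ v
        n = len(c) // 2
        d, v = c[:n], c[n:]
        coeffs.append((d, v))
    u = A_coarse @ v  # the D4 contribution, nonzero only at the coarsest level
    for (W, D1, D2, D3), (d, vs) in zip(reversed(levels), reversed(coeffs)):
        w = D1 @ d + D3 @ vs                # step 3
        s = D2 @ d
        u = W @ np.concatenate([w, s + u])  # step 4, coarse to fine
    return u
```

With band-truncated blocks stored as vectors, each product in step 3 costs a number of operations proportional to the block size, which is where the linear overall cost comes from.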

2.4 The multidimensional case

The matrix-vector multiplication in the nonstandard form can be easily extended to the multidimensional case with the help of multidimensional orthogonal wavelets (see [44] for more details). For instance, in the two-dimensional setting, one defines at each scale $\ell$ three different types of wavelets of the form

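$$\psi_{1,k}^{(\ell)}(x,y)=\varphi_{k_{1}}^{(\ell)}(x)\psi_{k_{2}}^{(\ell)}(y),\quad\psi_{2,k}^{(\ell)}(x,y)=\psi_{k_{1}}^{(\ell)}(x)\varphi_{k_{2}}^{(\ell)}(y),\quad\psi_{3,k}^{(\ell)}(x,y)=\psi_{k_{1}}^{(\ell)}(x)\psi_{k_{2}}^{(\ell)}(y),$$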

with $k=(k_{1},k_{2})\in\mathbb{Z}^{2}$ . Using these three types of wavelets, the transform matrix at each scale $\ell$ used in Eq. 2.10 is redefined to be $W^{(\ell)}=\left(W_{w,1}^{(\ell)}\quad W_{w,2}^{(\ell)}\quad W_{w,3}^{(\ell)}\quad W_{s}^{(\ell)}\right)$ . The 2D analog of Eq. 2.10 contains three types of wavelet coefficients

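$$\begin{pmatrix}d_{1}^{(\ell)}\\ d_{2}^{(\ell)}\\ d_{3}^{(\ell)}\\ v^{(\ell)}\end{pmatrix}=(W^{(\ell)})^{\mathsf{T}}v^{(\ell+1)},\quad v^{(\ell+1)}=W^{(\ell)}\begin{pmatrix}d_{1}^{(\ell)}\\ d_{2}^{(\ell)}\\ d_{3}^{(\ell)}\\ v^{(\ell)}\end{pmatrix}.$$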

Similarly, the recursive relation Eq. 2.16 can be extended as well

[TABLE]

where $D_{j}^{(\ell)}$ , $j=1,\dots,15$ are all sparse matrices with only $O(4^{\ell})$ non-negligible entries in each. The matrix-vector multiplication follows the steps of Algorithm 1, with these necessary changes.

3 Meta-learning approach

The plan is to apply the nonstandard form to the operator $G_{\eta}$ in Eq. 1.2

[TABLE]

With a slight abuse of notation, the same letters are used to denote the discretizations. The discrete version of Eq. 3.1 takes the form

[TABLE]

with $N=2^{L}$ . The main goal of this paper is to construct a neural network to learn the map $\eta\to G_{\eta}$ .

Following Eq. 2.18 and applying the wavelet transform to the matrix $G_{\eta}$ leads to

[TABLE]

where $W$ is the extended wavelet transform matrix, independent of the parameter $\eta$ . Since each block of matrix $S_{\eta}$ is a band matrix, the nonzero entries of each block can be represented by a set of vectors. Let us define $C_{\eta}^{(\ell)}$ , of size $2^{\ell}\times n_{c}$ , to be the collection of these vectors of $S_{\eta}$ at level $\ell$ , with $n_{c}$ dependent on the bandwidth and $\ell$ . By introducing the collection of vectors $C_{\eta}:=\{C_{\eta}^{(\ell)}\}_{\ell=L_{0},\dots,L-1}$ , $S_{\eta}$ is uniquely determined by $C_{\eta}$ , i.e.,

$$S_{\eta}=\mathcal{S}[C_{\eta}],$$

for a fixed embedding operator $\mathcal{S}$ determined by the sparsity pattern of $S_{\eta}$ .
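For a single band block, the embedding of a vector collection into a matrix can be sketched as below. The layout is an assumption made for illustration: column $j$ of `C` is stored along the periodic diagonal with offset `offsets[j]`, and the offsets are distinct modulo the block size. The full operator $\mathcal{S}$ would additionally place such blocks at every level; that assembly is not shown.

```python
import numpy as np

def band_from_vectors(C, offsets):
    """Embed columns of C as periodic diagonals: D[k, (k + offsets[j]) % n] = C[k, j]."""
    n = C.shape[0]
    D = np.zeros((n, n))
    rows = np.arange(n)
    for j, off in enumerate(offsets):
        D[rows, (rows + off) % n] = C[:, j]
    return D

def band_matvec(C, offsets, v):
    """Apply the band matrix to v directly from C, without assembling the matrix."""
    out = np.zeros(len(v))
    for j, off in enumerate(offsets):
        out += C[:, j] * np.roll(v, -off)  # np.roll(v, -off)[k] == v[(k + off) % n]
    return out
```

The second function shows why the vector form is convenient: the product reduces to a few elementwise multiplications with shifted copies of the input, at a cost proportional to the number of stored entries.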

Given a set of training samples of the form

$$\{(\eta_{i},\{f_{ij},u_{ij}\}_{j})\}_{i},$$

where $u_{ij}=G_{\eta_{i}}f_{ij}$ can be obtained by solving $L_{\eta_{i}}u_{ij}=f_{ij}$ with right-hand side $f_{ij}$, the meta-learning approach first learns both the map $\eta\to C_{\eta}$ and the wavelet transform matrix $W$. Once they are ready, given any new $\eta$, $G_{\eta}$ can be approximated by evaluating the map $\eta\to C_{\eta}$ and representing Eq. 3.3 in an NN form.
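A minimal sketch of how such training samples might be generated is given below. Everything specific in it is an assumption chosen for illustration: the operator is a 1D periodic finite-difference discretization of $-\Delta+\eta$, the potentials are a constant plus one random cosine mode, the right-hand sides are Gaussian random vectors, and a dense direct solver stands in for whatever fast solver is available.

```python
import numpy as np

def periodic_laplacian(n):
    """Second-order finite-difference Laplacian on [0, 1) with periodic boundary."""
    I = np.eye(n)
    return (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2.0 * I) * n**2

def sample_training_set(n_eta, n_rhs, n, rng):
    """Return a list of (eta_i, F_i, U_i); column j of U_i solves L_{eta_i} u = f_ij."""
    lap = periodic_laplacian(n)
    x = np.arange(n) / n
    samples = []
    for _ in range(n_eta):
        # Hypothetical random potential: positive constant plus one smooth mode.
        amp, phase = rng.uniform(0.0, 0.5), rng.uniform(0.0, 1.0)
        eta = 1.0 + amp * np.cos(2 * np.pi * (x - phase))
        L = -lap + np.diag(eta)
        F = rng.standard_normal((n, n_rhs))
        U = np.linalg.solve(L, F)  # only matrix-vector data, G_eta is never formed
        samples.append((eta, F, U))
    return samples
```

Note that only pairs $(f_{ij},u_{ij})$ are stored for each $\eta_{i}$; the operator $G_{\eta_{i}}$ itself is never assembled, which is the point of this data format.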

3.1 Neural network architecture

Using the factorization of $G_{\eta}$ in Eq. 3.3, one can factorize $u_{ij}=G_{\eta_{i}}f_{ij}$ as

[TABLE]

Similar to the matrix-vector multiplication in Section 2.3 of the nonstandard form, we propose a neural network for meta-learning Eq. 3.5 with four modules:

1. $\eta\to S_{\eta}$: a module learns the map $\eta\to C_{\eta}$ and then generates the banded sparse matrix $S_{\eta}$ from $C_{\eta}$ (denoted as $S_{\eta}=\mathcal{S}[C_{\eta}]$);

2. $v^{(L)}\to\hat{v}:=W^{\mathsf{T}}v^{(L)}$: a module applies the forward wavelet transform to $v^{(L)}$ to generate $\hat{v}$;

3. $\hat{u}:=S_{\eta}\hat{v}$: a module evaluates the matrix-vector multiplication in the nonstandard form;

4. $\hat{u}\to u^{(L)}:=W\hat{u}$: a module applies the inverse wavelet transform to $\hat{u}$ to generate $u^{(L)}$.

Instead of computing $C_{\eta}$ from the full operator $G_{\eta}$ as described in Section 2.3, the first module forms $C_{\eta}$ directly from the parameter $\eta$ using a deep NN. This module can be split into two steps: (1) carrying out the map $\eta\to C_{\eta}^{(\ell)}$ for each scale $\ell$; (2) constructing the nonstandard form $S_{\eta}$ from $C_{\eta}:=\{C_{\eta}^{(\ell)}\}_{\ell=L_{0},\dots,L-1}$. The NN architecture for the map $\eta\to C_{\eta}^{(\ell)}$ is often problem-dependent. For many applications, including the ones to be considered in Sections 4 and 5, the problem is often translation-invariant, i.e., for any translation operator $T$,

[TABLE]

For such problems, a convolutional NN is often used for its efficiency and robustness.
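Translation invariance can be checked numerically on a small example. The sketch below assumes a 1D periodic finite-difference discretization of $-\Delta+\eta$ and tests the relation $G_{T\eta}=TG_{\eta}T^{-1}$ for a cyclic shift $T$; this particular way of writing the property, and the function names, are assumptions made for the illustration.

```python
import numpy as np

def green_matrix(eta):
    """Dense inverse of the periodic finite-difference operator -Laplacian + diag(eta)."""
    n = len(eta)
    I = np.eye(n)
    lap = (np.roll(I, 1, axis=1) + np.roll(I, -1, axis=1) - 2.0 * I) * n**2
    return np.linalg.inv(-lap + np.diag(eta))

def translate_operator(G, shift):
    """Return T G T^{-1} for the cyclic shift T, i.e. shift both indices of the kernel."""
    return np.roll(np.roll(G, shift, axis=0), shift, axis=1)
```

Shifting the potential and then inverting gives the same matrix as inverting first and shifting both indices of the result, which is the structure a convolutional network preserves by construction.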

The second and fourth modules perform the forward and inverse wavelet transforms (as in Section 2.3), respectively, for a specific wavelet basis. The selection of an effective wavelet basis is often problem-dependent. The capability of learning a problem-dependent wavelet transform from data is essential for the accuracy of the NN architecture.

Combining these four modules results in the architecture summarized in Algorithm 2. An illustration is given in Fig. 2. Below we describe details of the layers and parameters used in this architecture.

Implementation details.

The input, output, and intermediate data of the NN architecture are all represented with $2$-tensors. For a tensor of size $N\times\alpha$, we refer to $N$ as the spatial dimension and $\alpha$ as the channel dimension. The main tool is the convolutional layer. Given an input tensor $\xi$ of size $N\times\alpha$, the convolutional layer outputs a tensor $\zeta$ of size $N^{\prime}\times\alpha^{\prime}$ obtained via

[TABLE]

where $w$ is the window size, $s$ is the stride and $\phi$ is the activation function, usually chosen to be a linear function, a rectified-linear unit (ReLU) function, or a sigmoid function. We denote this convolutional layer as

[TABLE]

The basic building blocks and layers used in Algorithm 2 are listed below.

•

$\eta\to C_{\eta}^{(\ell)}:C_{\eta}^{(\ell)}={{\sf{ConvNet}}}[\ell,\alpha,n_{\mathrm{cnn}}](\eta)$ . As discussed above, it is often a convolutional NN if the system Eq. 3.1 is translation invariant. Since the spatial size of $\eta$ is greater than that of $C_{\eta}^{(\ell)}$ , ${{\sf{ConvNet}}}[\ell,\alpha,n_{\mathrm{cnn}}]$ consists of $n_{\mathrm{cnn}}$ convolutional layers and several downsampling or pooling layers.

•

Forward wavelet transform at level $\ell$: $(d^{(\ell)},v^{(\ell)})={\sf{FWT}}[\alpha](v^{(\ell+1)})$. This is the NN representation of the first equation in Eq. 2.10. It is implemented as $f^{(\ell)}={{\sf{Conv1d}}}[2\alpha,2p,2,{{\sf{id}}}](v^{(\ell+1)})$, where the first $\alpha$ and the last $\alpha$ channels of $f^{(\ell)}$ are assigned to $d^{(\ell)}$ and $v^{(\ell)}$, respectively.

•

Inverse wavelet transform at level $\ell$: $u^{(\ell+1)}={\sf{IWT}}[\alpha]([w^{(\ell)},s^{(\ell)}])$. This is the NN representation of the second equation in Eq. 2.10. The expression $[w^{(\ell)},s^{(\ell)}]$ stands for concatenating the $2$-tensors $w^{(\ell)}$ and $s^{(\ell)}$ of size $2^{\ell}\times\alpha$ into a $2$-tensor of size $2^{\ell}\times 2\alpha$ along the channel dimension. This layer first applies the inverse transform, implemented by ${{\sf{Conv1d}}}[2\alpha,p,1,{{\sf{id}}}]$, and then reshapes the output of size $2^{\ell}\times 2\alpha$ to a $2$-tensor of size $2^{\ell+1}\times\alpha$ by a column-first ordering.
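To make the correspondence between wavelet transforms and convolutional layers concrete, the sketch below writes both layers for the simplest case: the Haar wavelet ($p=1$) with $\alpha=1$ and fixed, untrained weights. The periodic convolution helper, its index convention, and the weight layout `(window, in_channels, out_channels)` are assumptions for this illustration; in the architecture described above the weights would be trainable.

```python
import numpy as np

def conv1d_periodic(xi, weight, stride):
    """Linear periodic convolution: out[i, c'] = sum_{j, c} xi[(stride*i + j) % N, c] * weight[j, c, c']."""
    n = xi.shape[0]
    window = weight.shape[0]
    n_out = n // stride
    out = np.zeros((n_out, weight.shape[2]))
    for i in range(n_out):
        for j in range(window):
            out[i] += xi[(stride * i + j) % n] @ weight[j]
    return out

R = 1.0 / np.sqrt(2.0)
# Forward weights, shape (2, 1, 2): output channel 0 is d (filter g), channel 1 is v (filter h).
FWT_WEIGHT = np.array([[[-R, R]], [[R, R]]])
# Inverse weights, shape (1, 2, 2): channel 0 is the even sample, channel 1 the odd sample.
IWT_WEIGHT = np.array([[[-R, R], [R, R]]])

def fwt(v_fine):
    """Window 2p = 2, stride 2, 2*alpha = 2 output channels, identity activation."""
    f = conv1d_periodic(v_fine, FWT_WEIGHT, stride=2)
    return f[:, :1], f[:, 1:]

def iwt(d, v):
    """Window p = 1, stride 1, then interleave the two channels into a finer signal."""
    f = conv1d_periodic(np.concatenate([d, v], axis=1), IWT_WEIGHT, stride=1)
    return f.reshape(-1, 1)
```

For longer filters the wavelet channel uses a shifted window relative to the scaling channel, which a trainable periodic convolution can absorb into its weights; the Haar case avoids that bookkeeping.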

The generation of $D_{j}^{(\ell)}$ and $A^{(L_{0})}$ from $C_{\eta}^{(\ell)}$ in line 7 and the matrix-vector multiplication in line 16 of Algorithm 2 require some discussion. Figure 3 illustrates two approaches for evaluating the matrix-vector multiplication of a band matrix whose nonzero entries are stored in a set of vectors. The left figure corresponds to the case in Algorithm 2, while the right one is used in the actual implementation. To avoid the copying and shifting of $d^{(\ell)}$ and $v^{(\ell)}$, it is convenient to set $\alpha_{2}=\alpha_{1}=\alpha$. Though there are slightly more NN parameters in this case, this implementation change allows for a more flexible NN that can learn faster.

3.2 The multidimensional case

Let us focus on the 2D case. The input, the output and the intermediate data are all $3$ -tensors of size $N_{1}\times N_{2}\times\alpha$ , where $N=(N_{1},N_{2})$ is the spatial dimension and $\alpha$ is the channel dimension. The convolutional layer takes the form

[TABLE]

where the input $\xi$ is of size $N_{1}\times N_{2}\times\alpha$ and the output $\zeta$ is of size $N_{1}^{\prime}\times N_{2}^{\prime}\times\alpha^{\prime}$ . Here, the same stride $s$ and window size $w$ are used in both dimensions. We denote this convolutional layer as

[TABLE]

Algorithm 2 can be easily extended to the 2D case, following the same way that Algorithm 1 was extended in Section 2.4. Since there are three different types of wavelets in 2D, the layers in Algorithm 2 are redefined as follows:

•

$\eta\to C_{\eta}^{(\ell)}$ module: $C_{\eta}^{(\ell)}={{\sf{ConvNet}}}[\ell,\alpha,n_{\mathrm{cnn}}](\eta)$ . This module is often a two-dimensional convolutional NN with several downsampling or pooling layers.

•

Wavelet transform at level $\ell$: $(d_{1}^{(\ell)},d_{2}^{(\ell)},d_{3}^{(\ell)},v^{(\ell)})={\sf{FWT}}[\alpha](v^{(\ell+1)})$. This is implemented using $f^{(\ell)}={{\sf{Conv2d}}}[4\alpha,2p,2,{{\sf{id}}}](v^{(\ell+1)})$. The first, second, third and last $\alpha$ channels of $f^{(\ell)}$ are assigned to $d_{1}^{(\ell)}$, $d_{2}^{(\ell)}$, $d_{3}^{(\ell)}$ and $v^{(\ell)}$, respectively.

•

Inverse wavelet transform at level $\ell$: $u^{(\ell+1)}={\sf{IWT}}[\alpha]([w_{1}^{(\ell)},w_{2}^{(\ell)},w_{3}^{(\ell)},s^{(\ell)}])$. This is implemented by first computing ${{\sf{Conv2d}}}[4\alpha,p,1,{{\sf{id}}}]([d_{1}^{(\ell)},d_{2}^{(\ell)},d_{3}^{(\ell)},v^{(\ell)}+u^{(\ell)}])$, and then reshaping the output of size $2^{\ell}\times 2^{\ell}\times 4\alpha$ to a $3$-tensor of size $2^{\ell+1}\times 2^{\ell+1}\times\alpha$. The reshape operation is performed as follows: (1) reshape the output to a $5$-tensor of size $2^{\ell}\times 2^{\ell}\times 2\times 2\times\alpha$ by splitting the last dimension; (2) permute the second and third dimensions to obtain a $5$-tensor of size $2^{\ell}\times 2\times 2^{\ell}\times 2\times\alpha$; (3) group the first and second dimensions, and the third and fourth dimensions, respectively, to obtain the resulting $3$-tensor of size $2^{\ell+1}\times 2^{\ell+1}\times\alpha$.
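The three-step reshape can be written in a few lines of NumPy. The sketch assumes row-major (C-order) storage and that the $4\alpha$ channels are ordered so that the two sub-pixel indices vary slower than the $\alpha$ index; the text does not fix this ordering, so it is an assumption. The function name is hypothetical.

```python
import numpy as np

def interleave_2d(out, alpha):
    """Map an (n, n, 4*alpha) tensor to a (2n, 2n, alpha) tensor by sub-pixel interleaving."""
    n = out.shape[0]
    x = out.reshape(n, n, 2, 2, alpha)     # (1) split the channel dimension
    x = x.transpose(0, 2, 1, 3, 4)         # (2) swap the second and third dimensions
    return x.reshape(2 * n, 2 * n, alpha)  # (3) merge dimension pairs (1, 2) and (3, 4)
```

With this ordering, entry `[2*i + a, 2*j + b, c]` of the result equals entry `[i, j, (2*a + b)*alpha + c]` of the input, so each coarse cell `(i, j)` is expanded into a $2\times 2$ patch of fine cells.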

4 Elliptic partial differential equations

This section applies the meta-learning approach described in Section 3 to the Green’s functions of elliptic PDEs, both in the Schrödinger form and in the divergence form.

4.1 Schrödinger form

Consider the equation

$$-\Delta u(x)+\eta(x)u(x)=f(x),\quad x\in\Omega, \tag{4.1}$$

with a periodic boundary condition, where $\eta(x)>0$ is the potential and $f(x)$ is the source term. Following the notations of Section 1,

[TABLE]

Since the problem Eq. 4.1 is translation-invariant due to the periodic boundary condition, the map $\eta\to C_{\eta}^{(\ell)}$ can be represented with a convolutional NN. In what follows, we first derive the explicit dependence of $C_{\eta}^{(\ell)}$ on $\eta$ using a linear perturbative analysis and then report some numerical studies.

Mathematical analysis.

When $\eta$ is close to a fixed homogeneous background $\eta_{0}>0$ , it is convenient to write

[TABLE]

Let $G_{0}=L_{0}^{-1}$ be the Green’s function of $L_{0}$ with the periodic boundary condition. Using the Neumann series for the resolvent $(I-G_{0}E_{\eta})^{-1}$ with $|\eta(x)-\eta_{0}|$ sufficiently small, one can write the Green’s function $G_{\eta}$ as a perturbative expansion

[TABLE]

For sufficiently small $|\eta(x)-\eta_{0}|$ , the operator $G_{\eta}$ can be approximated by its linear part as

[TABLE]

Let $g_{0}(x,y)$ and $g_{\eta}(x,y)$ be the kernels of $G_{0}$ and $G_{\eta}$ , respectively. Since $G_{0}$ is the Green’s function of $-\Delta+\eta_{0}$ with the periodic boundary condition, the kernel $g_{0}$ is translation-invariant, i.e., $g_{0}(x,y)=g_{0}(x-y)$ . The wavelet-wavelet coefficients of $G_{\eta}$ at level $\ell$ take the form

[TABLE]

where $\widetilde{\psi}^{(\ell)}_{k}(z):=(g_{0}*\psi^{(\ell)}_{k})(z)$ . For a fixed diagonal of $D_{1}^{(\ell)}$ with $k_{2}=k_{1}+c$ for a constant $c$ , Eq. 4.6 states that the map from $\eta$ to $D^{(\ell)}_{1,k_{1},k_{1}+c}$ over all $k_{1}$ is a convolution plus a term independent of $\eta$ , which can be represented by the Conv1d layer in Eq. 3.7. The same conclusion extends directly to $D^{(\ell)}_{j,k_{1},k_{2}}$ , $j=2,3$ , and $A^{(L_{0})}_{k_{1},k_{2}}$ .
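The chain of approximations in this analysis can be written out as a short sketch. The sign convention $E_{\eta}=\eta_{0}-\eta$ (so that $L_{\eta}=L_{0}-E_{\eta}$ ) is an assumption chosen to match the resolvent $(I-G_{0}E_{\eta})^{-1}$ used above. The inner-product notation, the wavelet spacing $h_{\ell}$ , and the filter $\kappa_{c}$ are introduced here for illustration only and may differ from the paper's displayed equations in normalization.

```latex
\begin{align*}
L_\eta &= -\Delta+\eta = L_0-E_\eta,
  \qquad L_0=-\Delta+\eta_0,\quad E_\eta=\eta_0-\eta,\\
G_\eta &= (I-G_0E_\eta)^{-1}G_0
  = \sum_{k\ge 0}(G_0E_\eta)^{k}G_0
  \;\approx\; G_0+G_0E_\eta G_0,\\
D^{(\ell)}_{1,k_1,k_2}
  &= \big\langle \psi^{(\ell)}_{k_1},\,G_\eta\psi^{(\ell)}_{k_2}\big\rangle
  \;\approx\; \big\langle \psi^{(\ell)}_{k_1},\,G_0\psi^{(\ell)}_{k_2}\big\rangle
  +\int \big(\eta_0-\eta(z)\big)\,
   \widetilde{\psi}^{(\ell)}_{k_1}(z)\,\widetilde{\psi}^{(\ell)}_{k_2}(z)\,dz,\\
\text{and, with } k_2=k_1+c:\quad
  &\int \big(\eta_0-\eta(z)\big)\,\kappa_c(z-k_1h_\ell)\,dz,
  \qquad \kappa_c(z):=\widetilde{\psi}^{(\ell)}_{0}(z)\,\widetilde{\psi}^{(\ell)}_{0}(z-ch_\ell).
\end{align*}
```

The third line uses the symmetry of $G_{0}$ to move one factor of $G_{0}$ onto $\psi^{(\ell)}_{k_{1}}$ . The last line uses $\widetilde{\psi}^{(\ell)}_{k}(z)=\widetilde{\psi}^{(\ell)}_{0}(z-kh_{\ell})$ , which follows from the translation invariance of $g_{0}$ . For each fixed offset $c$ the $\eta$ -dependent term is therefore a correlation of $\eta$ with the fixed filter $\kappa_{c}$ , sampled at the wavelet locations $k_{1}h_{\ell}$ , while the first term does not depend on $\eta$ .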

When $|\eta(x)-\eta_{0}|$ is not small, one can account for the nonlinearities neglected in the perturbative analysis by using multiple convolutional layers and making use of nonlinear activation functions. In other words, it is natural to approximate the map $\eta\to C_{\eta}^{(\ell)}$ using a convolutional NN with enough layers and an appropriate window size [40, 30, 46].

Moreover, since $G_{\eta}$ is symmetric, $D_{1}^{(\ell)}$ and $A^{(L_{0})}$ are symmetric and $(D_{2}^{(\ell)})^{\mathsf{T}}=D_{3}^{(\ell)}$ . In the implementation, the symmetry is enforced by generating $D_{3}^{(\ell)}$ from $D_{2}^{(\ell)}$ and replacing $D_{1}^{(\ell)}$ (or $A^{(L_{0})}$ ) by $\frac{1}{2}(D_{1}^{(\ell)}+(D_{1}^{(\ell)})^{\mathsf{T}})$ (or $\frac{1}{2}(A^{(L_{0})}+(A^{(L_{0})})^{\mathsf{T}})$ , respectively). Since the Schrödinger form considered in this section uses a periodic boundary condition, the convolutional layers are all implemented with periodic padding.
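The two implementation details in this paragraph, symmetrization and periodic padding, can be sketched in NumPy as follows. This is an illustration under assumptions: the blocks are held as small dense matrices (the paper stores them in compressed form as collections of vectors), and the helper names `enforce_symmetry` and `periodic_conv1d` are hypothetical.

```python
import numpy as np


def enforce_symmetry(d1, d2):
    """Return (D1, D2, D3) with D1 symmetrized and D3 generated from D2."""
    d1_sym = 0.5 * (d1 + d1.T)   # replace D1 by its symmetric part
    d3 = d2.T.copy()             # impose D3 = D2^T
    return d1_sym, d2, d3


def periodic_conv1d(x, kernel):
    """Stride-1 correlation of x with kernel using periodic (wrap) padding."""
    w = kernel.size
    left = w // 2
    # Wrap-around padding so the output has the same length as the input.
    padded = np.pad(x, (left, w - 1 - left), mode="wrap")
    return np.correlate(padded, kernel, mode="valid")
```

With periodic padding the layer commutes with cyclic shifts: shifting the input by any number of grid points shifts the output by the same amount, which is the discrete counterpart of the translation invariance used in the analysis.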

Numerical results.

The NN discussed above is implemented in Keras [7] (running on top of TensorFlow [1]). The parameters of the NN are initialized randomly from the normal distribution. The loss function is set to be the mean squared error

[TABLE]

where the exact solution, obtained by solving Eq. 4.2, is denoted as $u$ and the NN prediction as $u^{\mathrm{NN}}$ . ${N_{\mathrm{samples}}}$ denotes the number of samples. The NN is trained until convergence using the Nadam optimizer [12] with the learning rates equal to $10^{-3}$ for the 1D case and $10^{-4}$ for the 2D case. The batch size is set to be one percent of the number of training samples. The support of the scaling function $\varphi$ is chosen to be $2p=6$ . The number of levels $L-L_{0}$ in the wavelet transform is $6$ for the 1D case and $4$ for the 2D case.

The data set contains $5,000$ different $\eta$ , and for each $\eta$ Eq. 4.2 is solved with $20$ randomly generated $f$ using the central difference scheme. Therefore, the number of training samples corresponds to the number of different $\{f_{ij}\}$ , rather than the number of different $\{\eta_{i}\}$ . Half of the generated data is used for training, while the other half is reserved for testing. The accuracy of the NN is measured by the relative error in the $\ell^{2}$ norm

$$\epsilon=\frac{\|u-u^{\mathrm{NN}}\|_{\ell^{2}}}{\|u\|_{\ell^{2}}}.$$

The training error ${\epsilon_{\mathrm{train}}}$ and test error ${\epsilon_{\mathrm{test}}}$ are calculated by averaging the relative error over all training and test samples, respectively. The number of parameters in the NN is denoted by ${N_{\mathrm{params}}}$ . The operator error ${\epsilon_{\mathrm{op}}}$ is calculated by averaging the relative $2$ -norm error of the matrix

$$\frac{\|G_{\eta}-G_{\eta}^{\mathrm{NN}}\|_{2}}{\|G_{\eta}\|_{2}}$$

over samples of the exact inverse operator $G_{\eta}$ and its NN approximation $G_{\eta}^{\mathrm{NN}}$ .
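The two accuracy measures described above can be computed as in the following sketch. The function names are hypothetical, and the operators are assumed to be available as dense matrices, so that the matrix $2$ -norm can be evaluated directly.

```python
import numpy as np


def relative_l2_error(u, u_nn):
    """Relative error of a predicted solution in the l2 norm."""
    return np.linalg.norm(u - u_nn) / np.linalg.norm(u)


def relative_operator_error(g, g_nn):
    """Relative error of an operator in the matrix 2-norm (largest singular value)."""
    return np.linalg.norm(g - g_nn, 2) / np.linalg.norm(g, 2)
```

The reported ${\epsilon_{\mathrm{train}}}$ , ${\epsilon_{\mathrm{test}}}$ and ${\epsilon_{\mathrm{op}}}$ would then be averages of these quantities over the corresponding sets of samples.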

For the 1D case, the domain $\Omega=[0,1]$ is discretized by a uniform Cartesian grid with $320$ points. The positive potential $\eta(x)$ is generated by (1) sampling independently from $\mathcal{N}(0,1)$ on a uniform grid with $40$ points, (2) interpolating to the $320$ -point grid via a Fourier interpolation, and (3) point-wise exponentiating followed by a factor of 10 scaling. The source term $f(x)$ is generated by sampling independently from $\mathcal{N}(0,1)$ . The results for different values of $\alpha$ (channel number) and $n_{\mathrm{cnn}}$ (layer number) are reported in Table 1. For example, the smallest network, with $\alpha=5$ and $n_{\mathrm{cnn}}=5$ , attains a test error of $4.7\times 10^{-3}$ and an operator error of $2.5\times 10^{-3}$ with only $3\times 10^{4}$ parameters; the lowest operator error in Table 1, $2.2\times 10^{-3}$ , is obtained with $\alpha=7$ and $n_{\mathrm{cnn}}=7$ . The operator error reported in Table 1 has been averaged over 100 different samples of $G_{\eta}$ . Two random samples from the test data are illustrated in Fig. 4 along with the NN prediction. A representative sample of the inverse operator $G_{\eta}$ and its NN approximation are displayed in Fig. 5.
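The 1D data generation just described can be sketched in NumPy as follows. Several details are assumptions rather than the authors' code: the function names are hypothetical, the "factor of 10 scaling" is read as multiplication by 10, the grid spacing is taken as $h=1/N$ on the periodic unit interval, and a dense direct solve stands in for whatever solver was actually used.

```python
import numpy as np


def fourier_interp(coarse, n_fine):
    """Periodic Fourier interpolation of a real signal onto n_fine > len(coarse) points."""
    n = coarse.size
    spec = np.fft.rfft(coarse)
    fine_spec = np.zeros(n_fine // 2 + 1, dtype=complex)
    fine_spec[: spec.size] = spec
    if n % 2 == 0:
        # Split the coarse Nyquist mode between +/- frequencies so that
        # the interpolant passes through the original samples.
        fine_spec[n // 2] *= 0.5
    return np.fft.irfft(fine_spec, n=n_fine) * (n_fine / n)


def sample_potential(rng, n_coarse=40, n_fine=320, scale=10.0):
    """Positive potential: Gaussian samples -> Fourier interpolation -> exp -> scaling."""
    coarse = rng.standard_normal(n_coarse)
    return scale * np.exp(fourier_interp(coarse, n_fine))


def schrodinger_matrix(eta):
    """Central-difference discretization of -u'' + eta*u with periodic boundary."""
    n = eta.size
    h = 1.0 / n
    eye = np.eye(n)
    lap = (2.0 * eye - np.roll(eye, 1, axis=1) - np.roll(eye, -1, axis=1)) / h**2
    return lap + np.diag(eta)


def solve_schrodinger(eta, f):
    """Solve the discretized equation for one source term f."""
    return np.linalg.solve(schrodinger_matrix(eta), f)
```

A training pair would then be obtained with `eta = sample_potential(rng)`, `f = rng.standard_normal(320)` and `u = solve_schrodinger(eta, f)`, repeated for several source terms per potential.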

For the 2D case, the domain $\Omega=[0,1]^{2}$ is discretized with an $80\times 80$ uniform Cartesian mesh. The potential $\eta(x)$ is generated by (1) sampling independently from $\mathcal{N}(0,1)$ on a uniform mesh with $10\times 10$ points, (2) interpolating to $80\times 80$ points via a Fourier interpolation, and (3) point-wise exponentiating followed by appropriate scaling. The source term is sampled point-wise from a standard Gaussian distribution. When trained with $\alpha=11$ and $n_{\mathrm{cnn}}=5$ , the NN achieves a test error of $2.2\times 10^{-2}$ and an operator error of $4.2\times 10^{-3}$ with only $9.3\times 10^{5}$ parameters, as reported in Table 2. The operator error ${\epsilon_{\mathrm{op}}}$ is estimated by averaging the error over 10 distinct samples of the inverse operator $G_{\eta}$ . The values of $\eta$ and $f$ of a representative sample are displayed in Fig. 6, along with the NN prediction and the error.

4.2 Divergence form

The same NN architecture is applied to the Green’s functions of the divergence form

$$-\nabla\cdot\big(\eta(x)\nabla u(x)\big)=f(x),\quad x\in\Omega,$$

with $\eta(x)\geq\eta_{0}>0$ along with the periodic boundary condition. Following the notations of Section 1,

[TABLE]

When $\eta(x)$ is close to a fixed $\eta_{0}>0$ , the operator can be decomposed as

[TABLE]

Since the operator $E_{\eta}$ depends linearly on $\eta$ , it is easy to check that the analysis for the Schrödinger form carries over to the divergence form as well.

Numerical results.

The parameter field $\eta(x)$ is generated in a way similar to the potential of the Schrödinger form, with the difference that the scaling factor is set to $1/5$ and an additive term of $0.5$ is applied point-wise to avoid the ill-conditioning of $G_{\eta}$ . The numerical results for different choices of $\alpha$ (channel number) and $n_{\mathrm{cnn}}$ (layer number) are summarized in Table 3. For example, a test error of $6.9\times 10^{-3}$ is achieved at $\alpha=9$ and $n_{\mathrm{cnn}}=5$ with $9.6\times 10^{4}$ parameters. Two random samples from the test data are illustrated in Fig. 7.

5 Radiative transfer equation with isotropic scattering

The radiative transfer equation (RTE) is a fundamental model for describing particle propagation, with applications in many fields, such as neutron transport in reactor physics [48], light transport in atmospheric radiative transfer [45], heat transfer [35], and optical imaging [34]. The steady-state RTE in the homogeneous scattering regime is

[TABLE]

where $\varphi(x,v)$ denotes the photon flux that depends on both space $x$ and angle $v$ , $f(x)$ is the light source, $\eta(x)$ is the scattering coefficient, and $\eta_{a}(x)$ is the physical absorption coefficient. In many applications, it is reasonable to assume $\eta_{a}(x)$ to be constant. Below, we focus on the most challenging case $\eta_{a}(x)\equiv 0$ .

The numerical solution to the RTE has been extensively studied using Monte Carlo methods and various discretization schemes for the differential-integral formulation Eq. 5.1 of the RTE. However, these approaches often suffer from the high dimensionality and non-smoothness of the photon flux $\varphi(x,v)$ . The recent numerical work in [9, 16, 50] follows the integral formulation by eliminating $\varphi(x,v)$ from the equation and keeping only $u(x)$ as the unknown:

[TABLE]

where the operator $K_{\eta}$ is defined as

[TABLE]

The parameterized Green’s function operator for the steady-state RTE is then

[TABLE]

Since $K_{\eta}$ is a dense operator, forming ${{G_{\eta}}}$ following Eq. 5.4 is often computationally expensive. Instead, the meta-learning approach developed above allows for approximating the map from $\eta$ to ${{G_{\eta}}}$ directly.

Section 4 argues that the map $\eta\to C_{\eta}^{(\ell)}$ for the translation invariant operator can be represented by a convolutional NN. A key observation for the current setting is that the integral equation Eq. 5.2 can be extended to the whole domain by padding $f$ and $\eta$ with zero. As a result, the map from $\eta$ to $G_{\eta}$ can be represented by a convolutional NN with zero padding.

Numerical results.

The first test is concerned with the one-dimensional slab geometry, where the parameter $\eta$ varies only in the $x_{1}$ direction (i.e., constant in the $x_{2}$ and $x_{3}$ directions). For this geometry, the integral equation Eq. 5.2 reduces to

[TABLE]

where $x$ stands for only $x_{1}$ and the operator $K_{\eta}^{(1)}$ is defined as

[TABLE]

with the domain $\Omega=[0,1]$ . In the implementation, the extended interval $[-x_{0},1+x_{0}]$ is discretized by a uniform Cartesian mesh with $N=320$ points, where $x_{0}>0$ is selected such that there are $300$ points in $\Omega$ . The scattering coefficient $\eta$ is generated in the same way as $\eta(x)$ in Section 4, followed by an appropriate rescaling. The source term $f(x)$ , positive due to physical considerations, is generated by sampling independently from $\mathcal{U}(0,1)$ instead of $\mathcal{N}(0,1)$ and interpolated via Fourier interpolation. The values of $\eta$ and $f$ outside of $\Omega$ are set to $0$ . The results for different values of $\alpha$ (channel number) and $n_{\mathrm{cnn}}$ (layer number) are summarized in Table 4. A test error of $2.9\times 10^{-3}$ is achieved with as few as $3.4\times 10^{4}$ parameters with $\alpha=n_{\mathrm{cnn}}=5$ . Two representative examples from the test set are shown in Fig. 8.
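The zero extension of $\eta$ and $f$ from the $300$ interior points to the $320$ -point mesh can be sketched as below. The helper name `zero_extend` is hypothetical, and splitting the $20$ exterior points evenly between the two sides is an assumption that corresponds to the symmetric interval $[-x_{0},1+x_{0}]$ .

```python
import numpy as np


def zero_extend(field_inside, n_total):
    """Embed samples on Omega into a larger mesh, filling the exterior with zeros."""
    n_inside = field_inside.size
    left = (n_total - n_inside) // 2           # assumed: equal padding on both sides
    right = n_total - n_inside - left
    return np.pad(field_inside, (left, right), mode="constant", constant_values=0.0)
```

A convolutional layer with zero padding applied to the extended fields then sees the same values near the boundary of $\Omega$ as it would on the whole line.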

The second test is concerned with the 2D RTE. The extended domain $[-x_{0},1+x_{0}]^{2}$ is discretized with a uniform Cartesian grid with $80\times 80$ points, where $x_{0}$ is chosen such that there are $70\times 70$ points in $\Omega=[0,1]^{2}$ . The scattering coefficient is generated in the same way as $\eta$ in Section 4 for the 2D case, followed by an appropriate rescaling. The source term $f(x)$ is generated by sampling independently from $\mathcal{U}(0,1)$ instead of $\mathcal{N}(0,1)$ . The values of $\eta$ and $f$ outside of $\Omega$ are set to $0$ . Results reported in Table 5 show that by setting $\alpha=11$ and $n_{\mathrm{cnn}}=5$ , the NN can achieve a test error of $4.4\times 10^{-3}$ with as few as $1.3\times 10^{6}$ parameters. A representative sample from the test set is illustrated in Fig. 9.

6 Conclusions

This paper presented a meta-learning approach for learning the map from the equation parameter $\eta$ to the pseudo-differential solution operator $G_{\eta}$ . Motivated by the nonstandard wavelet form [5], the pseudo-differential operator is compressed to a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned hand-in-hand in the meta-learning approach. Numerical studies are carried out for the Green’s functions of elliptic PDEs as well as the radiative transfer equation.

This approach can be extended in several directions. First, this paper is only concerned with linear operators $G_{\eta}$ . This work can be readily extended to nonlinear operators if a simple compressed representation (such as the collection of vectors used here) can be identified. Second, the ConvNet module for the map $\eta\to C_{\eta}^{(\ell)}$ can be replaced with the recently proposed multiscale NNs [19, 17, 18], which are more effective for certain global-scale convolutions.

Acknowledgments

The work of Y.F. and L.Y. is partially supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through Advanced Computing (SciDAC) program. The work of J.F. is partially supported by Stanford Graduate Fellowship in Science & Engineering and by “la Caixa” Fellowship, sponsored by the “la Caixa” Banking Foundation of Spain under Fellowship LCF/BQ/AA16/11580045. The work of L.Y. is also partially supported by the National Science Foundation under award DMS-1818449. This work is also supported by the GCP Research Credits Program from Google and AWS Cloud Credits for Research program from Amazon.

Bibliography

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
[2] M. Araya-Polo, J. Jennings, A. Adler, and T. Dahlke. Deep-learning tomography. The Leading Edge, 37(1):58–66, 2018.
[3] Y. Bengio, S. Bengio, and J. Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d’informatique et de recherche opérationnelle, 1990.
[4] J. Berg and K. Nyström. A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing, 317:28–41, 2018.
[5] G. Beylkin, R. Coifman, and V. Rokhlin. Fast wavelet transforms and numerical algorithms I. Communications on pure and applied mathematics, 44(2):141–183, 1991.
[6] G. Carleo and M. Troyer. Solving the quantum many-body problem with artificial neural networks. Science, 355(6325):602–606, 2017.
[7] F. Chollet et al. Keras. https://keras.io, 2015.
[8] I. Daubechies. Orthonormal bases of compactly supported wavelets. Communications on pure and applied mathematics, 41(7):909–996, 1988.