Adaptive activation functions accelerate convergence in deep and physics-informed neural networks

Ameya D. Jagtap; George Em Karniadakis

arXiv:1906.01170·physics.comp-ph·January 29, 2020·J. Comput. Phys.


TL;DR

This paper introduces adaptive activation functions in neural networks, significantly improving convergence speed and accuracy in solving various linear and nonlinear partial differential equations, including forward and inverse problems.

Contribution

The authors propose a scalable hyper-parameter for activation functions that adapts during training, enhancing learning capabilities and solution accuracy in physics-informed neural networks.

Findings

1. Faster convergence in training neural networks for PDEs.
2. Improved accuracy in approximating solutions of nonlinear PDEs.
3. Effective in both forward and inverse problem settings.

Abstract

We employ adaptive activation functions for regression in deep and physics-informed neural networks (PINNs) to approximate smooth and discontinuous functions as well as solutions of linear and nonlinear partial differential equations. In particular, we solve the nonlinear Klein-Gordon equation, which has smooth solutions, the nonlinear Burgers equation, which can admit high gradient solutions, and the Helmholtz equation. We introduce a scalable hyper-parameter in the activation function, which can be optimized to achieve best performance of the network as it changes dynamically the topology of the loss function involved in the optimization process. The adaptive activation function has better learning capabilities than the traditional one (fixed activation) as it improves greatly the convergence rate, especially at early training, as well as the solution accuracy. To better understand…

Tables (7)

Table 1: Burgers equation: Relative $L_{2}$ error after 2000 iterations with different values of $a$ with clean data.

	$a = 1$	Variable $a, (n = 1)$	Variable $a, (n = 5)$	Variable $a, (n = 10)$
Relative $L_{2}$ error	1.913973e-01	1.170261e-01	9.928848e-02	9.517134e-02

Table 2: Klein-Gordon equation: Relative $L_{2}$ error after 1400 iterations with different values of $a$ with clean data.

	$a = 1$	Variable $a, (n = 1)$	Variable $a, (n = 5)$	Variable $a, (n = 10)$
Relative $L_{2}$ error	1.953597e-01	1.026246e-01	9.528848e-02	9.064256e-02

Table 3: Sine-Gordon equation: Percentage relative $L_{2}$ error in $\lambda_{1}=1,\lambda_{2}=1,\lambda_{3}=1$ and $\lambda_{4}=4$ for different $N_{u}$ points using the adaptive activation function after 3500 iterations.

	$N_{u} = 50$			$N_{u} = 200$			$N_{u} = 500$
	Clean	1% noise	2% noise	Clean	1% noise	2% noise	Clean	1% noise	2% noise
$λ_{1}$	3.28550	1.439983	3.218234	2.50646	0.831884	3.420329	1.04115	0.297201	1.896238
$λ_{2}$	1.15104	2.333641	2.443504	2.04701	2.199578	1.212937	1.24181	1.380813	0.351852
$λ_{3}$	0.50759	0.120240	0.495362	2.61289	1.729769	3.965372	0.64331	0.680488	1.020026
$λ_{4}$	1.08709	1.030558	1.178467	1.77241	1.530927	1.755536	0.92602	1.200277	1.376885

Table 4: Sine-Gordon equation: Comparison of percentage relative $L_{2}$ error in $\lambda_{1}=1,\lambda_{2}=1,\lambda_{3}=1$ and $\lambda_{4}=4$ obtained after 3500 iterations for the fixed and the adaptive activation functions using $N_{u}=500$ points.

	$N_{u} = 500$ (fixed activation)			$N_{u} = 500$ (adaptive activation)
	Clean	1% noise	2% noise	Clean	1% noise	2% noise
$λ_{1}$	6.25785	10.22437	7.876128	1.04115	0.297201	1.896238
$λ_{2}$	6.22453	7.21347	5.720103	1.24181	1.380813	0.351852
$λ_{3}$	8.99409	3.74553	9.965611	0.64331	0.680488	1.020026
$λ_{4}$	3.55075	4.54000	3.60425	0.92602	1.200277	1.376885

Table 5: Sine-Gordon equation: Identification of two-dimensional sine-Gordon equation with the fixed activation function after 3500 iterations.

Correct PDE	$4 s (x, y) - u_{x x} - u_{y y} + \sin (u) = 0$
Identified PDE (Clean data)	$(3.85797) s (x, y) - (0.93743) u_{x x} - (0.93776) u_{y y} + (0.9100591) \sin (u) = 0$
Identified PDE (1 % noise)	$(3.81840) s (x, y) - (0.89776) u_{x x} - (0.92790) u_{y y} + (0.9625447) \sin (u) = 0$
Identified PDE (2 % noise)	$(3.85583) s (x, y) - (0.92124) u_{x x} - (0.94280) u_{y y} + (0.9003439) \sin (u) = 0$

Table 6: Sine-Gordon equation: Identification of two-dimensional sine-Gordon equation using the adaptive activation function with scaling parameter $n=5$ after 3500 iterations.

Correct PDE	$4 s (x, y) - u_{x x} - u_{y y} + \sin (u) = 0$
Identified PDE (Clean data)	$(3.96296) s (x, y) - (1.01041) u_{x x} - (0.98759) u_{y y} + (1.03064331) \sin (u) = 0$
Identified PDE (1 % noise)	$(3.95199) s (x, y) - (1.00297) u_{x x} - (0.98619) u_{y y} + (0.9931951) \sin (u) = 0$
Identified PDE (2 % noise)	$(3.94492) s (x, y) - (0.98104) u_{x x} - (0.99648) u_{y y} + (0.9897997) \sin (u) = 0$
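
The percentage errors reported in Table 4 appear to follow from the identified coefficients in Tables 5 and 6 through the usual formula $|\lambda_{\text{identified}}-\lambda_{\text{true}}|/|\lambda_{\text{true}}|\times 100$. The short check below is a sketch under that assumption, with $\lambda_{4}=4$ taken to be the coefficient of $s(x,y)$; the helper name `percent_error` is hypothetical.

```python
def percent_error(identified, true_value):
    """Percentage relative error of one identified PDE coefficient."""
    return abs(identified - true_value) / abs(true_value) * 100.0

# Coefficient of s(x, y) for clean data, true value 4 (Tables 5 and 6).
fixed_lambda4 = percent_error(3.85797, 4.0)     # fixed activation
adaptive_lambda4 = percent_error(3.96296, 4.0)  # adaptive activation, n = 5

print(f"fixed: {fixed_lambda4:.5f} %, adaptive: {adaptive_lambda4:.5f} %")
```

Under this reading, both values agree with the $\lambda_{4}$ entries for clean data in Table 4 to the displayed precision.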

Table 7: Sine-Gordon equation: Comparison of relative $L_{2}$ error in the solution obtained after 3500 iterations for the fixed and the adaptable activation functions.

	Clean data	1 % error	2 % error
Fixed activation function	1.923454e-03	1.996804e-03	2.396984e-03
Adaptive activation function with $n = 5$	9.93412e-04	1.420345e-03	1.986455e-03

Equations (67)

L_{k}(x^{k-1}) := w^{k} x^{k-1} + b^{k},

u_{\Theta}(x) = (L_{k} \circ \sigma \circ L_{k-1} \circ \dots \circ \sigma \circ L_{1})(x),

J(\Theta) = MSE_{F} + MSE_{u}

MSE_{F}

w^{*}

w^{m+1} = w^{m} - \eta_{l} \nabla_{w} J^{m}(w)

\sigma(a L_{k}(x^{k-1})),

a^{*} = \arg\min_{a \in \mathbb{R}^{+} \setminus \{0\}} (J(a)).

a^{m+1} = a^{m} - \eta_{l} \nabla_{a} J^{m}(a).

Sigmoid :

Hyperbolic tangent :

ReLU :

Leaky ReLU :

\sigma(na L_{k}(x^{k-1})).

u_{\tilde{\Theta}}(x) = (L_{k} \circ \sigma \circ na L_{k-1} \circ \sigma \circ na L_{k-2} \circ \dots \circ \sigma \circ na L_{1})(x).

J(\tilde{\Theta}) = \frac{1}{N_{f}} \sum_{i=1}^{N_{f}} |F(x_{f}^{i}, y_{f}^{i}, t_{f}^{i})|^{2} + \frac{1}{N_{u}} \sum_{i=1}^{N_{u}} |u^{i} - u(x_{u}^{i}, y_{u}^{i}, t_{u}^{i})|^{2}.

\tilde{\Theta}^{*} = \arg\min (\tilde{\Theta}).

J(\tilde{\Theta}) = \frac{1}{N_{u}} \sum_{i=1}^{N_{u}} |u^{i} - u(x_{u}^{i}, y_{u}^{i}, t_{u}^{i})|^{2}.

u(x) = (x^{3} - x) \frac{\sin(7x)}{7} + \sin(12x), \quad x \in [-3, 3].

u_{t} + u u_{x} = \tilde{\epsilon} u_{xx}, \quad x \in [-1, 1], \ t > 0

u(x) = \begin{cases} 0.2 \sin(6x) & \text{if } x \leq 0 \\ 1 + 0.1 x \cos(12x) & \text{otherwise.} \end{cases}

F := (u_{NN})_{t} + u_{NN} (u_{NN})_{x} - \tilde{\epsilon} (u_{NN})_{xx},

u_{tt} + \alpha \Delta u + N(u) = h(x, t), \quad x \in [-1, 1], \ t > 0,

F := u_{tt} - \Delta u_{NN} - N(u_{NN})

\Delta u + k^{2} (u) = q(x, y), \quad (x, y) \in [-1, 1]^{2}

q(x, y) = 2\pi \cos(\pi y) \sin(\pi x) + 2\pi \cos(\pi x) \sin(\pi y) + (x + y) \sin(\pi x) \sin(\pi y) - 2\pi^{2} (x + y) \sin(\pi x) \sin(\pi y)

u(x, y) = (x + y) \sin(\pi x) \sin(\pi y).

u_{tt} + N[u; \lambda] = 0

f(x, y)

g(x, y)

\frac{\partial u}{\partial x}

\frac{\partial u}{\partial y}

u(x, y, t) = 4 \tan^{-1}(\exp(x + y - t)).

Full text

Adaptive activation functions accelerate convergence in deep and physics-informed neural networks

Ameya D. Jagtap$^{a}$, George Em Karniadakis$^{a,b,*}$

$^{a}$ Division of Applied Mathematics, Brown University, 182 George Street, Providence, RI 02912, USA.
$^{b}$ Pacific Northwest National Laboratory, Richland, WA 99354, USA.

Abstract

We employ adaptive activation functions for regression in deep and physics-informed neural networks (PINNs) to approximate smooth and discontinuous functions as well as solutions of linear and nonlinear partial differential equations. In particular, we solve the nonlinear Klein-Gordon equation, which has smooth solutions, the nonlinear Burgers equation, which can admit high gradient solutions, and the Helmholtz equation. We introduce a scalable hyper-parameter in the activation function, which can be optimized to achieve best performance of the network as it changes dynamically the topology of the loss function involved in the optimization process. The adaptive activation function has better learning capabilities than the traditional one (fixed activation) as it improves greatly the convergence rate, especially at early training, as well as the solution accuracy. To better understand the learning process, we plot the neural network solution in the frequency domain to examine how the network captures successively different frequency bands present in the solution. We consider both forward problems, where the approximate solutions are obtained, as well as inverse problems, where parameters involved in the governing equation are identified. Our simulation results show that the proposed method is a very simple and effective approach to increase the efficiency, robustness and accuracy of the neural network approximation of nonlinear functions as well as solutions of partial differential equations, especially for forward problems.

Keywords: Machine learning, Deep neural networks, Inverse problems, Physics-informed neural network, Burgers equation, Klein-Gordon equation, Helmholtz equation.


1 Introduction

Neural networks (NNs) have found applications in the numerical solution of partial differential equations, integro-differential equations and dynamical systems. Since a neural network is a universal approximator, it is natural to consider the neural network space as an ansatz space for the solution of the governing equation. In [23, 25, 16], NNs are successfully used to obtain approximate solutions of partial differential equations (PDEs). One can also construct a physics-informed learning machine, which makes use of systematically structured prior information about the solution. In an earlier study, Owhadi [19] showed the promise of exploiting such prior information to construct physics-informed learning machines for the numerical homogenization problem. Raissi et al. [26, 27] employed Gaussian process regression to obtain representations of functionals of linear operators, which not only infer the solution accurately but also provide uncertainty estimates for many physical problems. Their method was further extended to nonlinear problems by Raissi and Karniadakis [23] and Raissi et al. [24] in the context of solution inference and system identification. Data-driven turbulence modeling has been developed by Wang et al. [33] and Duraisamy and co-workers in a series of papers [39, 20, 12]. Physics-informed neural networks (PINNs) can accurately solve both forward problems, where approximate solutions of the governing equations are obtained, and inverse problems, where parameters involved in the governing equation are obtained from the training data [25]. In the PINN algorithm, the loss function combines the contribution from the neural network with a residual term from the governing equation(s), which acts as a penalty that constrains the space of admissible solutions.

Figure 1 gives a sketch of a PINN algorithm for the Klein-Gordon equation, where one can see the neural network along with the supplementary physics-informed part. The loss function is evaluated using the contribution from the neural network part as well as the residual from the governing equation given by the physics-informed part. Then, one seeks the optimal values of the weights $(w)$ and biases $(b)$ that minimize the loss function, stopping when the loss falls below a certain tolerance $\epsilon$ or a prescribed maximum number of iterations is reached.

The activation function plays an important role in this training process because the derivative of the loss function with respect to the optimization parameters depends, in turn, on the derivative of the activation function. In the PINN algorithm, activation functions such as tanh and sin are used to solve various problems; see [25, 28] for more details. There is no obvious choice of activation function, since it depends on the problem at hand. To tackle this issue, various methods have been proposed in the literature: Yu et al. [38] proposed an adaptive sigmoidal activation function for multilayer feedforward NNs, while Qian et al. [21] focus on learning activation functions in convolutional NNs by combining basic activation functions in a data-driven way. Dushkoff and Ptucha [13] proposed multiple activation functions per neuron, where individual neurons select between a multitude of activation functions. A tunable activation function is proposed by Li et al. [17], where only a single hidden layer is used and the activation function is tuned. In [32], Shen et al. used a similar idea of a tunable activation function but with multiple outputs.

In this paper, the activation function is tuned for any number of hidden layers by introducing an adaptable hyper-parameter with a scaling factor. Along with the deep neural network problem, PINN-based forward and inverse problems involving smooth solutions (such as the nonlinear Klein-Gordon equation and the Helmholtz equation) as well as steep-gradient solutions (the Burgers equation) are solved using the proposed method and compared with the fixed activation function. The proposed adaptive activation function increases accuracy and speeds up convergence, especially in the early training period.

This paper is organized as follows. After the introduction in section 1, section 2 gives a brief discussion of the proposed methodology, including the training data, loss function and optimization methods. Section 3 gives the results and detailed discussions on deep neural network approximations of smooth and discontinuous functions as well as PINN-based solutions of forward/inverse problems, including the Burgers, Klein-Gordon and Helmholtz equations. Finally, in section 4, we summarize the conclusions of our work.

2 Methodology

We consider a NN of depth $D$, corresponding to a network with an input layer, $D-1$ hidden layers and an output layer. The $k^{th}$ hidden layer contains $N_{k}$ neurons. Each hidden layer of the network receives an output $x^{k-1}\in\mathbb{R}^{N_{k-1}}$ from the previous layer, where an affine transformation of the form

$$L_{k}(x^{k-1}) := w^{k} x^{k-1} + b^{k},$$

is performed. The network weights $w^{k}$ and bias term $b^{k}$ associated with the $k^{th}$ layer are chosen from independent and identically distributed (iid) samplings. The nonlinear activation function $\sigma(\cdot)$ is applied to each component of the transformed vector before sending it as an input to the next layer. At the output layer, the activation function is the identity. Thus, the final neural network representation is given by the composition

$$u_{\Theta}(x) = (L_{k} \circ \sigma \circ L_{k-1} \circ \dots \circ \sigma \circ L_{1})(x),$$

where the operator $\circ$ is the composition operator, $\boldsymbol{\Theta}=\{w^{k},b^{k}\}_{k=1}^{D}$ represents the trainable parameters in the network, $u$ is the output and $x^{0}=x$ is the input.
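
To make the composition concrete, the following is a minimal sketch of the forward pass in plain Python, assuming small dense layers stored as lists. The helper names `affine` and `forward` and the hand-picked 1-2-1 network are hypothetical and only illustrate the structure, not the architectures used later in the paper.

```python
import math

def affine(weights, bias, x):
    """L_k(x) = w^k x + b^k for one layer; weights is a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]

def forward(layers, x, sigma=math.tanh):
    """Apply sigma after every layer except the last (identity at the output)."""
    for weights, bias in layers[:-1]:
        x = [sigma(z) for z in affine(weights, bias, x)]
    weights, bias = layers[-1]
    return affine(weights, bias, x)

# Hypothetical 1-2-1 network with hand-picked parameters.
layers = [
    ([[1.0], [-1.0]], [0.0, 0.0]),  # hidden layer: 1 input -> 2 neurons
    ([[0.5, -0.5]], [0.1]),         # output layer: 2 neurons -> 1 output
]
u = forward(layers, [0.3])
print(u)
```

For this choice of parameters the output reduces analytically to $\tanh(0.3)+0.1$, which gives a quick check of the composition order.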

2.1 Training data

In supervised learning, training data is important for training the neural network. It can be obtained from the exact solution (if available) or from a high-resolution numerical solution computed with methods such as the spectral method or the discontinuous Galerkin method, depending on the problem at hand. Here we select the training points either randomly from a uniform/normal distribution or according to the small length and time scales (high-gradient regions) present in the solution space. Training data can also be obtained from carefully performed experiments, which may yield both high- and low-fidelity data sets.

2.2 Loss function and optimization algorithm

We aim to find the optimal weights for which the suitably defined loss function is minimized. In a PINN, the loss function is defined as

$$J(\Theta) = MSE_{F} + MSE_{u}$$

where the mean squared error (MSE) is given by

$$MSE_{F} = \frac{1}{N_{f}} \sum_{i=1}^{N_{f}} |F(x_{f}^{i}, y_{f}^{i}, t_{f}^{i})|^{2}, \qquad MSE_{u} = \frac{1}{N_{u}} \sum_{i=1}^{N_{u}} |u^{i} - u(x_{u}^{i}, y_{u}^{i}, t_{u}^{i})|^{2}.$$

Here $\{x_{f}^{i},y_{f}^{i},t^{i}_{f}\}_{i=1}^{N_{f}}$ denotes the residual training points in the space-time domain and $\{x_{u}^{i},y_{u}^{i},t^{i}_{u}\}_{i=1}^{N_{u}}$ denotes the boundary/initial training data. The first term requires the neural network solution to satisfy the governing equation at randomly chosen points in the domain, which constitutes the physics-informed part of the neural network, whereas the second term includes the known boundary/initial conditions, which must be satisfied by the neural network solution. The resulting optimization problem leads to finding the minimum of a loss function by optimizing the parameters, i.e., we seek to find

[TABLE]

One can approximate the solutions to this minimization problem iteratively by one of the forms of the gradient descent algorithm. The stochastic gradient descent (SGD) algorithm is widely used in the machine learning community; see [30] for a survey. In SGD the weights are updated as

$$w^{m+1} = w^{m} - \eta_{l} \nabla_{w} J^{m}(w)$$

where $\eta_{l}>0$ is the learning rate and $J^{m}$ is the loss function at the $m^{th}$ iteration. SGD methods can be initialized with some starting value $w^{0}$. In this work, the ADAM optimizer [15], which is a variant of the SGD method, is used in all problems.
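
As an illustration of the two update rules, the sketch below applies plain gradient descent and the standard Adam update to a toy scalar loss $J(w)=(w-3)^{2}$. The toy loss, the step counts and the learning rate of 0.1 are assumptions chosen for the example and are unrelated to the settings used in the paper.

```python
import math

def grad_J(w):
    """Gradient of the toy loss J(w) = (w - 3)^2 (illustrative only)."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, lr, steps):
    """Plain update w^{m+1} = w^m - eta_l * grad J(w^m)."""
    for _ in range(steps):
        w -= lr * grad_J(w)
    return w

def adam(w, lr, steps, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: rescale each step by running moment estimates of the gradient."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_J(w)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias corrections
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_gd = gradient_descent(0.0, lr=0.1, steps=200)
w_adam = adam(0.0, lr=0.1, steps=500)
print(w_gd, w_adam)
```

Both runs approach the minimizer $w=3$; Adam normalizes the step by the gradient magnitude, so its early steps have size close to the learning rate regardless of how large the gradient is.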

2.3 Adaptive activation functions

In the literature, various activation functions are available, such as sigmoid (logistic), tanh, ReLU and Leaky-ReLU [14]. The role of the activation function is to decide whether a particular neuron should fire or not. When the activation function is absent, the weights and biases simply perform a linear transformation, which corresponds to a linear regression model. Such a linear model is simple to solve but is limited in its capacity to solve complex problems. The nonlinear activation function applies a nonlinear transformation to the input data, making the network capable of learning and performing more complex tasks. In the back-propagation algorithm, the evaluation of the gradient of the loss function involves the gradient of the activation function. Activation functions make back-propagation possible, since the gradients are supplied to update the weights and biases; without a differentiable nonlinear activation function, this would not be possible. Thus, one must choose an activation function that is less prone to the vanishing and exploding gradient problems [6]. In this work, we have tried different activation functions, and their performance is compared in the results and discussion section.

The size of a network is correlated to its capacity to reproduce complicated functions. A deep network is required to solve complex problems, but it is also difficult to train. In most cases, a suitable architecture is selected based on the researcher’s experience, which is a trial-and-error approach. One can think of tuning the network to get the best performance out of it. To achieve this, we introduce the hyper-parameter $a$ in the activation function as

$$\sigma(a L_{k}(x^{k-1})),$$

which needs to be optimized. The resulting optimization problem leads to finding the minimum of a loss function by optimizing $a$ along with the weights and biases, i.e., we seek to find

$$a^{*} = \arg\min_{a \in \mathbb{R}^{+} \setminus \{0\}} (J(a)).$$

The parameter $a$ is updated as

$$a^{m+1} = a^{m} - \eta_{l} \nabla_{a} J^{m}(a).$$

Mathematically, such a hyper-parameter changes the slope of the activation function, which is one of the important aspects of neural network training. Figure 2 shows the sigmoid, tanh, ReLU and Leaky-ReLU activation functions for different values of the hyper-parameter $a$, where we can see the change in slope of the activation function with $a$. The corresponding expressions of these activation functions are given by

[TABLE]

The learning rate has a great impact on the search for global minima. A large learning rate can overshoot the global minima, whereas a small learning rate moves towards them slowly and therefore increases the computational cost. The common practice is to use a small learning rate for such optimization problems, which gives slow convergence towards the optimized value. Intuitively, one can multiply $a$ by a scale factor $n\geq 1$, which accelerates convergence towards global minima. Thus, the final form of the activation function is given by

$$\sigma(na L_{k}(x^{k-1})).$$
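
One way to see why the scale factor helps is the chain rule: since $\partial J/\partial a = n\,\partial J/\partial(na)$, a plain gradient step on $a$ changes the effective slope $na$ by $-\eta_{l}\,n^{2}\,\partial J/\partial(na)$. The sketch below checks this on a hypothetical one-parameter regression, fitting $\tanh(nax)$ to $\tanh(3x)$. It uses plain gradient descent rather than the ADAM optimizer used in the paper (under ADAM the gain would differ), and the target slope, grid and learning rate are assumptions made for the example.

```python
import math

# Toy regression (assumed for illustration): fit tanh(n*a*x) to tanh(3*x).
XS = [i / 10.0 for i in range(-10, 11)]
YS = [math.tanh(3.0 * x) for x in XS]

def loss(n, a):
    """Mean squared error of the one-parameter model tanh(n*a*x)."""
    return sum((math.tanh(n * a * x) - y) ** 2 for x, y in zip(XS, YS)) / len(XS)

def dloss_da(n, a):
    """dJ/da; the chain rule contributes the factor n * x * (1 - tanh^2)."""
    total = 0.0
    for x, y in zip(XS, YS):
        t = math.tanh(n * a * x)
        total += 2.0 * (t - y) * n * x * (1.0 - t * t)
    return total / len(XS)

def train(n, lr=0.01, steps=200):
    """Plain gradient descent on a, starting from n*a = 1."""
    a = 1.0 / n
    slopes = [n * a]
    for _ in range(steps):
        a -= lr * dloss_da(n, a)
        slopes.append(n * a)
    return a, slopes

a1, slopes1 = train(n=1)
a10, slopes10 = train(n=10)

# Ratio of the first change in the effective slope n*a: n = 10 versus n = 1.
first_step_ratio = (slopes10[1] - slopes10[0]) / (slopes1[1] - slopes1[0])
print(first_step_ratio, loss(1, a1), loss(10, a10))
```

In this toy setting the first update of $na$ is larger by a factor of $n^{2}$, and after the same number of steps the run with $n=10$ reaches a lower loss than the run with $n=1$. The same mechanism also amplifies oscillations when $n$ is too large.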

It is important to note that the introduction of the scalable hyper-parameter does not change the structure of loss function defined previously. Then, the final adaptive activation function based neural network representation is given by

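$$u_{\tilde{\Theta}}(x) = (L_{k} \circ \sigma \circ na L_{k-1} \circ \sigma \circ na L_{k-2} \circ \dots \circ \sigma \circ na L_{1})(x).$$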

In this case, the trainable parameters are $\tilde{\boldsymbol{\Theta}}=\{w^{k},b^{k},a\}_{k=1}^{D}$ . Compared to the original neural network, the adaptive activation function based PINN has one additional scalable hyper-parameter $a$ to train. The adaptive activation function based PINN algorithm is summarized as follows.

3 Results and discussion

In this section, we first approximate nonlinear smooth and discontinuous functions using neural networks with fixed and adaptive activations. Subsequently, we solve the Burgers, Klein-Gordon and Helmholtz equations, which can admit both continuous and high-gradient solutions, using PINNs with fixed and adaptive activations. We consider both forward problems, where the solution is inferred, and inverse problems, where the parameters involved in the governing equation are identified, using the proposed adaptive activation function based PINN. To show the effectiveness of the proposed method, various comparisons are made. In this study, the optimal value of the learning rate is found by experiments to be 0.0008.

3.1 Neural network approximation of nonlinear smooth and discontinuous functions

In this test case we are using a standard NN (without physics-informed part) to approximate smooth and discontinuous functions. In this case the loss function consists of just a neural network part given by

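$$J(\tilde{\Theta}) = \frac{1}{N_{u}} \sum_{i=1}^{N_{u}} |u^{i} - u(x_{u}^{i}, y_{u}^{i}, t_{u}^{i})|^{2}.$$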

The activation function is tanh and the number of hidden layers is four with 50 neurons in each layer.

Here, we shall consider two smooth functions and one discontinuous function. First, the smooth function is given by

$$u(x) = (x^{3} - x) \frac{\sin(7x)}{7} + \sin(12x), \quad x \in [-3, 3].$$

In this function both high and low frequency components are present. The number of training points $N_{u}$ used is 300. Next, we consider the Burgers equation given by

$$u_{t} + u u_{x} = \tilde{\epsilon} u_{xx}, \quad x \in [-1, 1], \ t > 0$$

with initial condition $u(x,0)=-\sin(\pi x)$, boundary conditions $u(-1,t)=u(1,t)=0$ and $\tilde{\epsilon}=0.01/\pi$. The analytical solution can be obtained using the Hopf-Cole transformation; see Basdevant et al. [3] for more details. We consider the smooth solution of the Burgers equation at time $t=0.25$ to be approximated by a deep neural network. The number of training points used is 256.

Finally, the following discontinuous function, with a discontinuity at $x=0$, is approximated by a deep neural network.

$$u(x) = \begin{cases} 0.2 \sin(6x) & \text{if } x \leq 0 \\ 1 + 0.1 x \cos(12x) & \text{otherwise.} \end{cases}$$

Here, the domain is $[-4, 3.75]$. In this case, the number of training points used is 300.
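
The smooth and discontinuous target functions of this subsection can be written down directly. The sketch below generates training sets of 300 points for each, assuming uniform random sampling with a fixed seed; the sampling scheme and the helper name `sample` are illustrative assumptions, since the text does not specify how the points were drawn for these cases.

```python
import math
import random

def u_smooth(x):
    """Smooth target on [-3, 3] with low- and high-frequency content."""
    return (x ** 3 - x) * math.sin(7.0 * x) / 7.0 + math.sin(12.0 * x)

def u_discontinuous(x):
    """Piecewise target with a jump at x = 0."""
    if x <= 0.0:
        return 0.2 * math.sin(6.0 * x)
    return 1.0 + 0.1 * x * math.cos(12.0 * x)

def sample(func, lo, hi, n_points, seed=0):
    """Draw n_points uniformly at random from [lo, hi] and evaluate func."""
    rng = random.Random(seed)
    xs = sorted(rng.uniform(lo, hi) for _ in range(n_points))
    return xs, [func(x) for x in xs]

xs_s, ys_s = sample(u_smooth, -3.0, 3.0, 300)
xs_d, ys_d = sample(u_discontinuous, -4.0, 3.75, 300)
print(len(xs_s), len(xs_d))
```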

Knowledge of the training process of a neural network is important in order to optimize it. In an earlier study, Arpit et al. [1] suggested that a neural network learns simple patterns first, before memorizing. Similar results were found by Rahaman et al. [22], who suggested that a neural network learns low frequencies first. They showed that the amplitude of each frequency component of the network output is controlled by the spectral norm of the small-size network, which increases gradually during the training process. Thus, a longer training time allows the network to learn complex functions by allowing it to capture the high-frequency components of the solution. Another such tool is the frequency principle, or F-principle, proposed by Xu et al.; see [36] and the references therein. Here, the evolution of the solution is observed in the frequency domain by taking the Fourier transform $F[u]$ of the solution, where one can see that the neural network captures the low frequencies first during training and then captures the high-frequency components. In this work, we shall use the F-principle to observe the performance of the adaptive activation function.
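
A frequency-domain view of this kind can be produced by taking a discrete Fourier transform of the sampled network output and comparing it with that of the target at successive iterations. The following is a minimal sketch using a naive $O(N^{2})$ transform on a hypothetical two-frequency signal; the function name `dft_magnitudes` and the test signal are assumptions for illustration.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitudes |F[u]_k| / N of the discrete Fourier transform, k = 0..N/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(u * cmath.exp(-2j * math.pi * k * j / n)
                    for j, u in enumerate(samples))
        mags.append(abs(coeff) / n)
    return mags

# Hypothetical signal: a strong low frequency (k = 2) plus a weak high one (k = 9).
N = 64
signal = [math.sin(2 * math.pi * 2 * j / N) + 0.3 * math.sin(2 * math.pi * 9 * j / N)
          for j in range(N)]
mags = dft_magnitudes(signal)
dominant = max(range(len(mags)), key=mags.__getitem__)
print(dominant, mags[2], mags[9])
```

Applied to the network output during training, a plot of these magnitudes against the index $k$ shows which frequency bands have already been captured.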

Figure 3 shows the smooth solution given by fixed (top row) and adaptive activation (middle row) functions. The first column (top and middle rows) shows the solutions whereas the second column (top and middle rows) shows the solution in frequency domain. The adaptive activation function captures all frequencies in 22000 iterations as opposed to fixed activation. This behavior is also reflected in their solution plots. The bottom left figure shows the loss function comparison for fixed and adaptive activation with scaling factor 10. The loss is decreasing faster in case of adaptive activation and the optimal value of $a$ is around 4.5 as shown in the bottom right figure. The neural network tries to capture the low frequency first and then high frequency components, which can be seen from the solution as well as from the corresponding frequency plots.

Figure 4 shows the smooth Burgers solution given by fixed (top row) and adaptive activation (middle row) functions. The first column (top and middle rows) shows the solutions whereas the second column (top and middle rows) shows the solution in frequency domain. We observe that even after 80000 iterations we are unable to capture all frequencies in the solution. Nonetheless, the adaptive activation function captures the frequencies faster than the fixed activation. The bottom left figure shows the loss function comparison for fixed and adaptive activation with scaling factor 10. The loss is decreasing faster in the case of adaptive activation and the optimal value of $a$ is around 1.56 as shown in the bottom right figure.

Finally, figure 5 shows the discontinuous solution given by fixed (top row) and adaptive activation (middle row) functions. The first column (top and middle rows) shows the solutions whereas the second column (top and middle rows) shows the solution in frequency domain. The adaptive activation function captures all frequencies in 28000 iterations whereas the fixed activation function fails to capture these frequencies, which can also be seen from their solution plots. The bottom left figure shows the loss function comparison for fixed and adaptive activation with $n=10$. The loss decreases faster in the case of adaptive activation. Finally, the bottom right figure gives the variation of $na$, whose optimal value is around 5.86. In this case, the neural network first captures the discontinuity, then the low-frequency components present in the $x<0$ region, and finally the high frequencies present in the $x>0$ region.

3.2 Burgers equation

The Burgers equation is one of the fundamental partial differential equations arising in various fields such as nonlinear acoustics, gas dynamics and fluid mechanics; see Whitham [35] for more details. The Burgers equation was first introduced by H. Bateman [4] and later studied by J.M. Burgers [8] in the context of the theory of turbulence. The inviscid Burgers equation is a nonlinear first-order hyperbolic partial differential equation, which can develop discontinuities even when the initial condition is sufficiently smooth. Here, in the case of vanishing viscosity, we consider the Burgers equation given by equation (4) along with its initial and boundary conditions. The nonlinearity in the convection term produces a very steep solution due to the small value of $\tilde{\epsilon}$.

The number of training data points on the boundary is 400, the number of residual training points is 10000 and

$$F := (u_{NN})_{t} + u_{NN} (u_{NN})_{x} - \tilde{\epsilon} (u_{NN})_{xx},$$

where $u_{NN}$ represents solution given by NN. The neural network architecture used for the computation consists of six hidden layers with 20 neurons in each layer.
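
The residual $F$ is what the physics-informed part of the loss penalizes. As a sanity check of its form, the sketch below evaluates $F$ with central finite differences, which stand in here for the automatic differentiation a PINN would use, on a standard travelling-wave solution of the viscous Burgers equation. The travelling wave, the step size `h`, the sample points and the larger viscosity of 0.1 are assumptions for illustration; this is not the test case with $u(x,0)=-\sin(\pi x)$ solved in the paper.

```python
import math

def burgers_residual(u, x, t, eps, h=1e-3):
    """F = u_t + u*u_x - eps*u_xx, derivatives by central differences."""
    u0 = u(x, t)
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2.0 * h)
    u_xx = (u(x + h, t) - 2.0 * u0 + u(x - h, t)) / (h * h)
    return u_t + u0 * u_x - eps * u_xx

def travelling_wave(eps, amp=1.0, speed=0.5):
    """Exact viscous Burgers solution u = c - A*tanh(A*(x - c*t)/(2*eps))."""
    def u(x, t):
        return speed - amp * math.tanh(amp * (x - speed * t) / (2.0 * eps))
    return u

eps = 0.1  # illustrative viscosity, larger than the paper's 0.01/pi
u_exact = travelling_wave(eps)
points = [(-0.5, 0.1), (0.0, 0.2), (0.3, 0.4), (0.8, 0.6)]

# Mean squared residual over the sample points, analogous to MSE_F.
mse_f = sum(burgers_residual(u_exact, x, t, eps) ** 2 for x, t in points) / len(points)
print(mse_f)
```

For an exact solution the mean squared residual is zero up to discretization error, whereas a function that does not satisfy the equation leaves a residual of order one.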

Figure 6 shows the contour plot of the solution of the Burgers equation and figure 7 shows the comparison of exact and PINN solutions of Burgers equation with $\tilde{\epsilon}=0.01/\pi$ . In the top figure, the fixed activation function without any tuning parameters is used where the value of $a$ is unity. The middle and bottom figures present the results of adaptive activation function, where $a$ is a variable and scaling factors $n=1$ and 5 are used in the respective figures. By introducing the adjustable parameter, the accuracy of the solution improves. Figure 8 (left) shows the variation of the loss function with epochs, where we can see the effect of hyper-parameter introduced in the activation function. For variable $a$ , the loss function converges faster with increasing scaling factor as compared to fixed activation function.

Figure 8 (right) shows the optimization process for variable $a$. For unit scaling factor, the optimization of $a$ is slow due to the small learning rate. One can speed up the tuning process further by increasing the scaling factor, as shown in the figure.

Table 1 gives the relative $L_{2}$ error for fixed as well as adaptive activation function, where it can be observed that the error decreases with increasing scaling factor.
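
For reference, a relative $L_{2}$ error of this kind is commonly computed as $\|u_{\text{pred}}-u_{\text{exact}}\|_{2}/\|u_{\text{exact}}\|_{2}$ over a set of evaluation points. The sketch below assumes that definition; the function name and the two-point sample vectors are illustrative and are not data from the paper.

```python
import math

def relative_l2_error(u_pred, u_exact):
    """||u_pred - u_exact||_2 / ||u_exact||_2 over matching sample points."""
    num = math.sqrt(sum((p - e) ** 2 for p, e in zip(u_pred, u_exact)))
    den = math.sqrt(sum(e ** 2 for e in u_exact))
    return num / den

exact = [3.0, 4.0]
err_same = relative_l2_error(exact, exact)        # perfect prediction
err_zero = relative_l2_error([0.0, 0.0], exact)   # predicting zero everywhere
print(err_same, err_zero)
```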

To compare the fixed and adaptive activation functions, we plot the neural network solution of the Burgers equation at $t=0.25,0.5$ and $0.75$ (column-wise) in the frequency domain. Figure 9 shows the results for the ReLU activation function. The first row shows the fixed activation function, whereas the second row gives the result for the adaptive activation. In both cases, the activation is unable to capture the frequencies present in the solution. Thus, ReLU is not a good activation function for this problem. Next, we used the ’tanh’ activation, and figure 10 shows the corresponding results. With the adaptive activation function, the correct frequencies are captured faster than with the fixed activation function for $t=0.5$ and $0.75$. In the middle figures (top and bottom), the adaptive activation function captures almost all frequencies in a short duration, unlike the fixed activation. Similar behavior can be observed in the right figures (top and bottom), where the dominant frequencies are captured in just 1000 iterations using the adaptive activation, shown by the blue curve (dash-dot line). Figure 11 shows the results for the ’sin’ activation function, where a similar trend can be seen for both activation functions at $t=0.5$ and $0.75$. For the smooth solution at $t=0.25$, all activation functions fail to capture the frequencies present in the solution. This is consistent with our previous example of approximating the smooth Burgers solution at $t=0.25$ using a NN, which required a large number of iterations to capture all frequencies present in the solution.

The introduction of a scalable hyper-parameter in the activation function dynamically changes the topology of the loss function, thereby achieving faster convergence towards global minima. Figure 12 (top row) shows the initial and final optimized 'tanh' (left) and 'sin' (right) activation functions. The corresponding activation planes ($na-x$ plane) and activation surfaces are given in the middle and bottom rows, respectively. In both cases the gradient of the activation function increases from its initial stage in the direction of large values of $na$, which contributes to the fast learning process of the neural network.
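As a short worked step, and assuming the adaptive activations take the form $\tanh(nax)$ and $\sin(nax)$ suggested by the $na-x$ activation plane (an assumption made here for illustration), the slope of the activation with respect to its input can be written explicitly:

```latex
\frac{\partial}{\partial x}\tanh(n a x) = n a\,\mathrm{sech}^{2}(n a x),
\qquad
\frac{\partial}{\partial x}\sin(n a x) = n a\,\cos(n a x).
```

Both derivatives have magnitude at most $na$, attained at $x=0$, so the slope near the origin grows linearly with the product $na$. This agrees with the steeper optimized activations described for figure 12.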

3.2.1 Initialization of scaled hyper-parameter and effect of large scaling factor

Initialization of the scaled hyper-parameter can be done in various ways, as long as the chosen value does not cause the loss to diverge. In this work the scaled hyper-parameter is initialized as $na=1,\ \forall n\geq 1$. Although increasing the scaling factor speeds up convergence, it also makes the parameter $a$ more sensitive. This can be seen from the oscillations in the loss function as well as in the values of $a$, figure 8 (right). The reason is that the SGD optimization algorithm becomes very sensitive; hence we should not use a large scaling factor, which may eventually cause the solution to diverge. One way to overcome this difficulty is to add a regularization term to the loss function. Although this partially suppresses the oscillations in the loss function, it also deteriorates the convergence rate, which in turn reduces the accuracy of the solution. The regularization term can also be added after some iterations (instead of initially), which may offset the slower convergence, but the oscillations may still not be removed fully. Moreover, finding the correct value of the regularization weight still relies on trial and error.
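The trade-off between speed and sensitivity can be illustrated with a deliberately simple toy problem. The sketch below is not the paper's network or optimizer: it assumes plain gradient descent on $a$ for the hypothetical loss $\tfrac{1}{2}(na-s^{*})^{2}$, with an arbitrary target $s^{*}=1.5$ and learning rate $0.01$. By the chain rule the gradient with respect to $a$ carries a factor $n$, so each step changes the product $na$ by an amount proportional to $n^{2}$.

```python
import numpy as np

def train_slope(n, lr=0.01, target=1.5, steps=200):
    """Gradient descent on a for the toy loss 0.5 * (n*a - target)**2.

    Returns the history of the product n*a. The loss, target and learning
    rate are illustrative assumptions, not values taken from the paper.
    """
    a = 1.0 / n  # initialisation n*a = 1, as stated in the text
    history = []
    for _ in range(steps):
        grad_a = n * (n * a - target)  # chain rule: dL/da = n * dL/d(na)
        a -= lr * grad_a
        history.append(n * a)
    return np.array(history)

for n in (1, 5, 12, 15):
    s = train_slope(n)
    print(f"n = {n:2d}   final |n*a - target| = {abs(s[-1] - 1.5):.3e}")
```

In this toy setting the error in $na$ is multiplied by $1-\eta n^{2}$ at each step ($\eta$ being the learning rate), so a larger $n$ first speeds up convergence, then produces sign-alternating oscillations once $\eta n^{2}>1$, and finally diverges once $\eta n^{2}>2$. Adaptive optimizers rescale their steps, so the thresholds differ in practice, but the qualitative picture is the same as the one described above.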

3.3 Klein-Gordon equation

The nonlinear Klein-Gordon equation is a second-order hyperbolic partial differential equation arising in many scientific fields such as soliton dynamics and condensed matter physics [9], solid state physics, quantum field theory and nonlinear optics [34], and nonlinear wave equations [11]. The inhomogeneous Klein-Gordon equation is given by

[TABLE]

where the initial and boundary conditions are extracted from the exact solution $u(x,t)=x\cos(t)$. Here $\Delta$ is the Laplacian operator and $N(u)=\beta u+\gamma u^{k}$ is the nonlinear term with quadratic nonlinearity ($k=2$) or cubic nonlinearity ($k=3$); $\alpha,\beta,\gamma$ are constants.

Now, let us consider the one-dimensional test case where the computational domain is $[-1,\,1]$, the initial conditions are $f(x)=x,\ g(x)=0$ with $\alpha=-1,\beta=0,\gamma=1$, $k=2$ and $h(x,t)=-x\cos(t)+x^{2}\cos^{2}(t)$. The number of training data points on the boundary is 500, the number of residual training points is 10000 and

[TABLE]
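As a consistency check on this test case, the forcing term can be recovered from the exact solution. The check assumes the governing equation has the form $u_{tt}+\alpha\Delta u+N(u)=h(x,t)$ with $u(x,0)=f(x)$ and $u_{t}(x,0)=g(x)$; this form is inferred from the symbols defined in the text, not quoted from the displayed equation.

```latex
\begin{aligned}
u &= x\cos(t), \qquad u_{tt} = -x\cos(t), \qquad u_{xx} = 0, \\
N(u) &= \beta u + \gamma u^{k} = x^{2}\cos^{2}(t)
      \quad (\beta = 0,\ \gamma = 1,\ k = 2), \\
h &= u_{tt} + \alpha u_{xx} + N(u) = -x\cos(t) + x^{2}\cos^{2}(t), \\
u(x,0) &= x = f(x), \qquad u_{t}(x,0) = -x\sin(0) = 0 = g(x).
\end{aligned}
```

Under that assumption the stated $h(x,t)$, $f(x)$ and $g(x)$ all agree with the exact solution $u(x,t)=x\cos(t)$. Since $u_{xx}=0$, the value $\alpha=-1$ does not enter $h$.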

The neural network architecture used for the computation consists of two hidden layers with 40 neurons in each layer. Figure 13 shows the contour plot of the solution of the Klein-Gordon equation and figure 14 compares the exact and PINN solutions with the fixed activation function (top) and the adaptive activation function with variable $a$ and $n=1$ (middle) and $n=5$ (bottom). There is no difference between the solutions for $n=5$ and $10$, hence only the solution for $n=5$ is plotted. In these figures we can see that the solution improves with increasing $n$. Figure 15 (left) shows the loss function versus epochs. Again, convergence is faster for the adaptive activation function with increasing $n$. Also, the value of $a$ converges to its optimized value, which is nearly 1.548, with increasing scaling factor, as shown in figure 15 (right).

We plot the solution at $x=-0.5$ and $0.5$ (column-wise) in the frequency domain and compare the results of the fixed (top row) and adaptive (bottom row) activation functions in figure 16. The adaptive activation function captures all frequencies contained in the solution much faster than the fixed activation function. At both locations, all frequencies are captured in approximately 400 iterations by the adaptive activation, whereas the fixed activation takes more than 1500 iterations.
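A frequency-domain comparison of this kind can be sketched with a discrete Fourier transform. The snippet below is only an illustration: the sampling window, the number of samples and the synthetic "network" trace are assumptions, and the helper names `amplitude_spectrum` and `spectral_mismatch` are hypothetical. Only the exact solution $u=x\cos(t)$ at $x=0.5$ is taken from the text.

```python
import numpy as np

def amplitude_spectrum(samples):
    """One-sided amplitude spectrum of uniformly sampled real data."""
    samples = np.asarray(samples, dtype=float)
    return np.abs(np.fft.rfft(samples)) / samples.size

def spectral_mismatch(predicted, exact):
    """Per-frequency gap between predicted and exact amplitude spectra."""
    return np.abs(amplitude_spectrum(predicted) - amplitude_spectrum(exact))

# Assumed sampling: one full period in t at the fixed location x = 0.5.
t = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
u_exact = 0.5 * np.cos(t)                     # x * cos(t) at x = 0.5
u_network = u_exact + 0.05 * np.cos(3.0 * t)  # synthetic spurious harmonic

gap = spectral_mismatch(u_network, u_exact)
print("dominant exact mode:", int(np.argmax(amplitude_spectrum(u_exact))))
print("worst-resolved mode:", int(np.argmax(gap)))
```

Evaluating `spectral_mismatch` on network snapshots saved at successive iterations would show how quickly each frequency is captured, which is the quantity compared between the fixed and adaptive activations.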

3.4 Helmholtz equation

The Helmholtz equation is one of the fundamental equations of mathematical physics, arising in many physical problems such as vibrating membranes, acoustics and electromagnetism; see the book by Sommerfeld [31] for more details. In two dimensions it is given by

[TABLE]

with homogeneous Dirichlet boundary conditions. The forcing term is given by

[TABLE]

The exact solution for $k=1$ is

[TABLE]

In this case, the network has two hidden layers with 40 neurons each. The number of residual training points is 16000 and the number of $N_{u}$ points is 400. Figure 17 compares the exact solution with the PINN solutions using fixed and adaptive $(n=10)$ activation functions at three different locations, marked by the black dashed lines in the contour plot. The relative $L_{2}$ error at the end of the 3600th iteration is 1.0591e-1 with the fixed activation function and 7.1945e-2 with the adaptive activation. The solution in the frequency domain is shown in figure 18 for the fixed (first row) and variable $a$, $n=10$ (second row) 'tanh' activation functions. The adaptive activation function captures frequencies faster than the fixed one at both locations, $x=-0.5$ (first column) and $x=0.5$ (second column). Finally, figure 19 (left) compares the loss functions, showing the faster convergence of the adaptive activation relative to the fixed activation, and figure 19 (right) shows the variation of $a$ with the number of iterations, where the optimal value of $a$ is close to 3.
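For reference, the relative $L_{2}$ error quoted above is taken here to be the usual ratio $\lVert u_{\text{pred}}-u_{\text{exact}}\rVert_{2}/\lVert u_{\text{exact}}\rVert_{2}$ evaluated on a set of test points. This definition is an assumption, since it is not restated in this section, and the helper name `relative_l2_error` is hypothetical. The sketch also computes the relative reduction implied by the two reported Helmholtz errors.

```python
import numpy as np

def relative_l2_error(predicted, exact):
    """Relative L2 error ||predicted - exact||_2 / ||exact||_2."""
    predicted = np.asarray(predicted, dtype=float)
    exact = np.asarray(exact, dtype=float)
    return np.linalg.norm(predicted - exact) / np.linalg.norm(exact)

# Errors reported in the text after 3600 iterations (Helmholtz test case).
err_fixed = 1.0591e-1
err_adaptive = 7.1945e-2

# Fraction by which the adaptive activation lowers the reported error.
reduction = 1.0 - err_adaptive / err_fixed
print(f"relative reduction in error: {100.0 * reduction:.1f}%")
```

With the reported values the adaptive activation lowers the relative $L_{2}$ error by roughly 32% at the same iteration count.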

3.5 Inverse problem for two-dimensional sine-Gordon equation

An inverse problem is formulated as estimating the function $u\in\mathcal{U}$ from the data $s\in\mathcal{S}$, where $s=T(u)+\text{noise}$. Here $\mathcal{U},\mathcal{S}$ are topological vector spaces and $T:\mathcal{U}\rightarrow\mathcal{S}$ maps a given function to the data it gives rise to in the absence of noise. Machine learning applied to inverse problems can be framed as the problem of reconstructing a nonlinear mapping $T^{*}:\mathcal{S}\rightarrow\mathcal{U}$ that satisfies the pseudo-inverse property $T^{*}(s)\approx u$ whenever the data $s$ is related to $u$. An important step in machine learning approaches is to parameterize the pseudo-inverse operator; the learning phase then refers to choosing optimal parameters from training data by minimizing a suitably defined loss function, which gives the learned pseudo-inverse operator $T^{*}_{\tilde{\boldsymbol{\Theta}}}$. The training data are iid realizations of a $\mathcal{U}\times\mathcal{S}$-valued random variable $(u,s)$ with known probability density function. These data are rich enough to allow a machine learning scheme to identify the structure of the nonlinear mapping $T$ given by the governing physical laws.

Various approaches have been proposed in the literature for data-driven discovery of governing differential equations: for example, Raissi and Karniadakis [23] as well as Raissi et al. [27] in the context of Gaussian processes; Rudy et al. [30], who proposed a sparse regression based on a library of candidate terms and sparse model selection to select the important terms involved in the governing equation; and, more recently, Berg and Nyström [7]. One efficient machine-learning-based approach for solving an inverse problem, given by Raissi et al. [25], is the data-driven discovery of partial differential equations by writing the governing equation as

[TABLE]

where $\mathcal{N}[\cdot]$ contains parameterized linear/nonlinear terms. The network's job is to identify the unknown parameters $\lambda=\{\lambda_{1},\lambda_{2},\cdots\}$ as well as to obtain a qualitatively accurate reconstruction of the given solution. Here, we consider a two-dimensional sine-Gordon test case for the inverse problem, where $N(u)=\phi\,\sin(u)$. In this test case $\phi=-1$ and the domain is $-7\leq x,y\leq 7$. The initial conditions are given as

[TABLE]

and boundary conditions are

[TABLE]

which has the analytical solution

[TABLE]

In the case of the inverse problem, the loss function is the same as given by equation (2) but the MSE is given by

[TABLE]

Here $\{x_{u}^{i},y_{u}^{i},t^{i}_{u}\}_{i=1}^{N_{u}}$ denotes the training data points from the boundary as well as from inside the domain. The loss $MSE_{u}$ corresponds to the training data on the solution $u(x,y,t)$, whereas $MSE_{\mathcal{F}}$ enforces the governing equation on the same set of training data points. For our analysis, 50, 200 and 500 randomly selected training data points are used. The neural network architecture consists of four hidden layers with 20 neurons in each layer.

To scrutinize the performance of the proposed approach carefully, we vary the noise level in the training data. Given the noisy measurements of the data $(x^{i},y^{i},t^{i},u^{i})$, we are interested in learning the parameters $\lambda=\{\lambda_{1},\lambda_{2},\cdots\}$. We define $\mathcal{F}$ as

[TABLE]

where the function $s(x,y)$ is given by

[TABLE]

and $\lambda_{1}=\lambda_{2}=\lambda_{3}=1,\lambda_{4}=4$ .
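The text does not spell out how the noisy training sets are generated, so the following is only a sketch under stated assumptions: training points are drawn uniformly at random without replacement, and a noise level of 1% is taken to mean zero-mean Gaussian noise with standard deviation equal to 1% of the standard deviation of the clean data. The array `u_clean` is a placeholder signal, and the helper names `sample_training_points` and `add_noise` are hypothetical.

```python
import numpy as np

def sample_training_points(num_available, num_points, rng):
    """Indices of num_points distinct, randomly chosen measurements."""
    return rng.choice(num_available, size=num_points, replace=False)

def add_noise(u, level, rng):
    """Add zero-mean Gaussian noise with std = level * std(u) (assumed model)."""
    u = np.asarray(u, dtype=float)
    return u + level * np.std(u) * rng.standard_normal(u.shape)

rng = np.random.default_rng(0)

# Placeholder for the clean measurements u(x, y, t); not the paper's data.
u_clean = np.sin(np.linspace(0.0, 10.0, 10_000))

# N_u = 500 random training points with 1% noise.
idx = sample_training_points(u_clean.size, 500, rng)
u_train = add_noise(u_clean[idx], level=0.01, rng=rng)
print(u_train.shape)
```

Repeating the last two lines with 50 or 200 points and with `level` set to 0.0 or 0.02 gives training sets in the same pattern as the clean, 1% and 2% noise cases, subject to the assumed noise model.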

Table 3 shows the percentage relative $L_{2}$ error in the $\lambda$'s for the adaptive activation function using 50, 200 and 500 training data points. The error decreases as the number of training points increases. Moreover, for some $\lambda$'s the error with $1\%$ noise is slightly smaller than with clean data. Figure 20 shows the PINN solution using the adaptive activation function for different numbers of $N_{u}$ points, which accurately approximates the solution in the domain. The PINN solution is also compared with the exact solution at the locations $x=0,2.8$ and 5.6. Table 4 compares the percentage relative $L_{2}$ error in the $\lambda$'s using fixed and adaptive activation functions after 3500 iterations. With the fixed activation the maximum error is almost 9% for clean data, whereas it is just 1.24% with the adaptive activation function. This shows that the adaptive activation increases the accuracy of the solution. Table 5 shows the correct PDE along with the PDEs identified from clean data and from data with 1% and 2% noise using the fixed activation function, whereas table 6 shows the corresponding PDEs using the adaptive activation function, which is more accurate than its fixed counterpart. These results show that the neural network is able to identify the unknown parameters in the governing equations with very good accuracy, even in the presence of noise in the training data. To quantify the accuracy, the relative $L_{2}$ error is given in table 7; it is again small for the adaptive activation function based PINN. Finally, the loss function using fixed and adaptive activation functions is plotted for the three different data sets in figure 21, which shows that the loss of the adaptive activation function based PINN converges faster.

4 Conclusions

Improving the performance of deep learning algorithms is important for designing fast and accurate machine learning techniques. Introducing a scalable hyper-parameter in the activation function not only accelerates the convergence of the neural network but also yields better accuracy. To support this claim, various forward and inverse problems are solved using deep neural networks and physics-informed neural networks, with smooth solutions (given by the Klein-Gordon and Helmholtz equations) as well as high-gradient solutions (given by the Burgers equation), in one and two dimensions. In all cases, the loss function decays faster with the adaptive activation function, and correspondingly the relative $L_{2}$ error in the solution is smaller for the proposed approach. To investigate the performance of the scalable hyper-parameter based adaptive activation function, the neural network solution is plotted in the frequency domain, revealing that the frequencies in the solution are captured faster than with the fixed activation function. The proposed approach can be used in PINNs as well as in standard neural networks and is a promising and simple way to increase the efficiency, robustness and accuracy of neural network based approximation of nonlinear functions as well as of solutions of partial differential equations, especially for forward problems.

Acknowledgement

This work was supported by the Department of Energy PhILMs grant DE-SC0019453, and by the DARPA-AIRA grant HR00111990025.

Bibliography

[1] D. Arpit, et al., A closer look at memorization in deep networks, arXiv preprint arXiv:1706.05394, 2017.
[2] A.R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inform. Theory, 39(3), 930-945, 1993.
[3] C. Basdevant, et al., Spectral and finite difference solution of the Burgers equation, Comput. Fluids, 14 (1986) 23-41.
[4] H. Bateman, Some recent researches on the motion of fluids, Monthly Weather Review, 43(4), 163-170, 1915.
[5] A.G. Baydin, B.A. Pearlmutter, A.A. Radul, J.M. Siskind, Automatic differentiation in machine learning: a survey, Journal of Machine Learning Research, 18 (2018) 1-43.
[6] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, 5(2), 157-166, 1994.
[7] J. Berg, K. Nyström, Data-driven discovery of PDEs in complex datasets, J. Comput. Phys. 384 (2019) 239-252.
[8] J.M. Burgers, A mathematical model illustrating the theory of turbulence, in: Advances in Applied Mechanics, Vol. 1, pp. 171-199, 1948.