Moderate Adaptive Linear Units (MoLU)
Hankyul Koh, Joon-hyuk Ko, Wonho Jhe

TL;DR
MoLU is a new smooth activation function for deep neural networks that improves convergence speed and accuracy, suitable for various architectures including LLMs, Neural ODEs, PINNs, and CNNs.
Contribution
Introduces MoLU, a mathematically elegant, infinitely differentiable activation function that enhances training efficiency and performance in diverse neural network models.
Findings
MoLU outperforms GeLU, SiLU, and Mish in convergence speed.
MoLU achieves higher final accuracy across tested architectures.
MoLU's smoothness mitigates vanishing and exploding gradient issues.
Abstract
We propose the Moderate Adaptive Linear Unit (MoLU), a novel activation function for deep neural networks, defined analytically as: f(x)=x \times (1+tanh(x))/2. MoLU combines mathematical elegance with empirical effectiveness, exhibiting superior performance in terms of prediction accuracy, convergence speed, and computational efficiency. Due to its C-infinity smoothness, i.e. infinite differentiability and analyticity, MoLU is expected to mitigate issues such as vanishing or exploding gradients, making it suitable for a broad range of architectures and applications, including large language models (LLMs), Neural Ordinary Differential Equations (Neural ODEs), Physics-Informed Neural Networks (PINNs), and Convolutional Neural Networks (CNNs). Empirical evaluations show that MoLU consistently achieves faster convergence and improved final accuracy relative to widely used activation…
| Comparison with outputs of some activation functions | |||||
|---|---|---|---|---|---|
| Input | GeLU | Swish | Mish | ELU() | MoLU() |
| -7 | -2.33146835e-15 | -6.37735836e-03 | -6.38026341e-03 | -9.99088118e-01 | -1.16414021e-05 |
| -6 | -8.43964898e-11 | -1.48357389e-02 | -1.48540805e-02 | -9.97521248e-01 | -7.37305482e-05 |
| -5 | -2.29179620e-07 | -3.34642546e-02 | -3.35762377e-02 | -9.93262053e-01 | -4.53999296e-04 |
| -4 | -7.02459482e-05 | -7.19448398e-02 | -7.25917408e-02 | -9.81684361e-01 | -2.68370062e-03 |
| -3 | -3.63739208e-03 | -1.42277620e-01 | -1.45647461e-0 | -9.50212932e-01 | -1.48723912e-02 |
| -2 | -4.54023059e-02 | -2.38405844e-01 | -2.52501483e-01 | -8.64664717e-01 | -7.32298040e-02 |
| -1 | -1.58808009e-01 | -2.68941421e-01 | -3.03401461e-01 | -6.32120559e-01 | -2.64248689e-01 |
| 0 | 0.00000000e+00 | 0.00000000e+00 | 0.00000000e+00 | 0.00000000e+00 | 0.00000000e+00 |
| 1 | 8.41191991e-01 | 7.31058579e-01 | 8.65098388e-01 | 1.00000000 | 1.00000000 |
| 2 | 1.95459769 | 1.76159416 | 1.94395896 | 2.00000000 | 2.00000000 |
| 3 | 2.99636261 | 2.85772238 | 2.98653500 | 3.00000000 | 3.00000000 |
| 4 | 3.99992975 | 3.92805516 | 3.99741281 | 4.00000000 | 4.00000000 |
| 5 | 4.99999977 | 4.96653575 | 4.99955208 | 5.00000000 | 5.00000000 |
| 6 | 6.00000000 | 5.98516426 | 5.99992663 | 6.00000000 | 6.00000000 |
| 7 | 7.00000000 | 6.99362264 | 6.99998838 | 7.00000000 | 7.00000000 |
| 8 | 8.00000000 | 7.99731720 | 7.99999820 | 8.00000000 | 8.00000000 |
| Train losses ()/Standard errors() | ||||||
| MoLU | GeLU | Mish | SiLU | ELU | Tanh | |
| Train loss | 2.02 | 2.85 | 3.69 | 1.77 | 4.47 | 7.03 |
| Std. err. | 2.74 | 11.57 | 21.02 | 1.17 | 5.41 | 5.16 |
| MNIST (Accuracy(%)) | ||||
|---|---|---|---|---|
| Tanh | ReLU | Leaky ReLU | MoLU | |
| 1 ep. | 65.52 % | 70.97 % | 71.28 % | 81.36 % |
| 2 ep. | 75.24 % | 87.44 % | 87.51 % | 89.80 % |
| 3 ep. | 82.08 % | 90.49 % | 90.52 % | 92.00 % |
| 4 ep. | 85.87 % | 91.59 % | 91.52 % | 93.09 % |
| 5 ep. | 88.21 % | 92.83 % | 92.82 % | 94.27 % |
| 10 ep. | 92.87 % | 95.40 % | 95.38 % | 96.10 % |
| 20 ep. | 95.35 % | 97.08 % | 97.05 % | 97.42 % |
| 30 ep. | 96.56 % | 97.68 % | 97.71 % | 97.97 % |
| 40 ep. | 97.14 % | 98.12 % | 98.11 % | 98.19 % |
| 50 ep. | 97.60 % | 98.27 % | 98.32 % | 98.38 % |
| CIFAR10 (Top-1 Accuracy) | ||||
| Tanh | ReLU | Leaky ReLU | MoLU | |
| 1 ep. | 47.28 % | 58.04 % | 63.95 % | 66.16 % |
| 2 ep. | 55.24 % | 73.96 % | 74.09 % | 73.84 % |
| 3 ep. | 59.47 % | 77.38 % | 77.17 % | 77.30 % |
| 4 ep. | 65.23 % | 79.36 % | 77.23 % | 77.91 % |
| 5 ep. | 66.98 % | 77.46 % | 78.80 % | 78.67 % |
| 10 ep. | 70.01 % | 80.51 % | 81.71 % | 80.58 % |
| 20 ep. | 70.41 % | 82.60 % | 83.00 % | 82.86 % |
| 30 ep. | 72.60 % | 83.20 % | 83.51 % | 82.97 % |
| CIFAR10 (Top-5 Accuracy | ||||
| Tanh | ReLU | Leaky ReLU | MoLU | |
| 1 ep. | 91.29 % | 95.25 % | 96.94 % | 97.30 % |
| 2 ep. | 94.48 % | 98.26 % | 98.49 % | 98.40 % |
| 3 ep. | 95.53 % | 98.65 % | 98.48 % | 98.67 % |
| 4 ep. | 96.67 % | 98.97 % | 98.54 % | 98.80 % |
| 5 ep. | 96.79 % | 98.67 % | 98.72 % | 98.62 % |
| 10 ep. | 97.35 % | 98.72 % | 98.84 % | 98.72 % |
| 20 ep. | 96.92 % | 98.66 % | 98.69 % | 98.63 % |
| 30 ep. | 97.02 % | 98.74 % | 98.73 % | 98.53 % |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNeural Networks and Applications · Machine Learning and ELM
]Physics, College of Natural Sciences Dept. of Physics & Astronomy, Seoul National University, South Korea
]Physics, College of Natural Sciences Dept. of Physics & Astronomy, Seoul National University, South Korea
]Physics, College of Natural Sciences Dept. of Physics & Astronomy, Seoul National University, South Korea
Moderate Adaptive Linear Units (MoLU)
Hankyul Koh
[
Joon-hyuk Ko
[
Seoul National University Seoul National University
Wonho Jhe
[
Seoul National University
Abstract
We propose a new high-performance activation function, Moderate Adaptive Linear Units (MoLU), for the deep neural network. The MoLU, defined as : , is a simple, beautiful and powerful activation function that can be a good main activation function among hundreds of activation functions. Because the MoLU is made up of the elementary functions, not only it is a -diffeomorphism (i.e. smooth and infinitely differentiable over whole domains), but also it decreases training time.
††preprint: APS/123-QED
I Introduction
In NeuralODEs (Neural Ordinary Differential Equations), people used to use the ELU(Exponential Linear Units) [1] or the Tanh function for the activation functions, because of its differentiability over the whole domain. Prediction performance of NeuralODEs was not good for longer time-series data even short time-series datasets. People attributed this low performances to just the intrinsic property of NeuralODEs. Furthermore, NeuralODEs had a severe problem that it tooks very long time duration for learning. Our homotopy method did well not only be shorten the learning time duration, but also increase the accuracy of NeuralODEs. Although we have already achieved our work very successfully, we desired more both improving the accuracy and reducing the learning time duration. Now, our interest is directed to the detail part, not the whole system, the activation function.
GeLu(Gaussian error linear units) [2] was used to NeuralODEs in [3]. We used the GeLU in [4] also, because it showed better performance than the Tanh. However, we found that our activation function shows both higher performances and improved accuracies.
ReLU (Rectified Linear Units) is mainly used in the fields of vision classification. In classification, ordinarilly more layers is better in deep learning. On datasets such as MNIST, even simple CNNs with three layers can achieve high classification performance. However, for more challenging datasets such as CIFAR10, shallow networks have limitations. As is well known, ResNet won the 2015 ILSVRC competition with 152 layers. However, in [5], ResNet with more than 1000 layers deeper than 152 layers show the lower performance, which means that the higher number of layers is not necessarily better. We will show that our activation function is better than the ReLU for the classification problem also.
II MoLU Formulation
II.1 MoLU
The MoLU is a simple, beautiful and powerful activation function that consists of a combination of hyperbolic tangent and exponential functions. The slope of the MoLU in the negative integer region make it possible to escape from the local minima. On the other hand, in the positive integer region, the slope of the MoLU is almost always unity, so that it allows us to find the global minimum of a loss function in a stable manner like ReLU. The MoLU is defined as below in Eq. (1).
[TABLE]
II.2 Motivation
We realized that our activation function is not only showing a good performance for the accuracy but also converging to zero rapidly when updating a loss function during a test on some mathematical model and neural networks. We formulated our activation function in the following order. First, we used the hyperbolic tangent function as a basic framework, and then multiplied by the identity function to show the behavior of the identity function in the positive integer region. Lastly, we composited the exponential function to the hyperbolic tangent function to make it converge to zero asymptotically in the negative integer region. When we expanded the Mish[6] using a Taylor series, surprisingly, we happened to know that our activation function is related to the Mish.
[TABLE]
This wonderful coincidence inspired us why our activation function works well at a faster rate.
II.3 Comparison with other activation functions
Table 1 show that the obvious boundary between the linear part in the positive integer region and the non-linear part in the negative integer region than some other activation functions. Our activation function acts like the identity function, such as ReLU or ELU, in the positive integer region so that it does not lose any information, and its non-linearity in the negative integer region give it an ability to escape from the local minima.
III Experiment
The robust property of MoLU is to approach rapidly to the value of minimum of a loss function without losing stability. This is a truly useful characteristic when training long time-series data by using NeuralODEs(Neural Ordinary Differential Equations). To prove the performance of MoLU, we conducted experiment on NeuralODEs, MNIST, and CIFAR10. In NeuralODEs, the differentiable activation functions are mainly used, so we compared MoLU with GeLU, Mish, SiLU, ELU, Tanh, and in case of the classification, compared it with ReLU, Leaky ReLU and Tanh. We used .
III.1 NeuralODEs
We begin our experiment with Lotka-Volterra model, the commonly used simple model of NeuralODEs. Coefficients and initial conditions in the Lotka-Volterra equations were all identical to the setting in [4]. Following [3], we generated training data by numerically over the time span , then adding Gaussian noise with zero mean and standard deviation of the mean of each channel. We set 4,000 epochs for each experiment, the learning rate was 0.02, and random seeds were used for 10, 20, 30 for each activation function, and standard errors were used.
Fig. 3 and Fig. 4 show that our activation function is not only rapidly approaching the minimum of a loss function, but also showing a very stable performance.
III.2 MNIST
We conducted experiments with MNIST, the most commonly used datasets in the field of image classification. We used MNIST datasets in torchvision and used 2 layered Networks which is optimized using SGD on a batch size of 64 with a learning rate of 0.001 and a momentum of 0.5 with random seed of 10. We confirmed that our activation function shows a high-performance. Compared to other activation functions, our activation function clearly shows the characteristics of converging rapidly at the beginning of learning.
III.3 CIFAR10
We conducted experiment on CIFAR10 which is more challenging model than MNIST in classification fields. ResNet18 which is optimized using SGD on a batch size of 32 with a learning rate of 0.001, momentum of 0.9 is used for the experiment with random seed of 10. Our activation function converges rapidly with respect to the Top-1 accuracy and the Top-5 accuracy in Table3, Table4.
IV Conclusion
The MoLU showed a very good overall performance for the NeuralODEs and MNIST, CIFAR10 evaluated in this paper. The accuracies and converging rates exceeded other activation functions.
Acknowledgements.
This work was supported by grants from the National Research Foundation of Korea (No. 2016R1A3B1908660) to W. Jhe.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Clevert et al. [2016] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by exponential linear units (elus), (2016), ar Xiv:1511.07289 .
- 2Hendrycks and Gimpel [2020] D. Hendrycks and K. Gimpel, Gaussian Error Linear Units (GEL Us) (2020), ar Xiv:1606.08415 .
- 3Kim et al. [2021] S. Kim, W. Ji, S. Deng, Y. Ma, and C. Rackauckas, Stiff Neural Ordinary Differential Equations, Chaos: An Interdisciplinary Journal of Nonlinear Science 31 , 093122 (2021).
- 4Ko et al. [2022] J.-H. Ko, H. Koh, N. Park, and W. Jhe, Homotopy-based training of Neural OD Es for accurate dynamics discovery (2022), ar Xiv:2210.01407 .
- 5He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun, Deep Residual Learning for Image Recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2016) pp. 770–778.
- 6[6] D. Misra, Mish: A Self Regularized Non-Monotonic Neural Activation Function, .
