Parallel Layer Normalization for Universal Approximation
Yunhao Ni, Yuxin Guo, Yuhe Liu, Wenxin Sun, Jie Luo, Wenjun Wu, Lei Huang

TL;DR
This paper proves that networks with parallel layer normalization (PLN) placed between two linear layers are universal approximators, whereas networks that use only standard LN have strictly limited expressive power. The theory is complemented by approximation-rate analysis and empirical experiments.
Contribution
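It introduces PLN-Nets (two linear layers with parallel layer normalizations between them), proves that they achieve universal approximation, and extends the analysis to RMSNorm and to position-wise feed-forward networks, the building blocks of RNNs and Transformers.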
Findings
PLN-Nets achieve universal approximation.
Approximation rates are analyzed for both shallow and deep PLN-Nets, including in Sobolev norms.
Empirical evidence supports PLN-Nets' potential.
Abstract
This paper studies the approximation capabilities of neural networks that combine layer normalization (LN) with linear layers. We prove that networks consisting of two linear layers with parallel layer normalizations (PLNs) inserted between them (referred to as PLN-Nets) achieve universal approximation, whereas architectures that use only standard LN exhibit strictly limited expressive power. We further analyze approximation rates of shallow and deep PLN-Nets under the $L^\infty$ norm as well as in Sobolev norms. Our analysis extends beyond LN to RMSNorm, and from standard MLPs to position-wise feed-forward networks, the core building blocks used in RNNs and Transformers. Finally, we provide empirical experiments to explore further potential of PLN-Nets.
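The architecture described in the abstract can be illustrated with a minimal NumPy sketch. It assumes that "parallel layer normalization" means splitting the hidden units into equal-sized groups and applying LN to each group independently; the abstract does not spell this out, so the grouping scheme, the `group_size` parameter, the omission of learnable scale and shift, and the function names `parallel_layer_norm` and `pln_net` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def parallel_layer_norm(h, group_size, eps=1e-5):
    """Normalize each contiguous group of `group_size` hidden units separately.

    Assumption: PLN = several independent LNs applied side by side,
    one per group of hidden units (no learnable scale/shift here).
    """
    batch, width = h.shape
    if width % group_size != 0:
        raise ValueError("hidden width must be divisible by group_size")
    groups = h.reshape(batch, width // group_size, group_size)
    mean = groups.mean(axis=-1, keepdims=True)
    var = groups.var(axis=-1, keepdims=True)
    normalized = (groups - mean) / np.sqrt(var + eps)
    return normalized.reshape(batch, width)

def pln_net(x, w1, b1, w2, b2, group_size):
    """Linear layer -> parallel LN -> linear layer (hypothetical PLN-Net sketch)."""
    hidden = x @ w1 + b1                              # first linear layer
    hidden = parallel_layer_norm(hidden, group_size)  # the only nonlinearity
    return hidden @ w2 + b2                           # second linear layer

# Toy usage with random weights (illustrative shapes only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w1, b1 = rng.normal(size=(3, 8)), rng.normal(size=8)
w2, b2 = rng.normal(size=(8, 2)), np.zeros(2)
y = pln_net(x, w1, b1, w2, b2, group_size=2)
```

In this sketch the normalization is the only source of nonlinearity between the two linear layers. Setting `group_size` equal to the full hidden width would recover a single standard LN over all hidden units, the setting the abstract describes as having strictly limited expressive power.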
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Advanced Graph Neural Networks
Methods: Layer Normalization
