Provable Benefits of Sinusoidal Activation for Modular Addition
Tianlong Huang, Zhiyuan Li

TL;DR
This paper demonstrates that sinusoidal activation functions enable neural networks to efficiently learn modular addition with minimal width, outperforming ReLU networks in expressivity and generalization, supported by theoretical bounds and empirical validation.
Contribution
It introduces the first sharp expressivity gap showing sine networks' advantages in modular addition, along with novel generalization bounds and empirical evidence of superior performance.
Findings
Sine MLPs can exactly realize modular addition with width 2.
ReLU networks require width scaling linearly with input length.
Sine networks exhibit better generalization and length extrapolation.
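The width-2 claim can be made concrete with a small sketch. It assumes the input is a bag-of-tokens count vector over residues 0..p-1, the setup the reviews describe, and it uses a bias in the hidden layer. The weights are one natural choice that encodes the running sum as an angle on the unit circle; they are an illustration, not necessarily the construction the paper uses. The function names (`count_vector`, `sine_mlp_logits`, `predict`) are hypothetical.

```python
import numpy as np

def count_vector(tokens, p):
    # Bag-of-tokens input: x[k] = number of times residue k appears.
    return np.bincount(np.asarray(tokens), minlength=p).astype(np.float64)

def sine_mlp_logits(x, p):
    # Assumed weights: w[k] = 2*pi*k/p, so w @ x = theta = 2*pi*(sum of tokens)/p.
    w = 2.0 * np.pi * np.arange(p) / p
    W1 = np.stack([w, w])                    # shape (2, p): two hidden units
    b1 = np.array([0.0, np.pi / 2.0])        # bias turns the second sine into a cosine
    h = np.sin(W1 @ x + b1)                  # h = [sin(theta), cos(theta)]
    # Output row c is [sin(2*pi*c/p), cos(2*pi*c/p)], so logit c = cos(theta - 2*pi*c/p).
    W2 = np.stack([np.sin(w), np.cos(w)], axis=1)   # shape (p, 2)
    return W2 @ h

def predict(tokens, p):
    # The logit cos(theta - 2*pi*c/p) is largest exactly when c = (sum of tokens) mod p.
    return int(np.argmax(sine_mlp_logits(count_vector(tokens, p), p)))
```

Because the hidden layer depends on the input only through the angle theta, the same weights work for every input length; no parameter refers to the number of tokens.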
Abstract
This paper studies the role of activation functions in learning modular addition with two-layer neural networks. We first establish a sharp expressivity gap: sine MLPs admit width-2 exact realizations for any fixed length $m$ and, with bias, width-2 exact realizations uniformly over all lengths. In contrast, the width of ReLU networks must scale linearly with $m$ to interpolate, and they cannot simultaneously fit two lengths with different residues modulo $p$. We then provide a novel Natarajan-dimension generalization bound for sine networks, yielding nearly optimal sample complexity for ERM over constant-width sine networks. We also derive width-independent, margin-based generalization for sine networks in the overparametrized regime and validate it. Empirically, sine networks generalize consistently better than ReLU networks across regimes and exhibit…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Strong theoretical framework with novel contributions. The paper makes rigorous theoretical contributions across multiple fronts. The Natarajan-dimension analysis (Theorem 5.9) is novel and broadly applicable, covering piecewise-polynomial, trigonometric-polynomial, and rational-exponential activations in a unified framework through elegant pairwise reduction techniques. The margin-based analysis for sine networks (Theorem 6.2) elegantly exploits the natural geometric properties of sinusoidal
1. Incomplete comparison with ReLU network. The ReLU margin bound in Theorem 6.3 requires extraordinarily stringent conditions that may not be achievable in practice. Specifically, the theorem requires width $d \ge \frac{64\,p^{m}}{m^{2}+2}\,4.67^{m}$ and normalized margin $\gamma_{\theta,\mathrm{ReLU}}=\Omega\!\left(\frac{1}{\sqrt{p}}\cdot\frac{1}{m^{1.5m+2.5}\,6.34^{m}}\right)$, which involves exponential dependence on $m$ that becomes prohibitively large even for moderate values. For example, wit
I believe that analyzing activation functions with periodic properties for periodic inputs is fair, natural, and meaningful research. The results presented in the paper appear to be novel in terms of originality, and the rigor of the proofs seems sufficient based on what I have seen.
1. There are several issues with the presentation of this paper. The meanings of the theorems and the logical flow of the proofs in both the main text and the appendix are not adequately explained. For example, Theorem 5.9 is obtained by appropriately bounding the presented Natarajan dimension and substituting it into an existing theorem. However, readers cannot find this logical flow in the main text before reading the proof. In addition, the appendix provides insufficient high-level explanatio
The paper presents clear and elegant theoretical results—the constructive proof that a width-2 sine MLP can exactly compute modular addition is both strong and intuitive. The generalization analysis is technically solid: the Natarajan-dimension and margin results are well-executed and connect naturally to established learning-theory tools. The empirical findings are also consistent with the theory, as the experiments replicate the same assumptions (shared embeddings, identical optimizers and dat
- The experimental and modeling setup is fairly restricted. The model operates only on bag-of-tokens count vectors without sequence order and is limited to a two-layer MLP. This design isolates periodicity effectively but omits much of the structure that real-world models, such as Transformers, rely on. As a result, it remains unclear how the theoretical advantages would carry over when tokens are contextualized or order-dependent.
- The study also focuses exclusively on modular addition. While
