On the Mechanism and Dynamics of Modular Addition: Fourier Features, Lottery Ticket, and Grokking

Jianliang He; Leda Wang; Siyu Chen; Zhuoran Yang

arXiv:2602.16849·cs.LG·February 20, 2026


Open Access · 3 Reviews

TL;DR

This paper analyzes how two-layer neural networks learn modular addition, revealing the mechanisms of feature combination, phase symmetry, and frequency diversification, and explaining phenomena like grokking through a comprehensive theoretical framework.

Contribution

It provides a full mechanistic and theoretical explanation of feature learning, training dynamics, and generalization in neural networks solving modular addition, including the lottery ticket hypothesis and grokking.

Findings

01

Neurons learn Fourier features with phase alignment and frequency diversification.

02

Phase symmetry enables majority voting to cancel noise and identify correct sums.

03

Grokking is characterized as a three-stage process involving memorization and two generalization phases.
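The noise cancellation in finding 02 can be illustrated with a single identity. This is a sketch under assumptions not stated on this page: a quadratic activation, cosine-parametrized weights, and $N$ equally spaced phases per frequency. The symbols $\omega$, $\phi_n$, $N$ and $h_\phi$ are introduced here for illustration only.

```latex
% One neuron with frequency \omega = 2\pi k / p, input phase \phi, output phase 2\phi
% (phase alignment), and an assumed quadratic activation:
\[
h_\phi(x,y) = \bigl(\cos(\omega x + \phi) + \cos(\omega y + \phi)\bigr)^2
            = 4\cos^2\!\Bigl(\tfrac{\omega (x-y)}{2}\Bigr)
               \cos^2\!\Bigl(\tfrac{\omega (x+y)}{2} + \phi\Bigr).
\]
% Averaging over symmetric phases \phi_n = 2\pi n / N with N \notin \{1, 2, 4\}
% cancels every phase-dependent term:
\[
\frac{1}{N}\sum_{n=0}^{N-1} h_{\phi_n}(x,y)\,\cos(\omega j + 2\phi_n)
  = \cos^2\!\Bigl(\tfrac{\omega (x-y)}{2}\Bigr)\,
    \cos\bigl(\omega (x + y - j)\bigr).
\]
```

The surviving term peaks at $j \equiv x + y \pmod p$ for every frequency, so votes from different frequencies reinforce each other only at the correct sum.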

Abstract

We present a comprehensive analysis of how two-layer neural networks learn features to solve the modular addition task. Our work provides a full mechanistic interpretation of the learned model and a theoretical explanation of its training dynamics. While prior work has identified that individual neurons learn single-frequency Fourier features and phase alignment, it does not fully explain how these features combine into a global solution. We bridge this gap by formalizing a diversification condition that emerges during training when the network is overparametrized. The condition consists of two parts: phase symmetry and frequency diversification. We prove that these properties allow the network to collectively approximate a flawed indicator function on the correct logic for the modular addition task. While individual neurons produce noisy signals, the phase symmetry enables a majority-voting scheme that cancels…
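The mechanism the abstract describes can be checked numerically with a hand-built network. The following is a minimal sketch, not the authors' code or trained model: it assumes a quadratic activation, inputs entering as the sum of two per-token weight columns, cosine-parametrized weights, and 8 equally spaced phases per frequency. The function names `build_network`, `logits`, and `accuracy` are hypothetical.

```python
import numpy as np


def build_network(p: int, n_phases: int):
    """Hand-set weights: one neuron per (frequency, phase) pair.

    Assumption (illustrative): input weights cos(w*t + phi) and output
    weights cos(w*t + 2*phi), i.e. the output phase is twice the input phase.
    """
    freqs = np.arange(1, (p - 1) // 2 + 1)                 # frequency diversification
    phases = 2 * np.pi * np.arange(n_phases) / n_phases   # phase symmetry
    k, phi = np.meshgrid(freqs, phases, indexing="ij")
    k, phi = k.ravel(), phi.ravel()                        # one entry per neuron
    tokens = np.arange(p)
    angle = 2 * np.pi * k[:, None] * tokens[None, :] / p   # shape (neurons, p)
    w_in = np.cos(angle + phi[:, None])
    w_out = np.cos(angle + 2 * phi[:, None])
    return w_in, w_out


def logits(w_in, w_out, x: int, y: int):
    """Forward pass with an assumed quadratic activation."""
    pre = w_in[:, x] + w_in[:, y]   # pre-activation of every neuron
    return (pre ** 2) @ w_out       # one logit per candidate residue


def accuracy(w_in, w_out, p: int) -> float:
    """Fraction of input pairs whose argmax logit equals (x + y) mod p."""
    hits = sum(
        int(np.argmax(logits(w_in, w_out, x, y)) == (x + y) % p)
        for x in range(p)
        for y in range(p)
    )
    return hits / (p * p)


if __name__ == "__main__":
    p = 23
    w_in, w_out = build_network(p, n_phases=8)
    print("accuracy:", accuracy(w_in, w_out, p))
    # The three largest logits for (3, 5) show the "flawed" indicator:
    # besides the correct sum, smaller peaks appear at 2x and 2y.
    print("top-3 residues:", np.argsort(logits(w_in, w_out, 3, 5))[-3:])
```

In this construction the logits have secondary peaks at `2x mod p` and `2y mod p` in addition to the main peak at `(x + y) mod p`, which is one way to read the "flawed indicator function" wording. The phase count is chosen so that it is not 1, 2, or 4; for those values the phase-dependent terms do not all cancel.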

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 5

Strengths

- This work addresses a question that the field currently considers to be of high importance: why do deep neural networks learn the features they learn on modular addition?

Weaknesses

*I am concerned with the paper overclaiming its novelty, particularly with respect to their claimed mechanistic interpretation*. This paper claims multiple results are novel, but I know some were done by other published papers. Furthermore, some results claimed as novel are in disagreement with results from other published papers. Thus, there are significant issues with this paper: 1. At least five prior works of high relevance aren't cited, which leads to 2. 2. There are **multiple cla

Reviewer 02 · Rating 0 · Confidence 3

Strengths

The paper explores an interesting set of questions. Highlights potentially interesting findings. Work around activation functions is intriguing.

Weaknesses

The paper introduces the concepts `phase alignment, where a neuron’s output phase is twice its input phase, and phase symmetry, where phases are uniformly distributed among neurons sharing the same frequency.` However, before this introduction, the term `phase` is not concretely defined in this context, which makes this section hard to parse. The paper states on line `136` that `We begin with the most striking observation: a global trigonometric pattern in parameters that consistently emer

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. The paper adds interesting observations about the grokking phenomenon in the context of the (a + b) mod 23 task, which provide a nice insight into grokking on this dataset. 2. The combination of extensive theoretical results and supporting empirical results strengthens the paper's findings; however, some of the observations provided by the paper are corroborations of previous findings rather than novel insights (see weaknesses below). 3. The notion of the majority voting scheme is an interestin

Weaknesses

1. **Lottery Ticket Mechanism**: In the paper, observation 6 is positioned as a novel observation; however, prior work, namely [1], [2] explicitly mentions the role of internal structure at initialisation, via the Lottery Ticket Hypothesis (LTH) [3], being a primary factor in grokking. Furthermore, [2] even goes on to show that particular 'grokking tickets' reduce the time for generalisation to occur. I think that the 'Lottery Ticket Mechanism' you observe should be positioned as corroborating


Taxonomy

Topics: Neural Networks and Reservoir Computing · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science