Mechanistic Insights into Grokking from the Embedding Layer

H.V.AlquBoj; Hilal AlQuabeh; Velibor Bojkovic; Munachiso Nwadike; Kentaro Inui

arXiv:2505.15624·cs.LG·May 22, 2025

TL;DR

This paper shows that embeddings are central to grokking in neural networks: adding them to MLPs induces delayed generalization on modular arithmetic tasks. It identifies two mechanisms, embedding update dynamics and bilinear coupling, and proposes adaptive learning rates to speed up training.

Contribution

The study identifies embeddings as a driver of grokking, analyzes the two underlying mechanisms (embedding update dynamics and bilinear coupling), and introduces adaptive learning rate strategies to speed up convergence in Transformers.

Findings

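01

Embedding update dynamics delay generalization in MLPs: rare tokens stagnate under sparse gradient updates and weight decay.

02

Bilinear coupling between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization.

03

Adaptive, embedding-specific learning rates mitigate the bilinear effects and speed up convergence.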
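
The second and third findings can be illustrated with a toy problem. The sketch below is not the paper's setup: it assumes a single scalar embedding e and a single scalar downstream weight w trained on the loss 0.5 * (e * w - y)^2, and the names grads and steps_to_fit, the learning rates, and the initialization scale are all hypothetical choices made for illustration. At e = w = 0 both gradients vanish while the Hessian has eigenvalues +y and -y, so the origin is a saddle point, and a small initialization starts close to it. Giving the embedding a larger learning rate than the weight is one simple way to leave that region sooner in this toy case.

```python
def grads(e, w, y):
    """Gradients of the toy loss 0.5 * (e * w - y) ** 2 with respect to e and w."""
    residual = e * w - y
    return residual * w, residual * e


def steps_to_fit(lr_embed, lr_weight, y=1.0, init=0.01, tol=1e-6, max_steps=10_000):
    """Run plain gradient descent and count steps until |e * w - y| < tol."""
    e, w = init, init  # small initialization, close to the saddle at the origin
    for step in range(max_steps):
        if abs(e * w - y) < tol:
            return step
        g_e, g_w = grads(e, w, y)
        e -= lr_embed * g_e   # embedding-specific learning rate (assumed value)
        w -= lr_weight * g_w  # learning rate for the downstream weight
    return max_steps


if __name__ == "__main__":
    print("gradient at the origin:", grads(0.0, 0.0, 1.0))
    print("shared learning rate:  ", steps_to_fit(0.1, 0.1), "steps")
    print("larger embedding rate: ", steps_to_fit(0.4, 0.1), "steps")
```

In this toy setting the run with the larger embedding learning rate needs fewer steps, because the escape rate from the saddle grows with the product of the two learning rates. Whether the same ratio helps in a real Transformer is an empirical question that the sketch does not answer.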

Abstract

Grokking, a delayed generalization in neural networks after perfect training performance, has been observed in Transformers and MLPs, but the components driving it remain underexplored. We show that embeddings are central to grokking: introducing them into MLPs induces delayed generalization in modular arithmetic tasks, whereas MLPs without embeddings can generalize immediately. Our analysis identifies two key mechanisms: (1) Embedding update dynamics, where rare tokens stagnate due to sparse gradient updates and weight decay, and (2) Bilinear coupling, where the interaction between embeddings and downstream weights introduces saddle points and increases sensitivity to initialization. To confirm these mechanisms, we investigate frequency-aware sampling, which balances token updates by minimizing gradient variance, and embedding-specific learning rates, derived from the asymmetric…
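
The first mechanism in the abstract, rare tokens stagnating under sparse gradient updates and weight decay, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' experiment: one scalar embedding is pulled toward a hypothetical target value by the loss 0.5 * (e - target)^2, weight decay is assumed to be decoupled and applied on every step, and the data gradient arrives only on the steps where the token appears. The function name train_embedding and all hyperparameter values are invented for the example.

```python
def train_embedding(period, steps=1000, lr=0.1, weight_decay=0.1, target=1.0):
    """Train one scalar embedding whose token appears once every `period` steps."""
    e = 0.0
    for t in range(steps):
        # Decoupled weight decay shrinks the embedding on every step.
        e *= 1.0 - lr * weight_decay
        # The data gradient only arrives when the token is in the batch.
        if t % period == 0:
            e -= lr * (e - target)
    return e


if __name__ == "__main__":
    print("frequent token (every step):    ", train_embedding(period=1))
    print("rare token (every 50th step):   ", train_embedding(period=50))
```

With these assumed settings the frequent token settles close to the target, while the rare token stays far below it because decay undoes most of each update before the next one arrives. Sampling rare tokens more often, as in the frequency-aware sampling the abstract mentions, corresponds to lowering period in this sketch.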

Taxonomy

Topics: Adhesion, Friction, and Surface Interactions · Vibration and Dynamic Analysis

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Softmax