Grokking of Diffusion Models: Case Study on Modular Addition

Joon Hyeok Kim; Yong-Hyun Park; Mattis Dalsætra Østby; Jiatao Gu

arXiv:2604.17673·cs.LG·April 21, 2026

TL;DR

This paper shows that diffusion models trained with flow-matching objectives exhibit grokking (delayed generalization after overfitting) on modular addition, and uses this controlled setting to analyze the models' internal computations.

Contribution

It provides a mechanistic analysis of diffusion models' internal computations during modular addition, highlighting how they bridge symbolic reasoning and pixel-space generation.

Findings

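01  In the single-image regime, the model implements modular addition by composing periodic representations of the individual operands.

02  In the diverse-image regime, iterative sampling splits the task into an arithmetic computation phase followed by a visual denoising phase, separated by a critical timestep threshold.

03  Models trained with flow-matching objectives grok: generalization arrives late, after the model has already overfit.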
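The first finding can be made concrete with a small numerical sketch. It shows how periodic (cosine and sine) encodings of two operands can be combined with the angle-addition identities to yield an encoding of their sum modulo p. This illustrates the general trigonometric idea only; it is not the specific circuit the authors report. The modulus `p=113`, the frequency set, and all function names are assumptions chosen for the example.

```python
import numpy as np

def encode(x, p, freqs):
    """Embed integers x as (cos, sin) features at each frequency; shape (N, F)."""
    angles = 2 * np.pi * np.outer(np.atleast_1d(x), freqs) / p
    return np.cos(angles), np.sin(angles)

def compose(rep_a, rep_b):
    """Combine two periodic encodings via the angle-addition identities."""
    cos_a, sin_a = rep_a
    cos_b, sin_b = rep_b
    cos_sum = cos_a * cos_b - sin_a * sin_b   # cos(A + B)
    sin_sum = sin_a * cos_b + cos_a * sin_b   # sin(A + B)
    return cos_sum, sin_sum

def decode(rep, p, freqs):
    """Pick the residue c whose encoding best matches the composed encoding."""
    cos_s, sin_s = rep
    cos_c, sin_c = encode(np.arange(p), p, freqs)
    # scores[n, c] = sum over frequencies k of cos(2*pi*k*(s_n - c)/p),
    # which is maximal exactly when c == s_n (mod p).
    scores = cos_s @ cos_c.T + sin_s @ sin_c.T
    return scores.argmax(axis=1)

def mod_add(a, b, p=113, freqs=(1, 2, 5)):
    """Compute (a + b) mod p using only periodic features of a and b.

    p and freqs are illustrative assumptions, not values from the paper.
    """
    freqs = np.asarray(freqs)
    return decode(compose(encode(a, p, freqs), encode(b, p, freqs)), p, freqs)
```

For example, `mod_add(100, 50)` returns `array([37])`, matching (100 + 50) mod 113. The sum is never formed as an integer; it emerges from products of per-operand periodic features.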

Abstract

Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking (delayed generalization after overfitting) on modular addition, enabling controlled analysis of their internal computations. We study this phenomenon in two data regimes. In a single-image regime, mechanistic dissection reveals that the model implements modular addition by composing periodic representations of individual operands. In a diverse-image regime with high intraclass variability, we find that the model leverages its iterative sampling process to partition the task into an arithmetic computation phase followed by a visual denoising phase, separated by a critical timestep threshold. Our work provides the mechanistic decomposition of algorithmic learning…
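For readers unfamiliar with flow matching, the following is a minimal sketch of one common form of the objective and of the iterative sampler the abstract refers to. It assumes a linear interpolation path between Gaussian noise and data with t running from 0 (noise) to 1 (data); the abstract does not say which path or time convention the authors use, and `velocity_fn`, `flow_matching_loss`, and `euler_sample` are hypothetical names.

```python
import numpy as np

def _time_shape(x):
    """Shape (N, 1, ..., 1) so a per-sample time broadcasts against x."""
    return (x.shape[0],) + (1,) * (x.ndim - 1)

def flow_matching_loss(velocity_fn, x1, rng):
    """Monte Carlo estimate of a flow-matching loss on a batch of data x1.

    Assumes the linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    whose regression target is the constant velocity x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=_time_shape(x1))
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    return float(np.mean((velocity_fn(xt, t) - target) ** 2))

def euler_sample(velocity_fn, x0, num_steps=50):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(_time_shape(x), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In this sketch each Euler step is one sampling timestep. The abstract's claim is that, in the diverse-image regime, steps before a critical threshold carry out the arithmetic while later steps handle visual denoising; the code above does not model that split.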
