Explaining grokking through circuit efficiency

Vikrant Varma; Rohin Shah; Zachary Kenton; János Kramár; Ramana Kumar

arXiv:2309.02390 · cs.LG · September 6, 2023 · 2 cites

Open Access · 3 Reviews

TL;DR

This paper explains the grokking phenomenon in neural networks by proposing that it results from the interplay of more efficient generalising circuits and less efficient memorising circuits, with implications for dataset size and training dynamics.

Contribution

It introduces a novel explanation for grokking based on circuit efficiency, predicts new behaviors like ungrokking and semi-grokking, and provides empirical evidence supporting this theory.

Findings

1. Grokking occurs when generalising circuits are more efficient than memorising ones.
2. A critical dataset size exists at which memorisation and generalisation are equally efficient.
3. Two newly discovered phenomena are ungrokking and semi-grokking.

Abstract

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network…
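
The efficiency argument in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's model: the two norm functions, their constants, and the names `generalising_norm`, `memorising_norm`, `efficiency`, and `critical_dataset_size` are all hypothetical, chosen only to show how a crossover dataset size could arise if memorisation needs more parameter norm as the dataset grows while generalisation does not.

```python
from typing import Callable, Optional

# Hypothetical toy model (an assumption, not taken from the paper): each
# function returns the parameter norm a circuit needs to reach a given logit.

def generalising_norm(target_logit: float, dataset_size: int) -> float:
    # Assumed: cost per unit of logit is constant, independent of dataset size.
    return 4.0 * target_logit

def memorising_norm(target_logit: float, dataset_size: int) -> float:
    # Assumed: cost grows linearly with the number of examples to memorise.
    return (dataset_size / 100.0) * target_logit

def efficiency(norm_fn: Callable[[float, int], float],
               dataset_size: int,
               target_logit: float = 1.0) -> float:
    # Efficiency here means logit size obtained per unit of parameter norm.
    return target_logit / norm_fn(target_logit, dataset_size)

def critical_dataset_size(max_size: int = 10_000,
                          target_logit: float = 1.0) -> Optional[int]:
    # Smallest dataset size at which memorising is no longer more efficient
    # than generalising; None if no crossover occurs up to max_size.
    for d in range(1, max_size + 1):
        mem = efficiency(memorising_norm, d, target_logit)
        gen = efficiency(generalising_norm, d, target_logit)
        if mem <= gen:
            return d
    return None

if __name__ == "__main__":
    print(critical_dataset_size())
```

Under these assumed constants the crossover falls at a dataset size of 400: below it the memorising circuit is the more efficient one, above it the generalising circuit is. The specific number carries no meaning beyond the toy.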

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The grokking phenomenon being studied in this paper is very puzzling and important.
2. This paper provides simple and intuitive arguments that can partially explain grokking.
3. The importance of the three ingredients and dataset size is validated by experiments.
4. The explanation provided in the paper also leads to the discovery of the "ungrokking" and "semi-grokking" phenomena.

Weaknesses

1. Although the paper claims that they provide a "theory" for grokking, there are no real theorems in the main paper. Many key concepts, such as circuit efficiency, are not defined with formal math, either. I encourage the authors to spend more effort to formulate and present their intuitive arguments with rigorous math.
2. Although the explanation provided by the paper seems intuitive, several key puzzles are still left unexplained, even if we follow the authors' argument with 3 ingredients. Th

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

* The paper is well written and easy to follow.
* The story is in general sound and nicely supported by empirical results
* Enrich the literature of grokking by discovering semi-grokking and un-grokking

Weaknesses

* Although I find the general story to be believable, some details are either incomplete or could have alternative explanations. See the question part.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

- The writing is clear and well-structured and the authors laid out an interesting story explaining grokking.
- The paper tackles a very interesting phenomenon that can shed light on the dynamics of representation learning.
- The experiments are clean, and the visualizations are informative.

Weaknesses

- On the empirical side, there are few results beyond modular arithmetic.
- On the theoretical side, there is a focus on the phenomenological observation that generalizing circuits are learned at a different speed compared to memorizing circuits, but the theory offers no explanation as to why they are slow in the first place. I think the real question is not whether or not the generalizing circuit is slower but rather *why* it is slower in this particular way, i.e., why does the model generaliz

Taxonomy

Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications