Understanding Transformers via N-gram Statistics
Timothy Nguyen

TL;DR
This paper studies how well simple N-gram statistics of the training data approximate the predictions of transformer-based language models, yielding insights into training dynamics, a way to detect overfitting, and the conditions under which predictions follow N-gram rules.
Contribution
It introduces a method to approximate transformers with N-gram rules, providing new tools for understanding training progress and model predictions.
Findings
For 79% of next-token distributions on TinyStories and 68% on Wikipedia, the transformer's top-1 prediction agrees with the one given by the N-gram rulesets.
A new method to detect overfitting without a holdout set is proposed.
Over the course of training, transformers progress from learning simple N-gram rules to more complex ones.
Abstract
Transformer-based large language models (LLMs) display extreme proficiency with language, yet a precise understanding of how they work remains elusive. One way of demystifying transformer predictions would be to describe how they depend on their context in terms of simple template functions. This paper takes a first step in this direction by considering families of functions (i.e., rules) formed out of simple N-gram based statistics of the training data. By studying how well these rulesets approximate transformer predictions, we obtain a variety of novel discoveries: a simple method to detect overfitting during training without using a holdout set, a quantitative measure of how transformers progress from learning simple to more complex statistical rules over the course of training, a model-variance criterion governing when transformer predictions tend to be described by N-gram rules, and…
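To make the idea of an N-gram rule concrete, the following is a minimal sketch, not the paper's implementation. It assumes the simplest possible rule (predict the token that most often follows the last n-1 context tokens in the training data) and measures how often that prediction matches a model's top-1 token. The function names and the `model_top1` list, which stands in for a transformer's argmax predictions, are hypothetical; the paper's rulesets are richer families of rules than this single fixed-length suffix rule.

```python
from collections import Counter, defaultdict


def build_ngram_counts(tokens, n):
    """Count next-token occurrences for every (n-1)-token context in the training data."""
    counts = defaultdict(Counter)
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - n + 1:i])
        counts[context][tokens[i]] += 1
    return counts


def ngram_top1(counts, context, n):
    """Most frequent next token after the last n-1 tokens of `context`, or None if unseen."""
    key = tuple(context[-(n - 1):]) if n > 1 else ()
    followers = counts.get(key)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def top1_agreement(counts, contexts, model_top1, n):
    """Fraction of contexts where the rule's top-1 token equals the model's top-1 token."""
    hits = sum(ngram_top1(counts, c, n) == m for c, m in zip(contexts, model_top1))
    return hits / len(contexts)


# Toy training data; a real study would use the model's tokenized training corpus.
train_tokens = "the cat sat on the mat the cat ran".split()
bigram_counts = build_ngram_counts(train_tokens, n=2)

# Hypothetical evaluation contexts and stand-in model predictions (assumed, not real outputs).
contexts = [["on", "the"], ["sat", "on"], ["a", "dog"]]
model_top1 = ["cat", "the", "barked"]

agreement = top1_agreement(bigram_counts, contexts, model_top1, n=2)
```

In this toy setup the bigram rule matches the stand-in predictions on the first two contexts and has no prediction for the third, whose context never appears in the training data.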
Taxonomy
Topics: Natural Language Processing Techniques
