Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

James A. Michaelov; Roger P. Levy; Benjamin K. Bergen

arXiv:2510.24963·cs.CL·October 30, 2025

TL;DR

This study shows that autoregressive language models follow a consistent behavioral trajectory during pretraining across the architectures, training datasets, and scales tested, and that most of this behavior can be explained by simple statistical heuristics.

Contribution

It shows that language model behavior over the course of training can be largely explained by three heuristics: unigram probability, n-gram probability, and the semantic similarity between a word and its context. The same pattern holds across the models studied.

Findings

01

Behavioral phases are consistent across architectures, datasets, and scales.

02

Up to 98% of the word-level variance in model behavior is explained by three simple heuristics.

03

Models' predicted word probabilities overfit to n-gram probabilities for increasing n as training progresses.

Abstract

We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the $n$-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' $n$-gram probabilities for increasing $n$ over the course of training.…
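One way to read "variance explained by three heuristics" is as the R² of a regression that predicts a model's word-level log-probabilities from the three predictors. The sketch below illustrates that reading with ordinary least squares. It is not necessarily the authors' procedure: the helper `r_squared`, the variable names, and all of the numbers are assumptions, and the data are synthetic values generated only so the example runs.

```python
import numpy as np

def r_squared(features, target):
    """Fit ordinary least squares with an intercept and return R^2."""
    X = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return 1.0 - ss_res / ss_tot

# Synthetic stand-ins for per-word predictors (NOT data from the paper).
rng = np.random.default_rng(0)
n_tokens = 1000
unigram_logp = rng.normal(-8.0, 2.0, n_tokens)        # log unigram probability
ngram_logp = 0.5 * unigram_logp + rng.normal(0.0, 1.5, n_tokens)  # log n-gram probability
similarity = rng.uniform(0.0, 1.0, n_tokens)          # word-context similarity

# Synthetic "model" log-probabilities for one hypothetical checkpoint.
model_logp = (0.3 * unigram_logp + 0.6 * ngram_logp
              + 2.0 * similarity + rng.normal(0.0, 0.5, n_tokens))

# R^2 using all three heuristics versus the unigram predictor alone.
all_features = np.column_stack([unigram_logp, ngram_logp, similarity])
r2_all = r_squared(all_features, model_logp)
r2_unigram = r_squared(unigram_logp.reshape(-1, 1), model_logp)

print(f"R^2, all three heuristics: {r2_all:.3f}")
print(f"R^2, unigram only:         {r2_unigram:.3f}")
```

Repeating such a fit for each checkpoint would give a curve of variance explained over training, which is the kind of quantity the abstract summarizes with "up to 98%".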
