A distributional simplicity bias in the learning dynamics of transformers

Riccardo Rende, Federica Gerace, Alessandro Laio, and Sebastian Goldt

arXiv:2410.19637 · cs.CL · October 2, 2025

TL;DR

This paper shows that transformers trained on natural language data exhibit a simplicity bias: they learn low-order interactions among tokens first and only later progress to higher-order interactions. This kind of bias has been invoked to explain why over-parameterised networks generalise well.

Contribution

The study introduces a method for generating clones of a data set that capture token interactions only up to a chosen order, and uses these clones to show that transformers learn interactions of increasing order sequentially.
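To make the idea of a clone that keeps interactions "up to a certain order" concrete, here is a minimal toy sketch. It is not the paper's procedure, which this page does not describe. As an assumption, it uses plain n-gram statistics as a stand-in: `unigram_clone` keeps only single-token frequencies, while `bigram_clone` also keeps adjacent-pair statistics. Both function names, the circular treatment of the corpus, and the example sentence are hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def unigram_clone(tokens, length, seed=0):
    """Order-1 toy clone: sample tokens independently from empirical frequencies."""
    rng = random.Random(seed)
    counts = Counter(tokens)
    vocab = sorted(counts)  # fixed ordering keeps sampling reproducible
    weights = [counts[t] for t in vocab]
    return rng.choices(vocab, weights=weights, k=length)


def bigram_clone(tokens, length, seed=0):
    """Order-2 toy clone: sample a Markov chain fitted to adjacent-token counts.

    Assumption: the corpus is treated as circular (the last token is followed
    by the first), so every token has at least one successor.
    """
    rng = random.Random(seed)
    successors = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:] + tokens[:1]):
        successors[prev][nxt] += 1

    current = tokens[0]
    out = [current]
    while len(out) < length:
        options = successors[current]
        vocab = sorted(options)
        weights = [options[t] for t in vocab]
        current = rng.choices(vocab, weights=weights, k=1)[0]
        out.append(current)
    return out[:length]


corpus = "the cat sat on the mat and the dog sat on the rug".split()
print(unigram_clone(corpus, 8))  # word frequencies preserved, word order lost
print(bigram_clone(corpus, 8))   # every adjacent pair also occurs in the corpus
```

In this toy setting, a model evaluated on the unigram clone can only be rewarded for knowing token frequencies, while the bigram clone additionally rewards knowledge of pairwise structure. Comparing a model's error across such clones over training time is the kind of measurement the findings below refer to.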

Findings

01 Transformers learn low-order interactions first.

02 The prediction error for low-order interactions reaches a saturation point.

03 Higher-order interactions continue to be learned after the low-order error has saturated.

Abstract

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a “simplicity bias”: neural networks prevent overfitting by initially learning simple classifiers before progressing to more complex, non-linear functions. While simplicity biases have been described theoretically and experimentally in feed-forward networks for supervised learning, the extent to which they also explain the remarkable success of transformers trained with self-supervised techniques remains unclear. In our study, we demonstrate that transformers, trained on natural language data, also display a simplicity bias. Specifically, they sequentially learn many-body interactions among input tokens, reaching a saturation point in the prediction error for low-degree interactions while continuing to learn high-degree interactions. To conduct this analysis, we…

Videos

A distributional simplicity bias in the learning dynamics of transformers · SlidesLive

Taxonomy

Topics: Neural Networks and Applications