How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

Shawn Im; Changdae Oh; Zhen Fang; Sharon Li

arXiv:2601.19208·cs.CL·May 14, 2026

TL;DR

This paper investigates how transformers learn semantic associations, such as the link between "bird" and "flew", by analyzing training dynamics and deriving closed-form expressions for the weights that explain how these associations emerge.

Contribution

The paper introduces a leading-term approximation of the gradients to explain how semantic associations develop in transformers, and it provides a mechanistic, interpretable account of this process.

Findings

01. The paper derives closed-form expressions for transformer weights at early stages of training.

02. The weights are compositions of bigram, token-interchangeability, and context functions.

03. The theoretical weight expressions closely match the weights learned by real-world LLMs.

Abstract

Semantic associations such as the link between "bird" and "flew" are foundational for language modeling as they enable models to go beyond memorization and instead generalize and generate coherent text. Understanding how these associations are learned and represented in language models is essential for connecting deep learning with linguistic theory and developing a mechanistic foundation for large language models. In this work, we analyze how these associations emerge from natural language data in attention-based language models through the lens of training dynamics. By leveraging a leading-term approximation of the gradients, we develop closed-form expressions for the weights at early stages of training that explain how semantic associations first take shape. Through our analysis, we reveal that each set of weights of the transformer has closed-form expressions as simple compositions…
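To give a feel for how looking only at the leading term of the gradient can produce a closed-form, bigram-shaped weight, here is a minimal sketch. It is not the paper's derivation: it assumes a hypothetical toy model with no attention, in which the logits for the next token are simply the row `W[current_token]`, the weights start at zero, and a single full-batch gradient-descent step is taken on the cross-entropy loss. The function name `bigram_first_step`, the toy corpus, and the learning rate are all illustrative assumptions.

```python
from collections import Counter

def bigram_first_step(tokens, vocab, lr=1.0):
    """One gradient-descent step from W = 0 for a toy next-token model.

    Assumed model (not from the paper): logits for the next token are the
    row W[current_token]; the loss is the mean cross-entropy over all
    adjacent token pairs in `tokens`.
    """
    V = len(vocab)
    n_pairs = len(tokens) - 1
    pair_counts = Counter(zip(tokens, tokens[1:]))  # count(a, b)
    ctx_counts = Counter(tokens[:-1])               # count(a) as a context
    W = {}
    for a in vocab:
        for b in vocab:
            # At W = 0 the softmax is uniform (1/V), so
            # dL/dW[a][b] = (count(a) / V - count(a, b)) / n_pairs.
            grad = (ctx_counts[a] / V - pair_counts[(a, b)]) / n_pairs
            W[(a, b)] = -lr * grad
    return W

# Toy corpus (illustrative only).
corpus = "the bird flew the bird sang the plane flew".split()
vocab = sorted(set(corpus))
W = bigram_first_step(corpus, vocab)

# Pairs seen more often than chance get positive weight, others negative.
print(W[("bird", "flew")], W[("bird", "the")])
```

Under these assumptions the weight after one step is proportional to P(a) * (P(b | a) - 1/V), a closed-form function of bigram statistics: the entry for ("bird", "flew") comes out positive because "flew" follows "bird" more often than a uniform guess would predict, while the entry for ("bird", "the") comes out negative. The paper's analysis concerns attention-based models and additional token-interchangeability and context terms, which this sketch does not attempt to reproduce.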

Videos

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability (SlidesLive)