JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Du

TL;DR
JoMA is a mathematical framework for analyzing multilayer Transformer training. It integrates out the self-attention layer, leaving a modified dynamics of the MLP layers only. With nonlinear activations it predicts that attention first becomes sparse and then dense during training, and it explains how token hierarchies form.
Contribution
JoMA provides a novel analytical approach that removes unrealistic assumptions made in previous analyses (e.g., lack of residual connections) and models both attention dynamics and token hierarchy formation in multilayer Transformers.
Findings
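With nonlinear activations, attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) during training; in the linear case it becomes sparse over time.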
JoMA's predictions align with empirical observations on real-world datasets.
The framework explains token hierarchy formation in Transformers.
Abstract
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analyses (e.g., lack of residual connection) and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works that show attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world dataset…
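The sparse-then-dense prediction is a statement about how concentrated each attention distribution is. One common way to quantify this, which is an assumption here and not necessarily the metric used in the paper, is the Shannon entropy of the softmax attention weights: low entropy means sparse attention, high entropy means dense attention. The sketch below uses hypothetical logits for a single query over four keys; the names `softmax`, `attention_entropy`, and the three example logit vectors are illustrative only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits):
    """Shannon entropy (in nats) of the softmax attention for one query.

    Low entropy: attention mass sits on few keys (sparse).
    High entropy: attention mass is spread over many keys (dense).
    """
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical attention logits for one query over four keys.
# These values are made up for illustration, not taken from the paper.
uniform = [0.0, 0.0, 0.0, 0.0]   # no preference among keys
peaked = [8.0, 0.0, 0.0, 0.0]    # one salient key dominates (sparse)
spread = [2.0, 1.5, 1.0, 0.5]    # mass shared across keys (denser)

for name, logits in [("uniform", uniform), ("peaked", peaked), ("spread", spread)]:
    print(name, round(attention_entropy(logits), 3))
```

Logging this quantity per head at each checkpoint would show whether entropy dips and then rises over training, which is the qualitative pattern the abstract describes for nonlinear activations.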
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Evolutionary Algorithms and Applications · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Adam · Residual Connection · Layer Normalization · Softmax
