JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention

Yuandong Tian; Yiping Wang; Zhenyu Zhang; Beidi Chen; Simon Du

arXiv:2310.00535·cs.LG·March 18, 2024


TL;DR

JoMA is a mathematical framework that analyzes multilayer Transformer training by integrating out the self-attention layer, leaving a modified dynamics of the MLP layers only. It predicts that, with nonlinear activations, attention first becomes sparse and then dense during training, and it qualitatively explains how token hierarchies form.

Contribution

JoMA removes unrealistic assumptions made in previous analyses (e.g., lack of residual connection) and characterizes attention dynamics and token hierarchy formation in multilayer Transformers.

Findings

01

With nonlinear activations, attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) during training.

02

JoMA's predictions align with empirical observations on real-world datasets.

03

The framework explains token hierarchy formation in Transformers.

Abstract

We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework to understand the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of MLP layers only. JoMA removes unrealistic assumptions in previous analysis (e.g., lack of residual connection) and predicts that the attention first becomes sparse (to learn salient tokens), then dense (to learn less salient tokens) in the presence of nonlinear activations, while in the linear case, it is consistent with existing works that show attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers, when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from real-world dataset…
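The sparse-versus-dense distinction in the abstract can be made concrete with a simple proxy. The sketch below is not taken from the paper or its repository; it assumes that the mean Shannon entropy of softmax attention rows is a reasonable measure of sparsity (low entropy means attention concentrates on few tokens, high entropy means it spreads out). The helper names `softmax` and `attention_entropy`, the random logits, and the logit scaling used to mimic sharper attention are all hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=axis, keepdims=True)

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (in nats) over the rows of an attention matrix.

    Assumed sparsity proxy: values near 0 indicate sparse attention,
    values near log(n_keys) indicate dense (near-uniform) attention.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return float(row_entropy.mean())

# Hypothetical attention logits for 8 query tokens over 8 key tokens.
rng = np.random.default_rng(0)
n_tokens = 8
logits = rng.normal(size=(n_tokens, n_tokens))

# Scaling the logits is only a stand-in for attention sharpening;
# it does not simulate the JoMA dynamics themselves.
for scale in (0.0, 1.0, 10.0):
    attn = softmax(scale * logits)
    print(f"scale={scale:5.1f}  mean entropy={attention_entropy(attn):.3f}")
```

Applied to attention matrices saved at successive training checkpoints, a metric like this would be expected to fall and then rise if attention follows the sparse-then-dense pattern described above.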


Code & Models

Repositories

facebookresearch/luckmatters
pytorch · Official

Videos

JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention · SlidesLive

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Evolutionary Algorithms and Applications · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Label Smoothing · Absolute Position Encodings · Adam · Residual Connection · Layer Normalization · Softmax