What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis
Weronika Ormaniec, Felix Dangel, Sidak Pal Singh

TL;DR
This paper provides a theoretical analysis of Transformer architectures by deriving and characterizing their Hessian, revealing how their unique non-linear dependencies influence their optimization landscape and distinguish them from classical neural networks.
Contribution
It fully derives the Hessian of a single self-attention layer and analyzes it, highlighting structural differences and data dependencies that help explain why Transformers pose distinct optimization challenges.
Findings
Transformers have highly non-linear, data-dependent Hessians.
Structural differences in the Hessian distinguish Transformers from classical networks.
These differences impact the optimization landscape and training dynamics.
Abstract
The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning--to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures--grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian…
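As a rough illustration of the object the abstract refers to, the sketch below numerically estimates the loss Hessian of a single softmax self-attention layer with respect to its query weights. It is not the paper's derivation. The squared-error loss, the toy sizes, the names `attention_loss` and `numerical_hessian`, and the finite-difference scheme are all assumptions chosen for brevity.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_loss(w_q_flat, X, W_K, W_V, Y):
    # Single-head self-attention followed by a squared-error loss (assumed setup).
    d, d_k = X.shape[1], W_K.shape[1]
    W_Q = w_q_flat.reshape(d, d_k)
    scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d_k)
    out = softmax(scores) @ (X @ W_V)
    return 0.5 * np.mean((out - Y) ** 2)

def numerical_hessian(f, w, eps=1e-4):
    # Central finite differences for every pair of parameters (i, j).
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps ** 2)
    return H

# Toy problem: 4 tokens, embedding size 3, key/query size 2 (all hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W_K = rng.normal(size=(3, 2))
W_V = rng.normal(size=(3, 3))
Y = rng.normal(size=(4, 3))
w_q = rng.normal(size=6)  # flattened 3x2 query weight matrix

# Hessian of the loss w.r.t. the query weights, for the input X and for 2*X.
H_a = numerical_hessian(lambda w: attention_loss(w, X, W_K, W_V, Y), w_q)
H_b = numerical_hessian(lambda w: attention_loss(w, 2.0 * X, W_K, W_V, Y), w_q)

print("Hessian shape:", H_a.shape)
print("max |H(X) - H(2X)|:", np.abs(H_a - H_b).max())
```

Comparing `H_a` and `H_b` gives a quick empirical look at how the query-weight Hessian changes when the input is rescaled. An automatic-differentiation tool would be the more practical choice at realistic sizes; finite differences are used here only to keep the sketch dependency-light.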
Taxonomy
Topics: Control and Stability of Dynamical Systems · Magnetic Properties and Applications
Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer
