What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

Weronika Ormaniec; Felix Dangel; Sidak Pal Singh

arXiv:2410.10986·cs.LG·March 18, 2025

TL;DR

This paper analyzes the Transformer architecture theoretically by deriving and characterizing the loss Hessian of a single self-attention layer, showing how its non-linear dependencies shape the optimization landscape and set it apart from classical neural networks.

Contribution

It fully derives the loss Hessian of a single self-attention layer and highlights the structural differences and data dependencies that separate it from the Hessian of classical networks, which may help explain why Transformers are harder to optimize.

Findings

01. Transformers have highly non-linear, data-dependent Hessians.

02. Structural differences in the Hessian distinguish Transformers from classical networks.

03. These differences impact the optimization landscape and training dynamics.

Abstract

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning--to the extent that, in comparison to MLPs/CNNs, Transformers are more often accompanied by adaptive optimizers, layer normalization, learning rate warmup, etc. The root causes behind these outward manifestations and the precise mechanisms that govern them remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures--grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer's Hessian…
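To make the idea of a per-layer Hessian with different dependencies concrete, here is a minimal sketch in plain Python. It is not the authors' code and does not reproduce their derivation. It assumes a squared-error loss, a tiny single-head self-attention layer `softmax(X Wq Wk^T X^T / sqrt(dk)) X Wv`, and central finite differences instead of the paper's matrix-derivative expressions. All names (`loss`, `hessian_fd`, `block_gap`) and all numbers are hypothetical. The sketch compares two Hessian blocks under two different targets: the value block, where the output is linear in `Wv`, and the query block, which passes through the softmax.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        out.append([v / z for v in e])
    return out

def reshape(flat, rows, cols):
    """Turn a flat list into a row-major rows x cols matrix."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def loss(theta, X, Y, d, dk, dv):
    """Squared-error loss (assumed) of one self-attention layer.

    theta packs Wq (d x dk), Wk (d x dk), Wv (d x dv) in that order.
    """
    n = d * dk
    Wq = reshape(theta[:n], d, dk)
    Wk = reshape(theta[n:2 * n], d, dk)
    Wv = reshape(theta[2 * n:], d, dv)
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    Kt = [list(c) for c in zip(*K)]
    scores = [[s / math.sqrt(dk) for s in row] for row in matmul(Q, Kt)]
    F = matmul(softmax_rows(scores), V)
    return 0.5 * sum((f - y) ** 2 for fr, yr in zip(F, Y) for f, y in zip(fr, yr))

def hessian_fd(f, theta, h=1e-3):
    """Approximate the Hessian of f at theta with central finite differences."""
    n = len(theta)

    def shifted(i, si, j, sj):
        t = list(theta)
        t[i] += si * h
        t[j] += sj * h
        return f(t)

    return [[(shifted(i, 1, j, 1) - shifted(i, 1, j, -1)
              - shifted(i, -1, j, 1) + shifted(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)] for i in range(n)]

def block_gap(Ha, Hb, idx):
    """Largest absolute entry-wise difference between two Hessians on one block."""
    return max(abs(Ha[i][j] - Hb[i][j]) for i in idx for j in idx)

# Illustrative sizes and values only: 2 tokens, model width 2, one query/key/value dim.
d, dk, dv = 2, 1, 1
X = [[1.0, 0.5], [-0.5, 1.0]]
theta = [1.0, -0.5, 2.0, -1.0, 1.0, -0.7]  # Wq, Wk, Wv flattened
Y_a = [[0.0], [0.0]]
Y_b = [[1.0], [-1.0]]

H_a = hessian_fd(lambda t: loss(t, X, Y_a, d, dk, dv), theta)
H_b = hessian_fd(lambda t: loss(t, X, Y_b, d, dk, dv), theta)

query_idx, value_idx = range(0, 2), range(4, 6)
query_gap = block_gap(H_a, H_b, query_idx)  # changes with the targets
value_gap = block_gap(H_a, H_b, value_idx)  # stays the same up to rounding

print("query block gap:", query_gap)
print("value block gap:", value_gap)
```

Under these assumptions the value block is unaffected by the targets, because the loss is exactly quadratic in `Wv`, while the query block shifts with the residual because the softmax contributes second-derivative terms. This only illustrates that the blocks behave differently; it says nothing about the specific scaling laws the paper derives.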

Code & Models

Repositories

dalab/transformer-hessian (JAX, official)

Videos

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis · SlidesLive

Taxonomy

Topics: Control and Stability of Dynamical Systems · Magnetic Properties and Applications

Methods: Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Attention Is All You Need · Linear Layer