From Attention to Activation: Unravelling the Enigmas of Large Language Models
Prannay Kaul, Chengcheng Ma, Ismail Elezi, Jiankang Deng

TL;DR
This paper investigates two peculiar behaviours of large language models, first-token dominance in attention and large outlier activations, and proposes methods that mitigate both and improve performance under quantisation.
Contribution
The paper introduces the softmax-1 reformulation of softmax and the OrthoAdam optimiser to address attention and activation anomalies in Transformers, making the models more robust to quantisation.
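As an illustration of the first contribution, the sketch below compares standard softmax with a softmax-1 variant. It assumes softmax-1 takes the commonly cited form exp(x_i) / (1 + sum_j exp(x_j)), which is not spelled out on this page; the function names are hypothetical. The extra 1 in the denominator lets a head assign close to zero total attention when no key is relevant, instead of being forced to put its probability mass somewhere (for example, on the first token).

```python
import math


def softmax(scores):
    """Standard softmax: the outputs always sum to exactly 1."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_1(scores):
    """Assumed softmax-1 form: exp(x_i) / (1 + sum_j exp(x_j)).

    Equivalent to a softmax over the scores plus one extra logit fixed at 0
    whose probability is discarded, so the outputs can sum to less than 1.
    """
    m = max(max(scores), 0.0)  # include the implicit 0 logit in the shift
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]


# When every score is very negative, softmax still spreads a total mass of 1,
# whereas softmax-1 lets the head attend to (almost) nothing.
weak_scores = [-10.0, -10.0, -10.0]
print(sum(softmax(weak_scores)))
print(sum(softmax_1(weak_scores)))
```

When the scores are large, the extra term is negligible and the two functions behave almost identically.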
Findings
Attention to first token reduced from 65% to 3.3%.
Activation kurtosis decreased from 1657 to 3.1.
Perplexity penalty under 4-bit quantisation reduced from 3565 to 0.3.
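To give a feel for the kurtosis figures, the sketch below computes kurtosis for a synthetic hidden state with and without a single outlier. It assumes kurtosis means the standard (non-excess) fourth standardised moment, under which a Gaussian scores 3; the page does not state which definition the paper uses, and the data here is synthetic, not taken from any model.

```python
import random


def kurtosis(values):
    """Non-excess kurtosis: E[(x - mean)^4] / (E[(x - mean)^2])^2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0.0:
        raise ValueError("kurtosis is undefined for constant values")
    return m4 / (m2 ** 2)


rng = random.Random(0)

# Synthetic "hidden state": 4096 roughly Gaussian channel activations.
hidden = [rng.gauss(0.0, 1.0) for _ in range(4096)]
print(kurtosis(hidden))  # close to 3, the Gaussian value

# The same vector with one channel replaced by a large outlier activation.
hidden_with_outlier = list(hidden)
hidden_with_outlier[0] = 500.0
print(kurtosis(hidden_with_outlier))  # far larger than 3
```

A single extreme channel is enough to push kurtosis up by orders of magnitude, which is why values near 3 indicate that outlier activations are largely absent.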
Abstract
We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but additionally, they enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are…
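The abstract describes OrthoAdam only as an optimiser that uses orthogonal matrices to transform gradients. One plausible reading, sketched below, is to rotate each gradient with a fixed orthogonal matrix Q, run the usual Adam moment updates in that rotated basis, and rotate the resulting step back with Q^T before applying it. This is an assumption-laden illustration, not the authors' implementation: the class name, the use of a random Householder reflection for Q, and all hyperparameter defaults are hypothetical.

```python
import math
import random


def householder(dim, rng):
    """Random orthogonal matrix Q = I - 2 v v^T / (v . v) (a reflection)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(dim)] for i in range(dim)]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


class OrthoAdamSketch:
    """Adam applied in a fixed, randomly rotated basis (hypothetical sketch)."""

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        self.q = householder(dim, random.Random(seed))
        self.q_t = transpose(self.q)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * dim  # first-moment estimate, in the rotated basis
        self.v = [0.0] * dim  # second-moment estimate, in the rotated basis
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        g_rot = matvec(self.q, grads)  # rotate the gradient: g' = Q g
        self.m = [self.beta1 * m + (1 - self.beta1) * g
                  for m, g in zip(self.m, g_rot)]
        self.v = [self.beta2 * v + (1 - self.beta2) * g * g
                  for v, g in zip(self.v, g_rot)]
        m_hat = [m / (1 - self.beta1 ** self.t) for m in self.m]
        v_hat = [v / (1 - self.beta2 ** self.t) for v in self.v]
        step_rot = [m / (math.sqrt(v) + self.eps) for m, v in zip(m_hat, v_hat)]
        update = matvec(self.q_t, step_rot)  # rotate the step back: Q^T s'
        return [p - self.lr * u for p, u in zip(params, update)]


# Usage: minimise f(x) = sum(x_i^2), whose gradient is 2x.
params = [1.0, -2.0, 3.0]
initial_loss = sum(p * p for p in params)
optimiser = OrthoAdamSketch(dim=3, lr=0.05)
for _ in range(500):
    params = optimiser.step(params, [2.0 * p for p in params])
final_loss = sum(p * p for p in params)
print(initial_loss, final_loss)
```

The intuition behind this reading is that Adam's per-coordinate normalisation then acts on a rotated basis rather than on the model's own parameter axes. A dense matrix per parameter is used here only for clarity and would be impractical at scale.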
Taxonomy
Topics: Topic Modeling
Methods: Attention Is All You Need · Adam · Softmax · LLaMA
