From Attention to Activation: Unravelling the Enigmas of Large Language Models

Prannay Kaul; Chengcheng Ma; Ismail Elezi; Jiankang Deng

arXiv:2410.17174·cs.CL·October 23, 2024

Open Access

TL;DR

This paper investigates two anomalies in large language models, first-token dominance in attention and large outlier activations, and proposes methods that mitigate both and improve performance under quantisation.

Contribution

The paper introduces the softmax-1 reformulation and the OrthoAdam optimiser to address attention and activation anomalies in Transformers, making the models more robust to quantisation.
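The difference between softmax and softmax-1 can be illustrated with a small sketch. This page does not give the formula, so the definition below is an assumption: softmax-1 is taken to be the usual softmax with an extra 1 added to the denominator, which lets an attention head assign little weight to every token instead of forcing the weights to sum to 1. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Standard softmax: the outputs always sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_1(scores):
    """Assumed softmax-1: exp(x_i) / (1 + sum_j exp(x_j)).

    The extra 1 behaves like an implicit logit fixed at zero, so the
    outputs can sum to less than 1.
    """
    m = max(max(scores), 0.0)  # the implicit zero logit joins the max
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)  # exp(-m) is the rescaled "+1"
    return [e / total for e in exps]


# All scores are strongly negative: no token is a good match.
scores = [-4.0, -5.0, -6.0]
print(sum(softmax(scores)))    # weights are forced to sum to 1
print(sum(softmax_1(scores)))  # weights may stay close to 0
```

Under this assumed definition, a head with nothing relevant to attend to no longer needs a fixed position, such as the first token, to absorb the leftover attention mass.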

Findings

01

Attention to first token reduced from 65% to 3.3%.

02

Activation kurtosis decreased from 1657 to 3.1.

03

Perplexity penalty under 4-bit quantisation reduced from 3565 to 0.3.
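To see why kurtosis and the quantisation penalty are related, the sketch below computes kurtosis (the fourth standardised moment, about 3 for a Gaussian) and applies a simple 4-bit quantiser to a vector with and without one outlier. The quantiser is an assumption: this page only says "basic algorithms", so symmetric absmax rounding is used here purely for illustration, and all function names are hypothetical.

```python
def kurtosis(values):
    """Fourth standardised moment: E[(x - mean)^4] / variance^2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / var ** 2


def absmax_quantise(values, bits=4):
    """Quantise to signed integers scaled by the largest magnitude,
    then map back to floats (assumed scheme, for illustration only)."""
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


clean = [-1.0, 1.0] * 500      # no outliers
spiky = clean[:-1] + [100.0]   # same values with a single large outlier

for name, acts in (("clean", clean), ("spiky", spiky)):
    err = mean_abs_error(acts, absmax_quantise(acts))
    print(name, "kurtosis:", kurtosis(acts), "4-bit error:", err)
```

In this toy case the single outlier stretches the quantisation scale so far that the ordinary values round to zero, which is the kind of failure that low-kurtosis activations avoid.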

Abstract

We study two strange phenomena in auto-regressive Transformers: (1) the dominance of the first token in attention heads; (2) the occurrence of large outlier activations in the hidden states. We find that popular large language models, such as Llama, attend maximally to the first token in 98% of attention heads, a behaviour we attribute to the softmax function. To mitigate this issue, we propose a reformulation of softmax to softmax-1. Furthermore, we identify adaptive optimisers, e.g. Adam, as the primary contributor to the large outlier activations and introduce OrthoAdam, a novel optimiser that utilises orthogonal matrices to transform gradients, to address this issue. Finally, not only do our methods prevent these phenomena from occurring, but additionally, they enable Transformers to sustain their performance when quantised using basic algorithms, something that standard methods are…
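The abstract describes OrthoAdam only as an optimiser that uses orthogonal matrices to transform gradients. The following is a minimal sketch of one way that idea could work, not the authors' implementation. It assumes a fixed orthogonal matrix per parameter vector, Adam moment estimates kept in the rotated basis, and the update rotated back before it is applied. The Householder construction, the class name, and the hyperparameter defaults are all illustrative assumptions.

```python
import math
import random


def householder(dim, seed=0):
    """Build an orthogonal matrix Q = I - 2 v v^T / (v^T v) from a random v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm_sq = sum(x * x for x in v)
    return [[(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq
             for j in range(dim)] for i in range(dim)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


class OrthoAdamSketch:
    """Hypothetical Adam variant that keeps its moments in a rotated basis."""

    def __init__(self, dim, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        self.q = householder(dim, seed)  # fixed orthogonal transform
        self.q_t = transpose(self.q)     # inverse of an orthogonal matrix
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * dim
        self.v = [0.0] * dim
        self.t = 0

    def step(self, params, grad):
        self.t += 1
        g = matvec(self.q, grad)  # rotate the gradient
        self.m = [self.beta1 * m + (1 - self.beta1) * gi
                  for m, gi in zip(self.m, g)]
        self.v = [self.beta2 * v + (1 - self.beta2) * gi * gi
                  for v, gi in zip(self.v, g)]
        m_hat = [m / (1 - self.beta1 ** self.t) for m in self.m]
        v_hat = [v / (1 - self.beta2 ** self.t) for v in self.v]
        update = [mh / (math.sqrt(vh) + self.eps)
                  for mh, vh in zip(m_hat, v_hat)]
        update = matvec(self.q_t, update)  # rotate the update back
        return [p - self.lr * u for p, u in zip(params, update)]


# Usage: minimise 0.5 * ||params - target||^2, whose gradient is params - target.
target = [1.0, -2.0, 3.0, 0.5]
params = [0.0] * 4
opt = OrthoAdamSketch(dim=4)
for _ in range(500):
    grad = [p - t for p, t in zip(params, target)]
    params = opt.step(params, grad)
print(params)
```

The intuition behind this sketch is that Adam's per-coordinate scaling then acts on a rotated basis rather than on the model's own parameter axes, so no single hidden dimension is singled out by the optimiser.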

Taxonomy

Topics: Topic Modeling

Methods: Attention Is All You Need · Adam · Softmax · LLaMA