Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Mingyu Jin; Kai Mei; Wujiang Xu; Mingjie Sun; Ruixiang Tang; Mengnan Du; Zirui Liu; Yongfeng Zhang

arXiv:2502.01563 · cs.CL · May 22, 2025

TL;DR

This paper shows that massive values in self-attention modules are crucial for how large language models understand contextual knowledge, that Rotary Positional Encoding (RoPE) drives their concentration, and that this has implications for model interpretability and design.

Contribution

It shows that massive values emerge in attention queries (Q) and keys (K), links their emergence to RoPE, and ties them to contextual knowledge understanding rather than parametric knowledge retrieval.

Findings

01

Massive values consistently emerge in specific attention regions.

02

Ignoring massive values reduces performance on contextual tasks.

03

Rotary Positional Encoding causes the concentration of massive values.

Abstract

Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such pattern appears in values (V), across various modern transformer-based LLMs (Q, K, and V denote the representations output by the query, key, and value layers, respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning…

Code & Models

Repositories

mingyuj666/rope_with_llm (PyTorch, official)

Taxonomy

Topics: Innovative Teaching and Learning Methods · Educational Strategies and Epistemologies · Intelligent Tutoring Systems and Adaptive Learning

Methods: Softmax · Attention Is All You Need