Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Qingyue Yang; Jie Wang; Xing Li; Yinqi Bai; Xialiang Tong; Huiling Zhen; Jianye Hao; Mingxuan Yuan; Bin Li

arXiv:2601.21709·cs.CL·January 30, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces TAPPA, a unifying framework that explains diverse attention patterns in large language models by analyzing their mathematical properties from a temporal perspective, improving understanding and inference efficiency.

Contribution

TAPPA provides a unified mathematical analysis of attention patterns, linking their regularities to query self-similarity, and applies these insights to enhance LLM inference tasks.
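As a rough illustration of what "query self-similarity" could mean in practice, the sketch below measures the cosine similarity between consecutive query vectors of a single head and checks how far a query step can move any attention logit. This is not the paper's metric or code: the function names, the use of consecutive-step cosine similarity, the unscaled dot-product logits, and the omission of positional encoding are all assumptions made for illustration. The bound used is the standard Cauchy-Schwarz inequality, |(q_{t+1} - q_t) · k_j| ≤ ||q_{t+1} - q_t|| · ||k_j||.

```python
import numpy as np


def query_self_similarity(queries: np.ndarray) -> np.ndarray:
    """Cosine similarity between consecutive queries (hypothetical helper).

    queries: array of shape (T, d), one query vector per decoding step.
    Returns an array of shape (T - 1,).
    """
    unit = queries / np.linalg.norm(queries, axis=-1, keepdims=True)
    return np.sum(unit[1:] * unit[:-1], axis=-1)


def logit_drift(queries: np.ndarray, keys: np.ndarray):
    """Per-step change in unscaled logits q.k and its Cauchy-Schwarz bound.

    queries: (T, d); keys: (N, d).
    Returns (change, bound), both of shape (T - 1, N).
    """
    dq = queries[1:] - queries[:-1]
    change = np.abs(dq @ keys.T)
    bound = (
        np.linalg.norm(dq, axis=-1)[:, None]
        * np.linalg.norm(keys, axis=-1)[None, :]
    )
    return change, bound


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d = 128, 64
    keys = rng.normal(size=(T, d))

    # Slowly drifting queries: a fixed base vector plus a small random walk.
    smooth = rng.normal(size=d) + 0.01 * rng.normal(size=(T, d)).cumsum(axis=0)
    # Queries with no temporal structure at all.
    noisy = rng.normal(size=(T, d))

    for name, q in [("smooth", smooth), ("noisy", noisy)]:
        sim = query_self_similarity(q).mean()
        change, _ = logit_drift(q, keys)
        print(f"{name}: mean self-similarity={sim:.3f}, "
              f"mean |logit change|={change.mean():.3f}")
```

Under these assumptions, the slowly drifting queries have self-similarity close to 1 and small step-to-step logit changes, which is the intuition behind treating high self-similarity as a sign of a stable, predictable pattern.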

Findings

01

Predictable attention patterns correlate with query self-similarity.

02

TAPPA's metrics improve KV cache compression performance.

03

Insights from TAPPA enhance LLM pruning methods.

Abstract

Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The analysis linking attention stability to query self-similarity (e.g., Proposition 4.1), including how query drift induces changes in the logits, is novel and interesting; likewise Theorems 5.2 and 5.3, which provide conditions under which sequential/periodic diagonals appear.
- Using q-similarity in CAKE and ShortGPT yields significant improvements in several settings.

Weaknesses

- Improvements in CAKE (Tab. 1) seem very marginal; are they statistically significant? Averages are not clearly reported.
- Computing q-similarities for every layer/head seems computationally expensive, but runtimes/costs are not discussed in depth.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- The temporal continuity perspective provides a systematic way to understand previously fragmented observations about attention patterns. The decomposition view connecting query similarity to pattern stability is intuitive.
- Rigorous mathematical treatment: the theorems provide formal proofs for the emergence of different pattern types, with explicit bounds relating pattern stability to query/key properties and RoPE parameters.
- Novel insight on periodic sequential patterns: The analysis of d

Weaknesses

- Limited novelty: the observation that query continuity drives attention stability was already made by AttentionPredictor (as the authors acknowledge). While this paper provides a mathematical formalization, the fundamental insight is not new.
- KV cache compression: the improvements over CAKE are marginal and seem to be within noise margins. Other state-of-the-art methods, such as DuoAttention and Expected Attention, could serve as stronger baselines.
- LLM pruning is only compared against a single bas

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper provides a new perspective for explaining existing attention patterns from the query point of view.
2. The paper demonstrates how the proposed query-level observations can inform the design of sparse attention and layer pruning, adding practical value to the theoretical analysis.

Weaknesses

1. The claim of analyzing the *joint effect of input dynamics and positional encoding* seems overstated. While it would be valuable to disentangle and quantify their respective contributions, the paper instead merges them into the post-encoding query. This makes the connection to the original input less clear than the abstract and introduction suggest.
2. Several assumptions used in the derivations are not carefully validated, which raises concerns about the reliability of the conclusions.


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning in Healthcare