On the token distance modeling ability of higher RoPE attention dimension
Xiangyu Hong, Che Jiang, Biqing Qi, Fandong Meng, Mo Yu, Bowen Zhou, Jie Zhou

TL;DR
This paper investigates how the higher dimensions of attention heads using Rotary position embedding (RoPE) capture long-range dependencies in language models, identifying specific attention heads that focus on long-distance information and contribute to length extrapolation.
Contribution
The study introduces a dimension-level analysis revealing Positional Heads that are crucial for long-range dependency modeling in length-extrapolated models.
Findings
Positional Heads focus on long-range information interaction.
Attention allocated to the higher RoPE dimensions correlates with length extrapolation.
Ablation confirms the importance of Positional Heads in long input processing.
Abstract
Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, understanding how position embedding can capture longer-range contextual information remains elusive. Based on the intuition that different dimensions correspond to different frequencies of change in RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention head, which we named Positional Heads, in various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidenced by our ablation. We further demonstrate the…
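The intuition in the abstract, that each RoPE dimension changes at its own frequency, can be illustrated with a small sketch. It assumes the standard RoPE frequency schedule theta_i = base^(-2i/d) with base 10000 and an interleaved pairing of dimensions (some implementations pair the first and second halves of the vector instead). The function names are hypothetical and this is not the paper's correlation metric; it only shows that higher-indexed dimension pairs rotate more slowly, and that each pair's contribution to the attention logit depends only on the token distance.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Angular frequency of each 2-D rotation pair: theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_rotate(vec, position, freqs):
    """Apply RoPE by rotating each adjacent pair (vec[2i], vec[2i+1])."""
    out = []
    for i, theta in enumerate(freqs):
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def per_pair_logit(q, k, q_pos, k_pos, freqs):
    """Contribution of each rotation pair to the unscaled q.k logit."""
    rq = rope_rotate(q, q_pos, freqs)
    rk = rope_rotate(k, k_pos, freqs)
    return [
        rq[2 * i] * rk[2 * i] + rq[2 * i + 1] * rk[2 * i + 1]
        for i in range(len(freqs))
    ]

# Toy setup (assumed values, chosen only for illustration).
head_dim = 8
freqs = rope_frequencies(head_dim)

# Number of token positions needed for one full rotation of each pair.
wavelengths = [2 * math.pi / f for f in freqs]

q = [1.0] * head_dim
k = [1.0] * head_dim
near = per_pair_logit(q, k, q_pos=10, k_pos=3, freqs=freqs)      # distance 7
shifted = per_pair_logit(q, k, q_pos=107, k_pos=100, freqs=freqs)  # distance 7

for i, (w, c) in enumerate(zip(wavelengths, near)):
    print(f"pair {i}: wavelength {w:.1f} tokens, logit contribution {c:.4f}")
```

Under these assumptions the wavelength grows with the pair index, so the highest pairs barely rotate over short distances and only separate tokens that are far apart, which is consistent with the abstract's focus on higher dimensions for long-distance dependencies.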
Taxonomy
Topics: Nuclear Physics and Applications
Methods: Softmax · Attention Is All You Need · Focus
