On the token distance modeling ability of higher RoPE attention dimension

Xiangyu Hong; Che Jiang; Biqing Qi; Fandong Meng; Mo Yu; Bowen Zhou; Jie Zhou

arXiv:2410.08703·cs.CL·October 22, 2024

TL;DR

This paper investigates how the higher dimensions of attention heads using Rotary position embedding (RoPE) capture long-range dependencies in language models, and identifies specific attention heads that focus on long-distance information and contribute to length extrapolation.

Contribution

The study introduces a dimension-level analysis revealing Positional Heads that are crucial for long-range dependency modeling in length-extrapolated models.

Findings

01. Positional Heads focus on long-range information interaction.

02. High-dimensional attention allocation correlates with length extrapolation.

03. Ablation confirms the importance of Positional Heads in long input processing.

Abstract

Length extrapolation algorithms based on Rotary position embedding (RoPE) have shown promising results in extending the context length of language models. However, how position embedding captures longer-range contextual information remains poorly understood. Based on the intuition that different dimensions correspond to different frequencies of change in the RoPE encoding, we conducted a dimension-level analysis to investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies. Using our correlation metric, we identified a particular type of attention head, which we named Positional Heads, in various length-extrapolated models. These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing, as evidenced by our ablation. We further demonstrate the…
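
The intuition that different RoPE dimensions change at different frequencies can be illustrated with a small sketch. This is not the paper's code or its correlation metric. It is a minimal pure-Python illustration of standard RoPE, assuming the common base of 10000, an interleaved pairing of dimensions (some implementations pair the first and second halves of the head instead), and a toy head size of 8. The function names and the vectors q and k are hypothetical.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Rotation frequency of each 2-D pair: theta_i = base ** (-2i / head_dim).

    Assumption: base=10000 is the common default, not a value taken from the paper.
    """
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def rope_rotate(vec, position, base=10000.0):
    """Apply RoPE by rotating each consecutive pair (x_2i, x_2i+1) by position * theta_i."""
    out = []
    for i, theta in enumerate(rope_frequencies(len(vec), base)):
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

head_dim = 8  # toy size chosen for illustration only
freqs = rope_frequencies(head_dim)
# Wavelength = number of positions needed for one full rotation of that pair.
wavelengths = [2 * math.pi / f for f in freqs]

# Hypothetical query/key vectors, used only to show the relative-position property.
q = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1, 0.9, -0.4]
k = [1.0, 0.2, -0.7, 0.4, 0.6, -0.3, 0.1, 0.5]

# Both pairs of positions are 7 tokens apart, so the attention logits should match.
score_a = dot(rope_rotate(q, 10), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 107), rope_rotate(k, 100))

for i, (f, w) in enumerate(zip(freqs, wavelengths)):
    print(f"pair {i}: frequency={f:.4g}, wavelength={w:.4g} positions")
print(f"score at distance 7: {score_a:.6f} vs {score_b:.6f}")
```

In this sketch the higher-indexed pairs rotate more slowly and therefore have longer wavelengths, so their contribution to the attention logit varies gradually over large token distances. That is the property the dimension-level analysis builds on when it relates higher dimensions to long-distance dependencies.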

Taxonomy

Topics: Nuclear Physics and Applications

Methods: Softmax · Attention Is All You Need · Focus