When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

TL;DR
This paper identifies numerical issues with RoPE in long-context training using BFloat16 and introduces AnchorAttention, a novel method that improves long-context performance and training efficiency in large language models.
Contribution
The paper proposes AnchorAttention, a plug-and-play attention mechanism that mitigates BFloat16 numerical issues in RoPE, enhancing long-context processing and reducing training time.
Findings
AnchorAttention significantly improves long-context performance.
Training time is reduced by over 50% with AnchorAttention.
The method preserves general task capabilities of LLMs.
Abstract
Extending context window sizes allows large language models (LLMs) to process longer sequences and handle more complex tasks. Rotary Positional Embedding (RoPE) has become the de facto standard due to its relative positional encoding properties that benefit long-context training. However, we observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding, especially in long-context scenarios. This issue arises from BFloat16's limited precision and accumulates as context length increases, with the first token contributing significantly to this problem. To address this, we develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16, improves long-context capabilities, and speeds up training. AnchorAttention reduces unnecessary attention computations, maintains…
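The precision effect described in the abstract can be illustrated with a toy check. This is a minimal sketch, not the paper's experiment: it emulates BFloat16 rounding in pure Python and compares a RoPE attention score at positions (m, n) with the score at shifted positions (m + s, n + s). In exact arithmetic the two agree, because the score depends only on m - n. The helper names (`to_bf16`, `rotate`, `rope_score`, `shift_deviation`), the head dimension of 64, the base of 10000, and the query/key values are all assumptions chosen for illustration.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    Emulation only: x is first rounded to float32, then the low 16 bits are
    rounded away. NaN, inf, and overflow are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)  # ties-to-even on the dropped bits
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rotate(x0, x1, angle, cast):
    """Rotate the pair (x0, x1) by angle, applying cast after every operation."""
    c, s = cast(math.cos(angle)), cast(math.sin(angle))
    return (cast(cast(x0 * c) - cast(x1 * s)),
            cast(cast(x0 * s) + cast(x1 * c)))


def rope_score(q, k, m, n, cast=float, base=10000.0):
    """Dot product of q rotated to position m and k rotated to position n.

    The rotations use cast; the final sum is accumulated in full precision.
    """
    d = len(q)
    score = 0.0
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair rotation frequency
        q0, q1 = rotate(cast(q[i]), cast(q[i + 1]), m * theta, cast)
        k0, k1 = rotate(cast(k[i]), cast(k[i + 1]), n * theta, cast)
        score += q0 * k0 + q1 * k1
    return score


def shift_deviation(q, k, m, n, shifts, cast=float):
    """Largest change in the score when both positions move by the same shift."""
    ref = rope_score(q, k, m, n, cast)
    return max(abs(rope_score(q, k, m + s, n + s, cast) - ref) for s in shifts)


# Hypothetical, deterministic query/key vectors for a single 64-dim head.
D = 64
Q = [math.sin(i + 1.0) for i in range(D)]
K = [math.cos(2.0 * i + 1.0) for i in range(D)]
SHIFTS = [1000, 4000, 16000]

if __name__ == "__main__":
    print("full precision:", shift_deviation(Q, K, 100, 0, SHIFTS))
    print("emulated bf16: ", shift_deviation(Q, K, 100, 0, SHIFTS, cast=to_bf16))
```

In this sketch the full-precision deviation stays at rounding-noise level, while the emulated BFloat16 deviation is larger, so the score is no longer a function of the relative distance alone. The same helper also shows how coarse the format is: `to_bf16(257.0)` returns `256.0`, since BFloat16 keeps only 8 significant bits.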
Taxonomy
Topics: Genetics and Physical Performance · Sports Performance and Training
Methods: Softmax · Attention Is All You Need
