Fractional Rotation, Full Potential? Investigating Performance and Convergence of Partial RoPE
Mohammad Aflah Khan, Krishna P. Gummadi, Manish Gupta, Abhilasha Ravichander

TL;DR
This paper systematically investigates applying Rotary Positional Embedding (RoPE) to only a fraction of transformer hidden dimensions. It finds that rotating a small fraction of dimensions shrinks the RoPE cache while keeping final loss comparable to full RoPE, and it offers practical insights for training.
Contribution
It provides the first comprehensive analysis of partial RoPE, demonstrating that applying RoPE to about 10% of dimensions suffices for convergence and stability across various models and datasets.
Findings
Applying RoPE to 10% of dimensions achieves similar convergence to full RoPE.
Partial RoPE shrinks the RoPE cache, saving up to 10x memory relative to the standard RoPE cache.
Minimal RoPE application improves training stability, especially with NoPE.
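To make the memory finding concrete, here is a back-of-the-envelope sketch of how the cos/sin cache scales with the rotated fraction. The function name, the cache layout (one cos and one sin value per position and rotated pair), and the example sizes are illustrative assumptions, not details taken from the paper.

```python
def rope_cache_elements(seq_len: int, head_dim: int, rotary_fraction: float = 1.0) -> int:
    """Count cached cos/sin values for one RoPE table (hypothetical layout)."""
    # RoPE rotates dimensions in pairs, so round the rotated width down to an even number.
    rotary_dim = int(head_dim * rotary_fraction)
    rotary_dim -= rotary_dim % 2
    # One cos and one sin per (position, rotated pair): seq_len * (rotary_dim / 2) * 2.
    return seq_len * rotary_dim


# Assumed example sizes: 4096 positions, head dimension 128.
full = rope_cache_elements(4096, 128, 1.0)
partial = rope_cache_elements(4096, 128, 0.1)
print(full, partial, full / partial)
```

Under these assumptions the cache grows linearly with both sequence length and the number of rotated dimensions, so rotating roughly a tenth of the dimensions shrinks it by roughly a factor of ten at any context length.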
Abstract
Rotary Positional Embedding (RoPE) is a common choice in transformer architectures for encoding relative positional information. Although earlier work has examined omitting RoPE in specific layers, the effect of varying the fraction of hidden dimensions that receive rotary transformations remains largely unexplored. This design choice can yield substantial memory savings, which becomes especially significant at long context lengths. We find up to 10x memory savings over the standard RoPE cache, while achieving comparable final loss. In this work, we present a systematic study examining the impact of partial RoPE on training dynamics and convergence across architectures and datasets. Our findings uncover several notable patterns: (1) applying RoPE to only a small fraction of dimensions (around 10%) achieves convergence comparable to using full RoPE; (2) these trends hold consistently…
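As an illustration of what rotating only a fraction of dimensions can look like, the following is a minimal sketch for a single query or key vector. It assumes the rotated dimensions are the leading ones, that adjacent entries are paired, and that frequencies use the common base of 10000 over the rotated width; the paper's exact layout may differ, and `partial_rope` is a hypothetical name.

```python
import math


def partial_rope(x, position, rotary_fraction=0.1, base=10000.0):
    """Rotate the leading rotary_dim entries of x by position; pass the rest through."""
    head_dim = len(x)
    rotary_dim = int(head_dim * rotary_fraction)
    rotary_dim -= rotary_dim % 2  # RoPE rotates dimensions in pairs
    out = list(x)
    for i in range(0, rotary_dim, 2):
        # Frequency for this pair, computed over the rotated width only (assumption).
        theta = position * base ** (-i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.3, 1.5, -0.7, 0.9, 0.1]
k = [1.2, 0.4, -0.6, 0.8, 0.2, 1.1, -1.3, 0.5]

# With 8 dimensions and a fraction of 0.25, only the first pair is rotated.
score_a = dot(partial_rope(q, 5, 0.25), partial_rope(k, 3, 0.25))
score_b = dot(partial_rope(q, 2, 0.25), partial_rope(k, 0, 0.25))
print(abs(score_a - score_b) < 1e-9)  # scores depend only on the position offset
```

The unrotated dimensions contribute a position-independent term to the attention score, while the rotated pairs carry the relative-position signal. Setting `rotary_fraction` to 0 makes the function an identity map, which corresponds to a NoPE-style vector under the same assumptions.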
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Multimodal Machine Learning Applications
