Fast RoPE Attention: Combining the Polynomial Method and Fast Fourier Transform

Josh Alman; Zhao Song

arXiv:2505.11892 · cs.LG · May 20, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces an algorithm for RoPE attention computation in transformers that combines the polynomial method with the Fast Fourier Transform (FFT) to run in almost linear time under a bounded entry assumption.

Contribution

The paper presents the first almost linear time algorithm for RoPE attention, overcoming the limitations that position embeddings imposed on previous techniques.

Findings

01 · Achieves near-linear time complexity for RoPE attention

02 · Demonstrates the effectiveness of combining polynomial method and FFT

03 · Validates the bounded entry assumption as necessary for efficiency

Abstract

The transformer architecture has been widely applied to many machine learning tasks. A main bottleneck in the time to perform transformer computations is a task called attention computation. [Alman and Song, NeurIPS 2023] have shown that in the bounded entry regime, there is an almost linear time algorithm to approximate the attention computation. They also proved that the bounded entry assumption is necessary for a fast algorithm assuming the popular Strong Exponential Time Hypothesis. A new version of transformer which uses position embeddings has recently been very successful. At a high level, position embedding enables the model to capture the correlations between tokens while taking into account their position in the sequence. Perhaps the most popular and effective version is Rotary Position Embedding (RoPE), which was proposed by [Su, Lu, Pan, Murtadha, Wen, and Liu,…
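As a rough illustration of how RoPE lets attention scores take position into account, the sketch below rotates consecutive coordinate pairs of a query and a key by position-dependent angles and checks that the resulting inner product depends only on the difference between the two positions. The function name `rope_rotate`, the frequency base of 10000, and the pairing of adjacent coordinates are assumptions chosen for illustration; they follow one common convention and are not taken from the paper.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate consecutive coordinate pairs of x by angles pos * theta_k.

    Hypothetical helper for illustration; `base` and the pairing of
    adjacent coordinates are assumptions, not the paper's notation.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "dimension must be even to form coordinate pairs"
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Both position pairs have the same offset (3), so the scores should agree.
score_a = rope_rotate(q, 5) @ rope_rotate(k, 2)
score_b = rope_rotate(q, 105) @ rope_rotate(k, 102)
print(np.isclose(score_a, score_b))
```

Because each score depends on the position offset rather than on the two absolute positions, the position-dependent part of the score matrix is constant along diagonals, which is the Toeplitz-like structure the reviews refer to.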

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

Clear Motivation & Relevance: The paper highlights the importance of efficient attention mechanisms, especially as RoPE becomes standard in LLMs (Llama, Claude, Gemini, Apple, etc.).

Originality: The combination of the polynomial method with FFT for rescaled Toeplitz matrices is novel, and the authors identify why previous techniques fail in the RoPE case.

Theoretical Rigor: Strong upper and lower bounds are established. The authors are meticulous in showing tightness of their results, c

Weaknesses

Clarity: Some sections (especially those on structured matrix manipulations) assume background in FFT applications and polynomial approximations in algorithms. Additional diagrams or simplified intuition would make the work more accessible to a wider ML/AI audience.

Related Work Scope: The related work is comprehensive regarding theoretical literature, but more discussion about current practical/engineering solutions for fast attention (e.g., FlashAttention variants, hardware-acceler

Reviewer 02 · Rating 4 · Confidence 3

Strengths

(1) The motivation of this paper is clear. The paper explains why classic polynomial-method low-rank arguments break under RoPE (Toeplitz-like structure rather than low rank) and why FFT is the right technique.

(2) The method combines polynomial approximation with fast computation via FFTs.
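To make the link between Toeplitz structure and the FFT concrete, here is a minimal sketch of the standard circulant-embedding trick, which multiplies an n x n Toeplitz matrix by a vector in O(n log n) time instead of O(n^2). It shows only this well-known building block, not the paper's algorithm, which according to the reviews works with rescaled Toeplitz matrices and combines them with the polynomial method. The names `toeplitz_matvec_fft` and `dense_toeplitz` are hypothetical helpers introduced for this example.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply the Toeplitz matrix with first column c and first row r by x.

    Assumes c[0] == r[0]. Runs in O(n log n) via a 2n x 2n circulant embedding.
    """
    n = len(x)
    # First column of the circulant: [c_0..c_{n-1}, 0, r_{n-1}..r_1].
    col = np.concatenate([c, [0.0], r[:0:-1]])
    x_pad = np.concatenate([x, np.zeros(n)])
    # A circulant matrix-vector product is a circular convolution,
    # which the FFT turns into a pointwise product.
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:n].real

def dense_toeplitz(c, r):
    """Build the dense Toeplitz matrix explicitly (O(n^2) reference)."""
    n = len(c)
    T = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            T[i, j] = c[i - j] if i >= j else r[j - i]
    return T

rng = np.random.default_rng(0)
n = 16
c = rng.standard_normal(n)
r = rng.standard_normal(n)
r[0] = c[0]  # the first row and first column share the diagonal entry
x = rng.standard_normal(n)

print(np.allclose(toeplitz_matvec_fft(c, r, x), dense_toeplitz(c, r) @ x))
```

The dense reference exists only to check the fast routine on a small input; in a setting where n is the sequence length, the point is that the n x n matrix never has to be formed.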

Weaknesses

(1) The theorems are asymptotic; it would help to expose the exact dependence on the polynomial degree and the number of rescaled-Toeplitz summands t after approximation. Here, n is the sequence length. In real applications, does n approach infinity? I think real LLMs use a sliding window, so n is not very large, right?

(2) The paper does not conduct any experiments to show the improvement in computational efficiency. I suggest including some experiments (even small synthetic ones)

Reviewer 03 · Rating 2 · Confidence 4

Strengths

RoPE is a cornerstone of many state-of-the-art LLMs (Llama, Claude, etc.), and developing faster algorithms for it is of significant practical interest.

Weaknesses

(1) The technical novelty is limited. The paper uses FFT to handle Toeplitz-like structures in positional encodings, which is also a known approach in existing models [1]. The primary contribution is the specific application of this technique to RoPE and its combination with the polynomial method, which is another known method. While this is a valid contribution, the paper fails to articulate a more generalizable algorithmic insight beyond this direct combination.

[1] Qin, Zhen, et al. "Toeplitz

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Advanced Graph Neural Networks

Methods: Softmax · Attention Is All You Need