On Fine-Grained I/O Complexity of Attention Backward Passes
Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Song Yue, Jiahao Zhang

TL;DR
This paper analyzes the I/O complexity of attention in large language models, with a focus on the backward pass. It derives tight bounds across all cache sizes and proposes an optimal algorithm for the small-cache regime.
Contribution
It provides a comprehensive theoretical framework for I/O complexity in attention, including new algorithms for small-cache settings and bounds for sparse attention.
Findings
FlashAttention is optimal in large-cache scenarios.
A new algorithm outperforms existing methods in small-cache environments.
Lower bounds are established for sparse attention in both small-cache and large-cache regimes.
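The split between the large-cache and small-cache regimes can be illustrated numerically. The sketch below is only an illustration: this page does not state the bound itself, so the two expressions are an assumption about the form of the paper's Theorem 1.1 (constants dropped), chosen so that they cross at the $M=\Theta(d^2)$ threshold mentioned in the reviews. The function names and the sample values of `n`, `d`, and `M` are hypothetical.

```python
import math

# ASSUMPTION: the two terms below are an assumed form of the asymptotic bound
# (constants omitted); they are not quoted from this page.

def large_cache_term(n: int, d: int, M: int) -> float:
    """Assumed I/O cost of a tiled (FlashAttention-style) schedule."""
    return (n**2 * d**2 + n * d**3) / M

def small_cache_term(n: int, d: int, M: int) -> float:
    """Assumed I/O cost of a schedule built on cache-aware matrix multiplication."""
    return (n**2 * d + n * d**2) / math.sqrt(M)

def io_bound(n: int, d: int, M: int) -> float:
    """The smaller of the two assumed terms."""
    return min(large_cache_term(n, d, M), small_cache_term(n, d, M))

def regime(n: int, d: int, M: int) -> str:
    """Report which assumed term is smaller for sequence length n, head dim d, cache size M."""
    if large_cache_term(n, d, M) <= small_cache_term(n, d, M):
        return "large-cache"
    return "small-cache"

if __name__ == "__main__":
    n, d = 4096, 64  # hypothetical sequence length and head dimension
    for M in (d * d // 4, d * d, 4 * d * d):
        print(M, regime(n, d, M), io_bound(n, d, M))
```

Under this assumed form the ratio of the two terms is $d/\sqrt{M}$, so the crossover sits at $M = d^2$ regardless of `n`.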
Abstract
Large Language Models (LLMs) exhibit exceptional proficiency in handling extensive context windows in natural language. Nevertheless, the quadratic scaling of attention computation relative to sequence length creates substantial efficiency bottlenecks, necessitating the development of I/O-optimized algorithms. In this work, we conduct a systematic examination of the I/O complexity inherent in attention mechanisms, with a specific emphasis on the backward pass under both small and large cache settings. By leveraging the red-blue pebble game framework, we derive tight bounds for I/O complexity across the full spectrum of cache sizes. We validate that FlashAttention, one of the current industry standards, achieves optimality in the large-cache scenario for both forward and backward passes. Conversely, for small-cache environments, we introduce a novel algorithm that outperforms…
Peer Reviews
Decision·Submitted to ICLR 2026
- The authors show that there is room at small cache sizes to potentially provide a speedup over FlashAttention by reducing I/O complexity.
- The paper is pretty easy to follow and does quite a good job situating itself with respect to prior work.
- The authors do not provide an implementation of their algorithm, and so they cannot demonstrate that it actually provides a speedup over FlashAttention. The claim that the “algorithm designed for small cache sizes would become relevant and useful” is speculative. In my view, this is the most significant limitation of this work.
- The result is only applicable for very small cache sizes, and does not apply to modern GPUs typically used for training (A100s, H100s, B200s).
- This paper (like pri
### Originality
* Provides the first matching upper and lower bounds for the backward pass of exact attention for all cache sizes, with a clean phase transition at $M=\Theta(d^2)$ (Theorem 1.1).
* Extends to sparse attention with lower bounds that recover the dense case as a special instance.

### Quality
* Uses the red–blue pebble framework rigorously and states Theorem 1.1 with an explicit formula covering both regimes.
* Gives matching bounds in each regime: large-cache upper (Thm 4.1
1. **Positioning vs prior work could be tighter.** The paper clearly cites Dao et al. (FlashAttention) and Saha & Ye for forward-pass tightness; it mentions Addanki et al. (streaming/approximate attention) in related work, but a compact comparison table clarifying different problem settings (exact vs approximate, streaming vs two-level memory) would help readers situate novelty.
2. **Practical relevance narrative.** The paper *does* discuss when small-cache arises (e.g., per-SM caches on older
- The paper's derivations seem to be solid and rigorous, to the best of my understanding.
- The paper extends the results appearing in the previous work, thus completing the I/O complexity analysis for both forward and backward passes, small and large cache regimes, as well as dense and sparse attention.
- The paper is well-written and easy to follow.
Overall, the paper seems to be a direct extension of [Saha & Ye, 2024], adding tight bounds for the I/O complexity of the attention backward pass. However, the results seem to directly mirror the prior work; the authors utilise the same framework, and provide similar asymptotic bounds and conclusions. Due to this, my impression is that the work, although mathematically solid, is incremental. The small-cache algorithm, as well as the theoretical derivations, seem to follow directly from [Saha &
Taxonomy
Topics: Integrated Circuits and Semiconductor Failure Analysis · Semiconductor materials and devices · VLSI and Analog Circuit Testing
Methods: Softmax · Attention Is All You Need · Attentive Walk-Aggregating Graph Neural Network
