FlashBias: Fast Computation of Attention with Bias
Haixu Wu, Minghao Guo, Yuezhou Ma, Yuanxu Sun, Jianmin Wang, Wojciech Matusik, Mingsheng Long

TL;DR
FlashBias introduces a low-rank method that accelerates attention with bias, improving efficiency without sacrificing accuracy across vision, language, and scientific models.
Contribution
FlashBias provides a novel low-rank compression approach for fast exact or approximate computation of attention with bias, addressing a key efficiency bottleneck.
Findings
Achieves a 1.5× speedup in AlphaFold 3 with no accuracy loss
Achieves over 2× speedup in vision and language models while maintaining accuracy
Theoretically links the optimal efficiency of FlashAttention to the rank of attention weight matrices
Abstract
Attention with bias, which extends standard attention by introducing prior knowledge as an additive bias matrix to the query-key scores, has been widely deployed in vision, language, protein-folding and other advanced scientific models, underscoring its status as a key evolution of this foundational module. However, introducing bias terms creates a severe efficiency bottleneck in attention computation. It disrupts the tightly fused memory-compute pipeline that underlies the speed of accelerators like FlashAttention, thereby stripping away most of their performance gains and leaving biased attention computationally expensive. Surprisingly, despite its common usage, targeted efficiency optimization for attention with bias remains absent, which seriously hinders its application in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is…
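To make the idea of attention with bias and its low-rank treatment concrete, here is a minimal NumPy sketch. It assumes the bias matrix factors exactly as `bias = phi_q @ phi_k.T` with a small rank `r`; the names `phi_q`, `phi_k`, and both function names are hypothetical and not taken from the paper. Under that assumption the bias can be folded into the query-key product by concatenating the factors onto the queries and keys, so the dense bias matrix never has to be materialized. This illustrates the general low-rank principle only and is not the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def biased_attention(q, k, v, bias):
    # Reference: softmax(Q K^T / sqrt(d) + B) V with a dense (n, m) bias.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d) + bias
    return softmax(scores) @ v

def low_rank_biased_attention(q, k, v, phi_q, phi_k):
    # Assumption: bias == phi_q @ phi_k.T, with phi_q (n, r) and phi_k (m, r).
    # Then [Q / sqrt(d), phi_q] @ [K, phi_k]^T == Q K^T / sqrt(d) + bias,
    # so the problem becomes bias-free attention with head dimension d + r.
    d = q.shape[-1]
    q_ext = np.concatenate([q / np.sqrt(d), phi_q], axis=-1)
    k_ext = np.concatenate([k, phi_k], axis=-1)
    return softmax(q_ext @ k_ext.T) @ v

# Small random example (sizes are arbitrary illustration values).
rng = np.random.default_rng(0)
n, m, d, r = 6, 5, 8, 2
q = rng.standard_normal((n, d))
k = rng.standard_normal((m, d))
v = rng.standard_normal((m, d))
phi_q = rng.standard_normal((n, r))
phi_k = rng.standard_normal((m, r))

dense = biased_attention(q, k, v, phi_q @ phi_k.T)
fused = low_rank_biased_attention(q, k, v, phi_q, phi_k)
max_err = np.abs(dense - fused).max()
```

Because the extended form has no separate bias term, it could in principle be passed to a fused bias-free attention kernel; the extra cost is the `r` additional columns in the queries and keys rather than an `n × m` bias read. When the bias is only approximately low rank, the same construction would give an approximate result.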
Taxonomy
Topics: Big Data and Digital Economy · Graph Theory and Algorithms · Stochastic Gradient Optimization Techniques
