Fast Monte-Carlo Approximation of the Attention Mechanism
Hyunjun Kim, JeongGil Ko

TL;DR
This paper presents Monte-Carlo Attention (MCA), a randomized approximation method that significantly reduces the computational cost of self-attention in Transformer models by selectively approximating less important tokens.
Contribution
Introduces MCA, a novel randomized approximation technique for self-attention that reduces computational complexity without altering the model architecture.
Findings
MCA reduces attention computation by up to 11× (measured in FLOPs) on GLUE benchmarks.
MCA maintains model accuracy while approximating low-attention tokens.
Theoretical error bounds support the effectiveness of MCA.
Abstract
We introduce Monte-Carlo Attention (MCA), a randomized approximation method for reducing the computational cost of self-attention mechanisms in Transformer architectures. MCA exploits the fact that the importance of each token in an input sequence varies with respect to their attention scores; thus, some degree of error can be tolerable when encoding tokens with low attention. Using approximate matrix multiplication, MCA applies different error bounds to encode input tokens such that those with low attention scores are computed with relaxed precision, whereas errors of salient elements are minimized. MCA can operate in parallel with other attention optimization schemes and does not require model modification. We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPS) for various Transformer models by up to 11× in GLUE benchmarks without…
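The idea in the abstract, spending less computation on tokens that receive little attention, can be illustrated with a small NumPy sketch. It is not the authors' algorithm. It assumes a standard sampling-based approximate matrix product (rows of the weight matrix sampled with probability proportional to their norms) and an assumed budget rule that scales each token's sample count with the largest attention weight the token receives. The names `sampled_row_product`, `mca_encode`, and `min_samples` are hypothetical.

```python
import numpy as np


def sampled_row_product(x, W, n_samples, rng):
    """Unbiased Monte-Carlo estimate of x @ W from n_samples sampled rows of W."""
    d = W.shape[0]
    if n_samples >= d:
        return x @ W  # full budget: compute the exact product
    # Assumption: sample index k with probability proportional to ||W[k]||.
    norms = np.linalg.norm(W, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(d, size=n_samples, p=p)
    # Each term x[k] * W[k] / p[k] has expectation x @ W; average the samples.
    return (x[idx, None] * W[idx] / p[idx, None]).mean(axis=0)


def mca_encode(X, W, attn, min_samples, rng):
    """Encode tokens X (n, d) with W (d, d_out), using fewer samples
    for tokens that receive low attention in attn (n, n)."""
    n, d = X.shape
    # Importance of token j: the largest attention weight any query gives it.
    importance = attn.max(axis=0)
    # Assumed budget rule: scale linearly with relative importance.
    budgets = np.clip(
        np.ceil(importance / importance.max() * d), min_samples, d
    ).astype(int)
    V = np.stack(
        [sampled_row_product(X[i], W, budgets[i], rng) for i in range(n)]
    )
    return V, budgets


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 16))
    W = rng.normal(size=(16, 8))
    scores = rng.normal(size=(6, 6))
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    V, budgets = mca_encode(X, W, attn, min_samples=4, rng=rng)
    print("samples per token:", budgets)
    print("max abs error per token:", np.abs(V - X @ W).max(axis=1))
```

In the attention output `attn @ V`, the error in row `j` of `V` is weighted by column `j` of `attn`, so tokens with small attention weights contribute little error even when their encoding is coarse. The most-attended token receives the full budget here and is therefore computed exactly.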
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dropout · Position-Wise Feed-Forward Layer
