Fast Monte-Carlo Approximation of the Attention Mechanism

Hyunjun Kim; JeongGil Ko

arXiv:2201.12854·cs.LG·February 1, 2022

TL;DR

This paper presents Monte-Carlo Attention (MCA), a randomized approximation method that cuts the computational cost of self-attention in Transformer models (by up to 11× in FLOPS on GLUE benchmarks) by encoding tokens with low attention scores at relaxed precision.

Contribution

Introduces MCA, a novel randomized approximation technique for self-attention that reduces computational complexity without altering the model architecture.

Findings

01. MCA reduces attention computation by up to 11× in FLOPS.

02. MCA maintains model accuracy while approximating low-attention tokens.

03. Theoretical error bounds support the effectiveness of MCA.

Abstract

We introduce Monte-Carlo Attention (MCA), a randomized approximation method for reducing the computational cost of self-attention mechanisms in Transformer architectures. MCA exploits the fact that the importance of each token in an input sequence varies with respect to their attention scores; thus, some degree of error can be tolerable when encoding tokens with low attention. Using approximate matrix multiplication, MCA applies different error bounds to encode input tokens such that those with low attention scores are computed with relaxed precision, whereas errors of salient elements are minimized. MCA can operate in parallel with other attention optimization schemes and does not require model modification. We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPS) for various Transformer models by up to 11 $\times$ in GLUE benchmarks without…
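The idea in the abstract (approximate matrix multiplication with a per-token error budget driven by attention scores) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the squared-row-norm sampling distribution, the use of the column-wise maximum attention as a token's importance, and the rule mapping importance to a sample count are all assumptions chosen for illustration. The estimator itself is the standard unbiased sampling estimator for a vector-matrix product.

```python
import numpy as np


def sampled_matmul_row(x, W, n_samples, probs, rng):
    """Unbiased Monte-Carlo estimate of x @ W from n_samples sampled indices."""
    idx = rng.choice(x.shape[0], size=n_samples, replace=True, p=probs)
    # Rescale each sampled term x[k] * W[k] by 1 / (n_samples * p_k)
    # so that the expectation of the sum equals x @ W.
    scale = x[idx] / (n_samples * probs[idx])
    return scale @ W[idx]


def mc_value_projection(X, W, attn, min_samples=4, seed=0):
    """Encode each token X[i] @ W with a sample budget that grows with attention.

    Assumptions (hypothetical, for illustration only):
      * sampling probabilities are proportional to squared row norms of W;
      * a token's importance is the largest attention any query pays to it;
      * the budget is ceil(importance * d), clipped to [min_samples, d].
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    row_norms = np.sum(W ** 2, axis=1)
    probs = row_norms / row_norms.sum()

    importance = attn.max(axis=0)  # attn[q, k]: weight query q puts on token k
    budgets = np.clip(np.ceil(importance * d).astype(int), min_samples, d)

    out = np.empty((n, W.shape[1]))
    for i in range(n):
        if budgets[i] >= d:
            out[i] = X[i] @ W  # salient token: compute exactly
        else:
            out[i] = sampled_matmul_row(X[i], W, budgets[i], probs, rng)
    return out, budgets
```

Calling `mc_value_projection(X, W, attn)` with an `(n, d)` input, a `(d, m)` weight matrix, and an `(n, n)` row-stochastic attention matrix returns the approximate encodings and the per-token sample counts. Tokens that receive little attention are encoded from only a few sampled rows of `W`, which is where the FLOPS saving would come from in this sketch.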

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Low-power high-performance VLSI design

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dropout · Position-Wise Feed-Forward Layer