TL;DR
This paper introduces FCA, a hybrid self-attention mechanism for BERT that cuts computational cost by keeping informative tokens as fine-grained units and merging uninformative tokens into coarse-grained clusters, roughly halving FLOPs with less than 1% accuracy loss.
Contribution
FCA is a novel hybrid self-attention method that adaptively shortens sequence length, improving efficiency without significant accuracy loss in BERT models.
Findings
FCA achieves 2x reduction in FLOPs on benchmarks.
FCA maintains less than 1% accuracy loss.
FCA outperforms prior methods in the accuracy-FLOPs trade-off.
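To make the FLOPs claim concrete, here is a rough sketch of how shortening the sequence at each layer affects cost. The counting formula is a common back-of-the-envelope estimate for a Transformer encoder layer (multiply-accumulates only, ignoring softmax, layer norm, and biases). The input length of 128 and the per-layer length schedule are hypothetical and are not taken from the paper.

```python
def layer_macs(n, d):
    """Approximate multiply-accumulates for one encoder layer.

    n: sequence length seen by the layer, d: hidden size.
    Assumes a feed-forward inner size of 4*d, as in BERT.
    """
    projections = 4 * n * d * d   # Q, K, V and output projections
    attention = 2 * n * n * d     # QK^T scores plus attention-weighted sum of V
    feed_forward = 8 * n * d * d  # two linear layers, d -> 4d -> d
    return projections + attention + feed_forward


def model_macs(lengths, d):
    """Total cost for a stack of layers, one sequence length per layer."""
    return sum(layer_macs(n, d) for n in lengths)


d = 768  # BERT-base hidden size
full = model_macs([128] * 12, d)  # every layer sees all 128 tokens

# Hypothetical progressive-shortening schedule (illustration only).
schedule = [128, 112, 96, 80, 64, 56, 48, 40, 32, 32, 32, 32]
shortened = model_macs(schedule, d)

print(f"estimated reduction: {full / shortened:.2f}x")
```

Only the attention term is quadratic in `n`, but shortening the sequence also shrinks the projection and feed-forward terms, which are linear in `n`. At short input lengths most of the saving comes from those linear terms.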
Abstract
Transformer-based pre-trained models, such as BERT, have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, deploying these models can be prohibitively costly, as the standard self-attention mechanism of the Transformer suffers from quadratic computational cost in the input sequence length. To confront this, we propose FCA, a fine- and coarse-granularity hybrid self-attention that reduces the computation cost through progressively shortening the computational sequence length in self-attention. Specifically, FCA conducts an attention-based scoring strategy to determine the informativeness of tokens at each layer. Then, the informative tokens serve as the fine-granularity computing units in self-attention and the uninformative tokens are replaced with one or several clusters as the coarse-granularity computing units…
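The score-then-merge step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation. The visible part of the abstract does not specify the scoring rule or the clustering, so both are stand-ins here: the score is the mean attention a token receives, and the clusters are score-weighted averages of rank-ordered groups. The names `score_tokens`, `shorten_sequence`, `keep`, and `num_clusters` are hypothetical.

```python
import numpy as np


def score_tokens(attn):
    """Informativeness per token (assumed rule): mean attention received.

    attn: array of shape (heads, n, n), each row a softmax distribution.
    Returns an array of shape (n,).
    """
    return attn.mean(axis=(0, 1))


def shorten_sequence(hidden, attn, keep, num_clusters=1):
    """Keep the `keep` highest-scoring tokens and merge the rest into clusters.

    hidden: array of shape (n, d) with token representations.
    Returns an array of shape (keep + clusters, d), or (keep, d) when no
    tokens are left to merge.
    """
    scores = score_tokens(attn)
    order = np.argsort(-scores)        # token indices, highest score first
    fine_idx = np.sort(order[:keep])   # keep original token order
    coarse_idx = order[keep:]
    fine = hidden[fine_idx]
    if coarse_idx.size == 0:
        return fine

    # Assumed clustering: split the low-scoring tokens into rank-ordered
    # groups and replace each group by its score-weighted average.
    groups = np.array_split(coarse_idx, min(num_clusters, coarse_idx.size))
    clusters = []
    for g in groups:
        weights = scores[g] / (scores[g].sum() + 1e-12)
        clusters.append(weights @ hidden[g])
    return np.vstack([fine, np.stack(clusters)])


# Toy usage: 6 tokens, 2 heads, hidden size 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 6, 6))
attn = np.exp(logits)
attn /= attn.sum(axis=-1, keepdims=True)
hidden = rng.normal(size=(6, 4))

shorter = shorten_sequence(hidden, attn, keep=3, num_clusters=2)
print(shorter.shape)
```

Applying a step like this at successive layers gives each later layer a shorter sequence, which is where the reduction in self-attention cost comes from.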
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Label Smoothing
