Fine- and Coarse-Granularity Hybrid Self-Attention for Efficient BERT

Jing Zhao; Yifan Wang; Junwei Bao; Youzheng Wu; Xiaodong He

arXiv:2203.09055·cs.CL·March 18, 2022

TL;DR

This paper introduces FCA, a hybrid self-attention mechanism for BERT that reduces computational cost by keeping informative tokens as fine-grained units and merging uninformative tokens into coarse-grained clusters at each layer, halving FLOPs with less than 1% accuracy loss.

Contribution

FCA is a novel hybrid self-attention method that progressively shortens the sequence length processed by self-attention, improving the efficiency of BERT models without significant accuracy loss.

Findings

01. FCA achieves a 2x reduction in FLOPs on benchmarks.

02. FCA maintains less than 1% accuracy loss.

03. FCA outperforms prior methods in the accuracy-FLOPs trade-off.

Abstract

Transformer-based pre-trained models, such as BERT, have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, deploying these models can be prohibitively costly, as the standard self-attention mechanism of the Transformer suffers from quadratic computational cost in the input sequence length. To confront this, we propose FCA, a fine- and coarse-granularity hybrid self-attention that reduces the computation cost through progressively shortening the computational sequence length in self-attention. Specifically, FCA conducts an attention-based scoring strategy to determine the informativeness of tokens at each layer. Then, the informative tokens serve as the fine-granularity computing units in self-attention and the uninformative tokens are replaced with one or several clusters as the coarse-granularity computing units…
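The scoring-and-merging step described in the abstract can be sketched for a single layer as follows. This is a minimal illustration, not the authors' implementation: the function name `shorten_sequence`, the scoring rule (average attention a token receives across heads and queries), and the aggregation rule (one cluster formed as a score-weighted mean) are all assumptions made for the example. The abstract only states that scoring is attention-based and that uninformative tokens become one or several clusters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shorten_sequence(hidden, attn_probs, keep):
    """Hypothetical helper: keep `keep` fine tokens, merge the rest into one cluster.

    hidden:     (n, d) token representations for one sequence
    attn_probs: (heads, n, n) attention probabilities, each row sums to 1
    keep:       number of tokens retained as fine-granularity units
    """
    # Assumed scoring rule: mean attention each token receives,
    # averaged over heads and over query positions.
    scores = attn_probs.mean(axis=(0, 1))  # shape (n,)

    # Rank tokens by score; keep the top `keep` in their original order.
    order = np.argsort(-scores, kind="stable")
    fine_idx = np.sort(order[:keep])
    coarse_idx = np.sort(order[keep:])

    fine = hidden[fine_idx]
    if coarse_idx.size == 0:
        return fine, fine_idx

    # Assumed aggregation rule: a single cluster, computed as the
    # score-weighted mean of the uninformative tokens.
    weights = scores[coarse_idx]
    weights = weights / weights.sum()
    cluster = (weights[:, None] * hidden[coarse_idx]).sum(axis=0, keepdims=True)

    # Shortened sequence: fine tokens followed by the coarse cluster.
    return np.concatenate([fine, cluster], axis=0), fine_idx


rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 4))                # 8 tokens, hidden size 4
probs = softmax(rng.normal(size=(2, 8, 8)))     # 2 heads
shortened, kept = shorten_sequence(hidden, probs, keep=3)
print(shortened.shape)  # 3 fine tokens + 1 cluster
```

In this toy setting the next layer attends over 4 positions instead of 8, so the number of query-key pairs in the attention matrix drops from 64 to 16, which is where the quadratic term in sequence length pays off.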

Code & Models

Repositories

pierre-zhao/fca-bert (TensorFlow · Official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Position-Wise Feed-Forward Layer · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Label Smoothing