TL;DR
This paper introduces ToP, a token pruning method for transformers that speeds up inference while preserving accuracy by combining ranking distillation with a coarse-to-fine pruning strategy; it is evaluated on GLUE and SQuAD.
Contribution
The paper proposes a constraint-aware, ranking-distilled token pruning approach with automatic layer selection and improved regularization, advancing transformer efficiency.
Findings
ToP reduces BERT FLOPs by 8.1x with competitive accuracy.
It achieves up to a 7.4x real latency speedup on CPU.
It outperforms existing token pruning methods on GLUE and SQuAD.
Abstract
Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a constraint-aware and ranking-distilled token pruning method ToP, which selectively removes unnecessary tokens as input sequence passes through layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to early layers of pruned models. Then, ToP introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning…
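To make the idea of ranking and removing tokens concrete, here is a minimal sketch of the conventional attention-score baseline that the abstract describes as inaccurate. It is not ToP itself: there is no ranking distillation, layer selection, or learned pruning mask. The function names, the choice to always keep position 0 (assumed to be `[CLS]`), and the toy attention values are all assumptions made for illustration.

```python
from typing import List

# Attention weights indexed as [head][query][key]; each query row sums to 1.
Attention = List[List[List[float]]]


def token_importance(attn: Attention) -> List[float]:
    """Average attention each key token receives over all heads and queries."""
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = [0.0] * seq_len
    for head in attn:
        for query_row in head:
            for key_idx, weight in enumerate(query_row):
                scores[key_idx] += weight
    return [s / (num_heads * seq_len) for s in scores]


def prune_tokens(tokens: List[str], attn: Attention, keep: int) -> List[str]:
    """Keep the `keep` highest-scoring tokens, preserving their original order.

    Assumption: position 0 holds [CLS] and is always retained.
    """
    scores = token_importance(attn)
    ranked = sorted(range(1, len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted([0] + ranked[: max(keep - 1, 0)])
    return [tokens[i] for i in kept]


# Toy example: one head in which every query attends to the keys identically.
tokens = ["[CLS]", "the", "movie", "was", "great"]
row = [0.10, 0.05, 0.35, 0.05, 0.45]
attn = [[row[:] for _ in tokens]]

print(prune_tokens(tokens, attn, keep=3))  # ['[CLS]', 'movie', 'great']
```

In a real model the surviving tokens' hidden states would be passed to the next layer, so each pruned layer shortens the sequence that later layers must process.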
Taxonomy
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Dense Connections · Adam
