Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Junyan Li; Li Lyna Zhang; Jiahang Xu; Yujing Wang; Shaoguang Yan; Yunqing Xia; Yuqing Yang; Ting Cao; Hao Sun; Weiwei Deng; Qi Zhang; Mao Yang

arXiv:2306.14393·cs.CL·June 27, 2023

TL;DR

This paper introduces ToP, a token pruning method for transformers that speeds up inference while preserving accuracy by combining ranking distillation with a coarse-to-fine pruning strategy. It is evaluated on GLUE and SQuAD.

Contribution

The paper proposes a constraint-aware, ranking-distilled token pruning approach with automatic layer selection and improved regularization, advancing transformer efficiency.

Findings

01

ToP reduces BERT FLOPs by 8.1x with competitive accuracy.

02

Achieves up to 7.4x real latency speedup on CPU.

03

Outperforms existing token pruning methods on GLUE and SQuAD.

Abstract

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose ToP, a constraint-aware and ranking-distilled token pruning method that selectively removes unnecessary tokens as the input sequence passes through the layers, allowing the model to improve online inference speed while preserving accuracy. ToP overcomes the limitation of inaccurate token importance ranking in the conventional self-attention mechanism through a ranking-distilled token distillation technique, which distills effective token rankings from the final layer of unpruned models to the early layers of pruned models. Then, ToP introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes token pruning…
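To make the starting point concrete, here is a minimal sketch of conventional attention-based token pruning, the baseline whose importance ranking the abstract describes as inaccurate. It is not ToP itself: the function names, the fixed `keep` budget, and the choice to always retain token 0 (typically `[CLS]` in BERT-style inputs) are assumptions made for illustration, whereas the paper optimizes pruning under a constraint rather than using a hand-set budget.

```python
from typing import Iterable, List


def token_importance(attn: List[List[List[float]]]) -> List[float]:
    """Score each token by the mean attention it receives.

    attn[h][i][j] is the attention weight from query token i to key token j
    in head h. The score of token j averages column j over heads and queries.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = [0.0] * seq_len
    for head in attn:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return [s / (num_heads * seq_len) for s in scores]


def prune_tokens(scores: List[float], keep: int,
                 always_keep: Iterable[int] = (0,)) -> List[int]:
    """Return the indices of retained tokens in their original order.

    `keep` is a hypothetical fixed budget; `always_keep` protects tokens
    such as [CLS] (an assumption of this sketch, not taken from the paper).
    """
    forced = {i for i in always_keep if 0 <= i < len(scores)}
    candidates = [i for i in range(len(scores)) if i not in forced]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    budget = max(keep - len(forced), 0)
    return sorted(forced | set(ranked[:budget]))


# One head, four tokens; every row is a softmax output and sums to 1.
attn = [[
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.5, 0.1, 0.3],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.6, 0.0, 0.3],
]]
scores = token_importance(attn)
kept = prune_tokens(scores, keep=3)
print(kept)  # token 2 receives the least attention and is dropped
```

Later layers then process only the retained positions, which is where the FLOPs savings come from. Per the abstract, ToP replaces the raw early-layer ranking used here with one distilled from the final layer of the unpruned model.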

Code & Models

Repositories

microsoft/moonlit
pytorch

Taxonomy

Methods

Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Layer Normalization · Attention Dropout · WordPiece · Dense Connections · Adam