Learned Token Pruning for Transformers
Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

TL;DR
This paper introduces Learned Token Pruning (LTP), a method that adaptively removes unimportant tokens in transformer models during inference, reducing computational cost while maintaining high accuracy.
Contribution
LTP is a novel threshold-based token pruning method that adaptively varies sequence length and outperforms prior methods in accuracy and efficiency on GLUE tasks.
Findings
LTP achieves up to ~2.5% higher accuracy than prior token pruning methods at the same FLOPs.
LTP reduces FLOPs by up to 2.1x with less than 1% accuracy loss.
LTP improves throughput by up to 1.9x on CPUs and 2.0x on GPUs.
Abstract
Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold value which is learned for each layer during training. Our threshold-based method allows the length of the pruned sequence to vary adaptively based on the input sequence, and avoids algorithmically expensive operations such as top-k token selection. We extensively test the performance of LTP on GLUE tasks and show that our method outperforms the prior state-of-the-art token pruning methods by up to ~2.5% higher accuracy with the same amount of FLOPs. In particular, LTP achieves up to 2.1x FLOPs reduction…
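The threshold rule described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a token's "attention score" is the attention probability it receives, averaged over heads and query positions, and it uses a fixed threshold where LTP would use a value learned per layer. The names token_importance and prune_tokens are hypothetical.

```python
import numpy as np

def token_importance(attn):
    """Score each token by the attention it receives.

    attn: array of shape (heads, seq, seq) holding attention probabilities,
    where each row (query) sums to 1 over the last axis (keys).
    Assumption: the score is the mean over heads and query positions.
    """
    return attn.mean(axis=(0, 1))  # shape (seq,)

def prune_tokens(hidden, attn, threshold):
    """Keep only tokens whose score is at or above the threshold.

    hidden: array of shape (seq, dim) with per-token hidden states.
    Returns the surviving hidden states and the boolean keep mask.
    No sorting or top-k selection is needed, only a comparison.
    """
    scores = token_importance(attn)
    keep = scores >= threshold
    return hidden[keep], keep

# Toy input: random attention logits turned into probabilities with softmax.
rng = np.random.default_rng(0)
heads, seq, dim = 2, 6, 4
logits = rng.normal(size=(heads, seq, seq))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
hidden = rng.normal(size=(seq, dim))

# Illustrative fixed threshold (hypothetical); in LTP it is learned per layer.
pruned, keep = prune_tokens(hidden, attn, threshold=1.0 / seq)
print(keep.sum(), "of", seq, "tokens kept")
```

Because the comparison is applied per input, the number of surviving tokens depends on the attention pattern of each sequence rather than on a fixed budget, which is the adaptive behavior the abstract contrasts with top-k selection.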
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies
Methods: Pruning
