FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing
Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum

TL;DR
This paper introduces FTP, a token-wise pruning method for large language models that uses a learnable router to selectively skip less important tokens, significantly reducing inference costs while maintaining high accuracy.
Contribution
The paper proposes a novel fine-grained token-wise pruning approach with a learnable router and a search-based sparsity scheduler, achieving state-of-the-art results in LLM pruning.
Findings
Outperforms existing pruning methods like BlockPruner and ShortGPT.
Achieves approximately 10 points higher accuracy retention at similar sparsity levels.
Demonstrates effectiveness across various benchmarks and LLMs.
Abstract
Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders the deployment in industrial applications. Many works leverage traditional compression approaches to boost model inference, but these always introduce additional training costs to restore the performance and the pruning results typically show noticeable performance drops compared to the original model when aiming for a specific level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a…
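The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear scoring rule, the names `route_tokens` and `pruned_block_forward`, the fixed `router_w` weights, and the per-token stand-in block are all assumptions made for illustration. In the paper the router is learned and the per-block sparsity comes from a search-based scheduler; here both are simply passed in. A real transformer block also mixes tokens through attention, which this toy block does not.

```python
from typing import Callable, List

Token = List[float]


def route_tokens(hidden: List[Token], router_w: List[float], sparsity: float) -> List[int]:
    """Return the indices of the tokens to keep, chosen by top-k router score.

    Assumption: the router is a single linear scorer (dot product with router_w).
    """
    scores = [sum(h * w for h, w in zip(tok, router_w)) for tok in hidden]
    # Keep at least one token so the block always has something to process.
    n_keep = max(1, round(len(hidden) * (1.0 - sparsity)))
    ranked = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def pruned_block_forward(
    hidden: List[Token],
    router_w: List[float],
    sparsity: float,
    block_fn: Callable[[Token], Token],
) -> List[Token]:
    """Apply block_fn only to routed tokens; skipped tokens pass through unchanged."""
    keep = set(route_tokens(hidden, router_w, sparsity))
    return [block_fn(tok) if i in keep else list(tok) for i, tok in enumerate(hidden)]


# Hypothetical usage: four tokens, hidden size 2, half of the tokens skipped.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
toy_block = lambda tok: [2.0 * x for x in tok]  # stand-in for a transformer block
output = pruned_block_forward(hidden_states, [1.0, 0.0], 0.5, toy_block)
```

With these toy inputs the router keeps tokens 0 and 2, which have the highest scores, so only those two are transformed by the block. The compute saving comes from calling the block on fewer tokens, while the skipped tokens keep their previous hidden states.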
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Pruning
