FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

Zekai Li; Jintu Zheng; Ji Liu; Han Liu; Haowei Zhu; Zeping Li; Fuwei Yang; Haiduo Huang; Jinzhang Peng; Dong Li; Lu Tian; Emad Barsoum

arXiv:2412.11494·cs.CL·December 17, 2024

TL;DR

This paper introduces FTP, a token-wise pruning method for large language models that uses a learnable router to selectively skip less important tokens, significantly reducing inference costs while maintaining high accuracy.

Contribution

The paper proposes a novel fine-grained token-wise pruning approach with a learnable router and a search-based sparsity scheduler, achieving state-of-the-art results in LLM pruning.

Findings

01 Outperforms existing pruning methods like BlockPruner and ShortGPT.

02 Achieves approximately 10 points higher accuracy retention at similar sparsity levels.

03 Demonstrates effectiveness across various benchmarks and LLMs.

Abstract

Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders the deployment in industrial applications. Many works leverage traditional compression approaches to boost model inference, but these always introduce additional training costs to restore the performance and the pruning results typically show noticeable performance drops compared to the original model when aiming for a specific level of acceleration. To address these issues, we propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens and skip them across model blocks to reduce computational cost during inference. To construct the router efficiently, we present a…
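The token-skipping idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the linear sigmoid router, the top-k selection rule, and the names `router_scores`, `select_tokens`, `routed_block`, and `toy_block` are all assumptions made here for illustration, and the router training and sparsity scheduler are not modeled. The sketch only shows how a per-token score could decide which tokens a block processes while the remaining tokens bypass the block unchanged.

```python
import math


def router_scores(tokens, weights, bias=0.0):
    """Hypothetical linear router: one sigmoid importance score per token."""
    scores = []
    for tok in tokens:
        logit = sum(t * w for t, w in zip(tok, weights)) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


def select_tokens(scores, sparsity):
    """Return the sorted indices of the tokens kept at the given sparsity.

    sparsity is the fraction of tokens to skip (assumed to lie in [0, 1)).
    At least one token is always kept.
    """
    n = len(scores)
    n_keep = max(1, n - int(sparsity * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def routed_block(tokens, block_fn, weights, sparsity):
    """Apply block_fn only to the tokens the router keeps.

    Skipped tokens are copied through unchanged, which mimics a token
    bypassing the block along the residual path.
    """
    scores = router_scores(tokens, weights)
    keep = select_tokens(scores, sparsity)
    processed = block_fn([tokens[i] for i in keep])
    out = [list(t) for t in tokens]
    for i, new in zip(keep, processed):
        out[i] = new
    return out, keep


def toy_block(xs):
    """Stand-in for a transformer block: doubles every feature."""
    return [[2.0 * v for v in x] for x in xs]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 0.5], [3.0, 0.0], [0.0, -2.0]]
    out, keep = routed_block(tokens, toy_block, [1.0, 1.0], sparsity=0.5)
    print("kept token indices:", keep)
    print("block output:", out)
```

With a sparsity of 0.5 the block runs on half of the tokens, so its cost in this toy setting scales with the number of kept tokens rather than the full sequence length. A real router would be trained and its sparsity chosen per block; both are outside the scope of this sketch.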

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Pruning