Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference
Rei Taniguchi, Yuyang Dong, Makoto Onizuka, and Chuan Xiao

TL;DR
This paper introduces ASL, a training-free method that adaptively selects the layer at which tokens are pruned during LLM inference, improving the trade-off between accuracy and speed across tasks.
Contribution
ASL adaptively chooses the layer at which tokens are pruned, based on the variance of token ranks ordered by attention score, which makes layer-wise pruning more flexible without any additional training.
Findings
ASL outperforms existing methods on multiple benchmarks.
It balances inference speed and accuracy effectively.
ASL is compatible with existing KV cache reduction techniques.
Abstract
Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received considerable attention. Among the numerous works proposed in recent years, layer-wise token pruning approaches, which select a subset of tokens at particular layers to retain in the KV cache and prune the others, are among the most popular schemes. They primarily adopt a set of pre-defined layers at which tokens are selected. Such a design is inflexible in the sense that accuracy varies significantly across tasks and deteriorates in harder tasks such as KV retrieval. In this paper, we propose ASL, a training-free method that adaptively chooses the selection layer for KV cache reduction, exploiting the variance of token ranks ordered by attention score. The proposed method balances the performance across different tasks while meeting the user-specified KV budget…
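To make the idea of adaptive layer selection concrete, here is a minimal sketch in Python. It is not the authors' algorithm, whose details the abstract does not give. It assumes a hypothetical statistic (the variance of each token's rank across attention heads, averaged over tokens), a hypothetical threshold rule (prune at the first layer where that statistic falls to or below `threshold`), and a simple top-`budget` selection. All function names and parameters are illustrative.

```python
import numpy as np


def token_ranks(scores: np.ndarray) -> np.ndarray:
    """Rank tokens by attention score within each head (rank 0 = highest).

    scores: array of shape (num_heads, num_tokens).
    """
    order = np.argsort(-scores, axis=-1, kind="stable")
    # The argsort of a permutation is its inverse, i.e. each token's rank.
    return np.argsort(order, axis=-1, kind="stable")


def rank_variance(scores: np.ndarray) -> float:
    """Assumed statistic: per-token variance of ranks across heads, averaged."""
    return float(token_ranks(scores).var(axis=0).mean())


def choose_selection_layer(per_layer_scores, threshold: float) -> int:
    """Assumed rule: pick the first layer whose rank variance is <= threshold.

    Falls back to the last layer if no layer qualifies.
    """
    for layer, scores in enumerate(per_layer_scores):
        if rank_variance(scores) <= threshold:
            return layer
    return len(per_layer_scores) - 1


def select_tokens(scores: np.ndarray, budget: int) -> np.ndarray:
    """Keep the `budget` tokens with the highest head-averaged score,
    returned in their original position order."""
    mean_scores = scores.mean(axis=0)
    keep = np.argsort(-mean_scores, kind="stable")[:budget]
    return np.sort(keep)
```

In this sketch, a layer where the heads disagree about which tokens matter yields a high rank variance and is skipped, while a layer where the rankings have stabilized is chosen as the selection layer. `budget` stands in for the user-specified KV budget mentioned in the abstract.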
