AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference
Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu

TL;DR
AdaSkip introduces an adaptive sublayer skipping technique that dynamically identifies and skips less important layers in long-context LLM inference, significantly reducing computational costs during both prefilling and decoding phases.
Contribution
The paper presents AdaSkip, a novel adaptive method for sublayer skipping tailored for long-context inference, addressing limitations of previous strategies by leveraging similarity information for dynamic layer importance assessment.
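As a rough illustration of similarity-based importance assessment, the sketch below scores each sublayer by the cosine similarity between its input and output hidden states and marks the most similar ones as skip candidates. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the `num_skip` budget, and the rule "higher input/output similarity means less important" are all hypothetical choices made for illustration.

```python
import numpy as np


def mean_cosine_similarity(x, y):
    """Mean cosine similarity between matching token vectors (rows) of x and y."""
    num = np.sum(x * y, axis=-1)
    den = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8
    return float(np.mean(num / den))


def score_sublayers(io_pairs):
    """Score each sublayer by input/output similarity.

    io_pairs maps a (hypothetical) sublayer name such as "attn_0" or "ffn_0"
    to a pair (input, output) of arrays with shape (tokens, hidden).
    Assumption: the output already includes the residual connection, so a
    sublayer that barely changes its input scores close to 1.
    """
    return {name: mean_cosine_similarity(x, y) for name, (x, y) in io_pairs.items()}


def choose_skips(io_pairs, num_skip):
    """Return the names of the num_skip sublayers with the highest similarity."""
    scores = score_sublayers(io_pairs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:num_skip])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 16))
    pairs = {
        "attn_0": (x, x + 0.01 * rng.standard_normal((8, 16))),  # nearly unchanged
        "ffn_0": (x, rng.standard_normal((8, 16))),              # heavily changed
    }
    print(choose_skips(pairs, num_skip=1))
```

Because the scores are computed from hidden states observed at inference time, the skip set in this sketch can differ per model and per input, which is the sense in which such a selection is adaptive.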
Findings
Achieves faster inference times on various long-context benchmarks.
Outperforms existing layer-skipping baselines in accuracy and speed.
Effectively reduces computational costs during both prefilling and decoding phases.
Abstract
Long-context large language model (LLM) inference is increasingly critical, motivating a number of studies devoted to alleviating the substantial storage and computational costs in such scenarios. Layer-wise skipping methods are promising optimizations but are rarely explored in long-context inference. We observe that existing layer-wise skipping strategies have several limitations when applied to long-context inference, including an inability to adapt to model and context variability, disregard for sublayer significance, and inapplicability to the prefilling phase. This paper proposes AdaSkip, an adaptive sublayer skipping method specifically designed for long-context inference. AdaSkip adaptively identifies less important layers by leveraging on-the-fly similarity information, enables sublayer-wise skipping, and accelerates both the prefilling and decoding phases. The effectiveness…
Taxonomy
Topics: Natural Language Processing Techniques
