POP: Prefill-Only Pruning for Efficient Large Model Inference
Junhui He, Zhihui Fu, Jun Wang, Qingan Li

TL;DR
This paper introduces Prefill-Only Pruning (POP), a stage-aware inference method that selectively prunes deep layers during the prefill stage of large models, significantly improving efficiency with minimal accuracy loss.
Contribution
The paper proposes a novel stage-aware pruning strategy that distinguishes between prefill and decode stages, enabling more efficient large model inference without substantial accuracy degradation.
Findings
POP achieves up to 1.37× speedup in prefill latency.
POP keeps accuracy loss minimal by retaining the full model for the decode stage.
The method is effective across diverse modalities and large models.
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable…
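The stage-aware idea in the abstract can be illustrated with a toy sketch: run only the leading layers during prefill and the full stack during decode. This is not the authors' implementation. The class `StageAwareStack`, the parameter `prefill_depth`, and the additive toy layers are hypothetical names chosen for illustration. The sketch also ignores how a real transformer would supply key/value cache entries for the layers skipped during prefill, which the truncated abstract does not describe here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A "layer" here is just a function on a list of floats (a stand-in for hidden states).
Layer = Callable[[List[float]], List[float]]


@dataclass
class StageAwareStack:
    """Toy layer stack that uses fewer layers in prefill than in decode (hypothetical)."""
    layers: List[Layer]
    prefill_depth: int  # assumption: number of leading layers kept during prefill
    trace: List[Tuple[str, int]] = field(default_factory=list)

    def forward(self, hidden: List[float], stage: str) -> List[float]:
        if stage not in ("prefill", "decode"):
            raise ValueError(f"unknown stage: {stage!r}")
        # Prefill omits the deep layers; decode always runs the full stack.
        depth = self.prefill_depth if stage == "prefill" else len(self.layers)
        for layer in self.layers[:depth]:
            hidden = layer(hidden)
        self.trace.append((stage, depth))  # record how many layers actually ran
        return hidden


def make_toy_layer() -> Layer:
    # Each toy layer adds 1.0, so the output value equals the number of layers applied.
    return lambda h: [x + 1.0 for x in h]


stack = StageAwareStack([make_toy_layer() for _ in range(8)], prefill_depth=6)
prompt_state = stack.forward([0.0, 0.0], "prefill")  # 6 of 8 layers run
step_state = stack.forward([0.0], "decode")          # all 8 layers run
```

Because each toy layer adds 1.0, the output values directly show how many layers ran in each stage. In this simplified picture, prefill compute scales with `prefill_depth` rather than with the full depth, which is where the latency saving would come from.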
