POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models
Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim

TL;DR
POP introduces an online, context-aware structural pruning method for large foundation models that dynamically adjusts sparsity during inference, improving accuracy and efficiency without retraining.
Contribution
The paper proposes an online pruning framework that partitions model channels into retained, candidate, and pruned regions and generates masks dynamically during inference, enabling context-conditioned sparsity without offline calibration.
Findings
Outperforms existing pruning methods in accuracy across diverse models.
Reduces computational overhead and inference latency.
Applicable to LLMs, MoEs, and VLMs.
Abstract
Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge during autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play…
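The two-stage idea in the abstract (a coarse partition fixed at prefill, then a per-token mask computed only inside the candidate region) can be illustrated with a minimal sketch. This is not the authors' implementation: the importance score (mean absolute activation), the function names coarse_partition and decode_mask, and the parameters keep_frac, cand_frac, and cand_keep are all assumptions made for illustration, since the excerpt does not say how channels are scored or how the regions are sized.

```python
import numpy as np

# Region labels for each channel (hypothetical encoding).
RETAINED, CANDIDATE, PRUNED = 0, 1, 2


def coarse_partition(prefill_acts, keep_frac=0.4, cand_frac=0.3):
    """Assign every channel to a region once, using prefill activations.

    prefill_acts: array of shape (tokens, channels).
    The score below (mean |activation|) is an assumed stand-in for
    whatever importance metric the paper actually uses.
    """
    score = np.abs(prefill_acts).mean(axis=0)
    order = np.argsort(-score, kind="stable")  # most important first
    n = score.size
    n_keep = int(round(keep_frac * n))
    n_cand = int(round(cand_frac * n))

    region = np.full(n, PRUNED, dtype=np.int64)
    region[order[:n_keep]] = RETAINED
    region[order[n_keep:n_keep + n_cand]] = CANDIDATE
    return region


def decode_mask(region, token_act, cand_keep):
    """Build a per-token boolean channel mask during decoding.

    Retained channels are always on and pruned channels always off;
    only the candidate channels are re-scored for the current token,
    which avoids re-evaluating the full channel set.
    """
    mask = region == RETAINED
    cand_idx = np.flatnonzero(region == CANDIDATE)
    k = min(cand_keep, cand_idx.size)
    if k > 0:
        local = np.argsort(-np.abs(token_act[cand_idx]), kind="stable")[:k]
        mask[cand_idx[local]] = True
    return mask
```

In this sketch, coarse_partition would run once after prefilling, and decode_mask once per generated token. A channel in the pruned region stays off even when its current activation is large, which is the cost of restricting the per-token search to the candidate region.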
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Topic Modeling · Advanced Neural Network Applications
