TL;DR
This paper introduces GISP, a global structured pruning method for large language models that improves efficiency and downstream task performance without intermediate fine-tuning.
Contribution
The paper presents GISP, a novel iterative, importance-based global pruning approach that stabilizes accuracy at high sparsity and supports task-specific objectives.
Findings
GISP reduces perplexity on WikiText-2 across multiple LLMs.
GISP improves downstream accuracy, especially at 40-50% sparsity.
Task-specific calibration enhances accuracy on decision tasks.
Abstract
Structured pruning is a practical approach to deploying large language models (LLMs) efficiently, as it yields compact, hardware-friendly architectures. However, the dominant local paradigm is task-agnostic: by optimizing layer-wise reconstruction rather than task objectives, it tends to preserve perplexity or generic zero-shot behavior but fails to capitalize on modest task-specific calibration signals, often yielding limited downstream gains. We revisit global structured pruning and present GISP, Global Iterative Structured Pruning, a post-training method that removes attention heads and MLP channels using first-order, loss-based importance scores aggregated at the structure level with block-wise normalization. Built on this global importance metric, GISP adopts an iterative schedule, rather than one-shot pruning, stabilizes accuracy at higher sparsity, and mitigates perplexity…
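The scoring and scheduling ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the exact score formula, the normalization, and the schedule are not given in the text above, so the sketch assumes a Taylor-style score |sum(w * g)| per structure, L2 normalization within each block, and a linear sparsity ramp. All function names and the `score_fn` callback are hypothetical.

```python
import math
from typing import Callable, Dict, List, Set, Tuple

Structure = Tuple[str, int]  # (block name, index of a head or channel in that block)


def structure_scores(weights: List[List[float]], grads: List[List[float]]) -> List[float]:
    """Assumed first-order score per structure: |sum_i w_i * g_i|."""
    return [abs(sum(w * g for w, g in zip(ws, gs))) for ws, gs in zip(weights, grads)]


def normalize_block(scores: List[float]) -> List[float]:
    """Assumed block-wise normalization: divide by the block's L2 norm."""
    norm = math.sqrt(sum(s * s for s in scores))
    if norm == 0.0:
        return [0.0 for _ in scores]
    return [s / norm for s in scores]


def iterative_global_prune(
    score_fn: Callable[[Set[Structure]], Dict[str, List[float]]],
    target_sparsity: float,
    steps: int,
) -> Set[Structure]:
    """Prune toward target_sparsity in `steps` rounds, re-scoring each round.

    score_fn(pruned) returns raw scores for every structure in every block,
    computed with the currently pruned structures removed. In a real model
    this would be a forward/backward pass on calibration data.
    """
    pruned: Set[Structure] = set()
    total = sum(len(scores) for scores in score_fn(pruned).values())
    for step in range(1, steps + 1):
        # Linear ramp: cumulative number of structures removed after this round.
        n_target = int(round(target_sparsity * total * step / steps))
        raw = score_fn(pruned)
        ranked = []
        for block, scores in raw.items():
            alive = [i for i in range(len(scores)) if (block, i) not in pruned]
            normed = normalize_block([scores[i] for i in alive])
            ranked.extend((s, block, i) for s, i in zip(normed, alive))
        ranked.sort()  # one global ranking across all blocks, lowest score first
        for _, block, i in ranked[: max(n_target - len(pruned), 0)]:
            pruned.add((block, i))
    return pruned
```

In this sketch the ranking is global: structures from all blocks compete in one sorted list, and the normalization keeps a block with large raw magnitudes from dominating that list. The sketch does not guard against emptying a block entirely, which a practical implementation would likely need to handle.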
