GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs
Amirmohsen Sattarifard, Sepehr Lavasani, Kunlin Zhang, Amirhossein Rajabpour, Hanlin Xu, Fengyu Sun, Negar Hassanpour, Chao Gao

TL;DR
GLASS is a training-free framework that improves inference-time sparsification of LLMs by combining a local, prompt-specific signal with a global, model-intrinsic prior, yielding more accurate neuron pruning, especially for short prompts.
Contribution
It introduces a novel global-local aggregation method for stable FFN pruning, enhancing performance in short-prompt, long-generation scenarios without additional training.
Findings
Achieves up to 45.10% lower perplexity
Reduces KL divergence by 25.73%
Provides significant on-device decoding speedup
Abstract
Inference-time sparsification is a promising path to deploy large language models (LLMs) on resource-constrained devices, yet existing training-free methods typically estimate feedforward network (FFN) neuron importance from the input prompt alone. We show this prompt-only signal is often unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. We propose GLASS, a plug-and-play, training-free framework that stabilizes dynamic FFN pruning by aggregating two complementary views of neuron criticality: local prompt-specific activations and a global model-intrinsic prior. GLASS fuses global and local signals via rank aggregation, yielding robust critical-neuron selection even when the prompt is short. We interpret GLASS as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model, providing a…
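To make the idea of fusing global and local signals by rank aggregation concrete, here is a minimal sketch. It uses a Borda-style weighted rank average, which is one common rank-aggregation rule; the abstract does not state the exact rule GLASS uses, so the function names (`ranks_desc`, `aggregate_mask`), the mixing weight `lam`, and the toy scores are all assumptions made for illustration only.

```python
# Illustrative sketch only: a weighted rank average over two importance views.
# Not the authors' implementation; names and the weight `lam` are hypothetical.

def ranks_desc(scores):
    """Return each neuron's rank, where rank 0 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def aggregate_mask(local_scores, global_scores, keep, lam=0.5):
    """Keep the `keep` neurons with the best fused rank.

    local_scores:  per-neuron importance estimated from the prompt (assumed given)
    global_scores: per-neuron importance from a model-level prior (assumed given)
    lam:           weight on the local ranking; 1.0 means prompt-only selection
    """
    local_r = ranks_desc(local_scores)
    global_r = ranks_desc(global_scores)
    fused = [lam * l + (1 - lam) * g for l, g in zip(local_r, global_r)]
    # Python's sort is stable, so ties are broken by neuron index.
    kept = set(sorted(range(len(fused)), key=lambda i: fused[i])[:keep])
    return [i in kept for i in range(len(fused))]


# Toy example with six FFN neurons.
local_scores = [0.9, 0.1, 0.5, 0.05, 0.7, 0.2]
global_scores = [0.2, 0.8, 0.6, 0.1, 0.9, 0.3]
mask = aggregate_mask(local_scores, global_scores, keep=3)
```

Working on ranks rather than raw scores means the two views do not need to share a scale. In this sketch, setting `lam=1.0` recovers prompt-only selection, the baseline behavior the abstract describes as unreliable for short prompts.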
