GLASS: Global-Local Aggregation for Inference-time Sparsification of LLMs

Amirmohsen Sattarifard; Sepehr Lavasani; Kunlin Zhang; Amirhossein Rajabpour; Hanlin Xu; Fengyu Sun; Negar Hassanpour; Chao Gao

arXiv:2508.14302·cs.LG·May 14, 2026

TL;DR

GLASS is a training-free framework that improves inference-time sparsification of LLMs by combining local prompt-specific and global model priors for more accurate neuron pruning, especially with short prompts.

Contribution

GLASS introduces a global-local aggregation method that stabilizes FFN pruning, improving performance in short-prompt, long-generation scenarios without additional training.

Findings

01. Achieves up to 45.10% lower perplexity

02. Reduces KL divergence by 25.73%

03. Provides significant on-device decoding speedup

Abstract

Inference-time sparsification is a promising path to deploy large language models (LLMs) on resource-constrained devices, yet existing training-free methods typically estimate feedforward network (FFN) neuron importance from the input prompt alone. We show this prompt-only signal is often unreliable, especially for short prompts and long-form decoding, leading to inaccurate masks and degraded generation fidelity. We propose GLASS, a plug-and-play, training-free framework that stabilizes dynamic FFN pruning by aggregating two complementary views of neuron criticality: local prompt-specific activations and a global model-intrinsic prior. GLASS fuses global and local signals via rank aggregation, yielding robust critical-neuron selection even when the prompt is short. We interpret GLASS as the maximum-a-posteriori consensus ranking under a permutation-based probabilistic model, providing a…
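
To make the global-local fusion concrete, here is a minimal sketch of rank aggregation over FFN neurons. It is not the paper's implementation: the abstract does not spell out the aggregation rule, so the weighted sum of ranks (a Borda-style rule), the `weight` parameter, the function names, and the toy scores are all assumptions chosen for illustration.

```python
def ranks(scores):
    """Map importance scores to ranks, where rank 0 is the most important neuron."""
    # Sort indices by descending score; ties are broken by lower index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    result = [0] * len(scores)
    for position, neuron in enumerate(order):
        result[neuron] = position
    return result


def aggregate_mask(local_scores, global_scores, keep, weight=0.5):
    """Return a boolean keep-mask over neurons from fused local and global ranks.

    Hypothetical helper: `weight` is the share given to the global prior
    (0.0 = prompt-only ranking, 1.0 = global-only ranking).
    """
    if len(local_scores) != len(global_scores):
        raise ValueError("local and global scores must cover the same neurons")
    local_r = ranks(local_scores)
    global_r = ranks(global_scores)
    # Assumed fusion rule: weighted sum of the two rank positions.
    fused = [(1 - weight) * l + weight * g for l, g in zip(local_r, global_r)]
    # Keep the `keep` neurons with the smallest fused rank.
    kept = sorted(range(len(fused)), key=lambda i: (fused[i], i))[:keep]
    mask = [False] * len(fused)
    for neuron in kept:
        mask[neuron] = True
    return mask


# Toy example: five neurons, made-up scores.
local = [0.9, 0.1, 0.5, 0.05, 0.3]    # e.g. activation magnitudes on a short prompt
global_ = [0.2, 0.8, 0.6, 0.1, 0.7]   # e.g. a prompt-independent importance prior

fused_mask = aggregate_mask(local, global_, keep=2, weight=0.25)
global_only_mask = aggregate_mask(local, global_, keep=2, weight=1.0)
```

Working with ranks rather than raw scores means the two signals do not need to share a scale, which is one plausible reason to fuse them this way. In this sketch the retained set moves from the prompt's top neurons toward the prior's top neurons as `weight` increases.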
