WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity

Lei Chen; Yuan Meng; Xiaoyu Zhan; Zhi Wang; Wenwu Zhu

arXiv:2602.14452·cs.LG·February 17, 2026

Open Access · 3 Reviews

TL;DR

WiSparse introduces a weight-aware, mixed-granularity activation sparsity method that significantly improves inference efficiency in large language models without retraining, maintaining high accuracy at high sparsity levels.

Contribution

This paper presents WiSparse, a novel training-free sparsity approach that combines activation and weight information with adaptive allocation to enhance LLM inference efficiency.

Findings

01

At 50% sparsity, WiSparse retains 97% of Llama3.1's dense performance.

02

WiSparse achieves a 21.4% inference speedup over baseline methods.

03

It surpasses existing activation sparsity techniques in accuracy preservation.

Abstract

Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with…
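The first phenomenon in the abstract (a small activation paired with an important weight column) can be illustrated with a minimal sketch. The abstract is truncated before it gives the actual scoring rule, so the score used here, `|x_j| * ||W[:, j]||_2 ** alpha`, is an assumption loosely based on Reviewer 02's remark about an L2-norm base with an exponent; the function names, the single global `alpha`, and the toy numbers are all hypothetical and not taken from the paper.

```python
import math

def column_l2_norms(weight):
    """L2 norm of each input column of a weight matrix stored as rows (out x in)."""
    n_in = len(weight[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weight)) for j in range(n_in)]

def weight_aware_mask(x, weight, sparsity, alpha=1.0):
    """Return a 0/1 mask keeping the input channels with the largest score.

    Assumed score: |x_j| * ||W[:, j]||_2 ** alpha (hypothetical form).
    alpha = 0 reduces to activation-only magnitude selection.
    """
    norms = column_l2_norms(weight)
    scores = [abs(xj) * (nj ** alpha) for xj, nj in zip(x, norms)]
    n_keep = len(x) - int(sparsity * len(x))
    ranked = sorted(range(len(x)), key=lambda j: scores[j], reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if j in keep else 0 for j in range(len(x))]

def sparse_matvec(x, weight, mask):
    """Compute W @ x using only the input channels kept by the mask."""
    return [sum(w * xj for w, xj, m in zip(row, x, mask) if m) for row in weight]

# Toy data: channel 0 has a small activation but a very large weight column.
x = [0.1, 1.0, 0.5, 0.05]
W = [[10.0, 0.1, 1.0, 0.1],
     [10.0, 0.1, 1.0, 0.1]]

mask_weight_aware = weight_aware_mask(x, W, sparsity=0.5, alpha=1.0)
mask_activation_only = weight_aware_mask(x, W, sparsity=0.5, alpha=0.0)
```

On this toy input the activation-only mask drops channel 0, while the weight-aware mask keeps it, so the weight-aware sparse output stays closer to the dense product. This only demonstrates the general idea; how WiSparse sets the exponent per layer and allocates sparsity across blocks is not shown here.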

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Clearly motivated problem: shows empirical evidence that activation-only criteria can prune channels with small activations but very large weight columns, and that block-wise sparsity sensitivity is highly non-uniform.
- Mixed-granularity sparsity allocation (block-level evolutionary search + layer-level greedy search) is reasonable.
- Comprehensive empirical evaluation on three different 7–8B LLMs (Llama-3.1, Mistral, Qwen2.5) across multiple benchmarks, with consistent gains over strong trai

Weaknesses

- Conceptual novelty is somewhat limited relative to prior weight-aware sparsity (e.g., WINA) and activation-based methods (TEAL/R-Sparse).
- The calibration and search pipeline appears non-trivial, but the paper does not quantify its wall-clock overhead or resource requirements.
- Experiments are restricted to ~7–8B models and a single hardware setup; it is unclear how well WiSparse scales to larger models (e.g., 30B+) or different batch sizes.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The two insights are reasonable and motivate the method design well.
2. WiSparse adopts a more fine-grained sparsity design within the weight-activation-based sparsity paradigm, which makes it more robust.
3. The paper is well-written and easy to follow.

Weaknesses

1. The paper lacks significant novelty compared to WINA and appears to be an incremental improvement over it.
2. The experimental comparison is insufficient; WINA should be included as an important baseline.
3. Although the authors claim that the static norm is inadequate, WiSparse still uses the L2 norm as the base, and the only difference is an exponent $\alpha_i$. Are there any insights about $\alpha_i$ across different layers?

Reviewer 03 · Rating 4 · Confidence 5

Strengths

- The paper is well written, well organized, and technically sound.
- The mixed-granularity allocation is a reasonable way to obtain additional performance gains.

Weaknesses

- Lack of proper discussion. The weight-aware activation sparsity (Eq. 4, Sec. 4.2) is the same as the one proposed by WINA. Although WiSparse discusses WINA in the related work, Sec. 4.2 should also refer to it to clarify the real contributions of this work.
- Lack of numerical comparison. A direct numerical comparison with WINA is recommended to show the gain from mixed-granularity allocation.
- Lack of discussion with more pruning works regarding block sp

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Natural Language Processing Techniques