WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Pashmina Cameron, Tianyi Chen

TL;DR
WINA introduces a training-free sparse activation method for large language models that combines hidden state magnitudes and weight norms, achieving tighter approximation bounds and outperforming existing methods in inference efficiency.
Contribution
WINA presents a novel, simple, training-free sparse activation framework that improves approximation accuracy and inference performance for large language models by leveraging weight and activation information.
Findings
WINA outperforms state-of-the-art methods like TEAL by up to 2.94% in average performance.
WINA achieves tighter theoretical approximation error bounds than existing techniques.
Empirical results demonstrate WINA's effectiveness across diverse LLM architectures and datasets.
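A hedged sketch of why weight norms can tighten the error bound, for a single linear layer only. The notation ($W$ with columns $w_i$, input $x$, kept index set $S$ of size $K$, binary mask $m$ with $m_i = 1$ exactly when $i \in S$) is assumed here for illustration and is not taken from the paper. The orthogonal-columns condition mirrors the assumption mentioned in the peer reviews.

$$
Wx - W(m \odot x) = \sum_{i \notin S} x_i w_i
\quad\Longrightarrow\quad
\lVert Wx - W(m \odot x) \rVert_2^2 = \sum_{i \notin S} x_i^2 \, \lVert w_i \rVert_2^2
\quad \text{when } w_i^\top w_j = 0 \text{ for all } i \neq j .
$$

Under that assumption the error is smallest when $S$ holds the $K$ largest values of $|x_i| \, \lVert w_i \rVert_2$. Ranking by $|x_i|$ alone selects the same set only when the two rankings agree, for example when all column norms are equal.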
Abstract
The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal…
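The selection rule described in the abstract can be sketched on a toy layer. This is a minimal illustration, not the authors' implementation: it assumes the two signals are combined as the product of each input's magnitude and its weight column's L2 norm, followed by top-k selection. The helper names (`column_l2_norms`, `top_k_mask`, `gated_error`) and the 3x3 matrix are hypothetical.

```python
import math

def column_l2_norms(W):
    """L2 norm of each column of W, where W is a list of rows."""
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(len(W[0]))]

def top_k_mask(scores, k):
    """Binary mask keeping the k largest scores (ties go to the lower index)."""
    keep = set(sorted(range(len(scores)), key=lambda i: -scores[i])[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(scores))]

def matvec(W, x):
    """Plain dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_error(W, x, mask):
    """L2 distance between W @ x and W @ (mask * x)."""
    dense = matvec(W, x)
    sparse = matvec(W, [m * xi for m, xi in zip(mask, x)])
    return math.sqrt(sum((d - s) ** 2 for d, s in zip(dense, sparse)))

# Toy layer (hypothetical values): column 0 has a small norm, column 2 a large one.
W = [[0.1, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 5.0]]
x = [3.0, 2.0, 1.0]
k = 2

# Magnitude-only gating: rank inputs by |x_i|.
magnitude_mask = top_k_mask([abs(xi) for xi in x], k)

# Weight-informed gating (assumed form): rank inputs by |x_i| * ||W[:, i]||_2.
norms = column_l2_norms(W)
weight_informed_mask = top_k_mask([abs(xi) * c for xi, c in zip(x, norms)], k)

magnitude_error = gated_error(W, x, magnitude_mask)
weight_informed_error = gated_error(W, x, weight_informed_mask)
```

In this toy case the magnitude-only rule drops the input feeding the largest-norm column, while the weight-informed rule drops the input feeding the smallest-norm column, so its output error is lower. The example is constructed to show the effect and says nothing about how often it occurs in real models.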
Peer Reviews
Decision: ICLR 2026 Poster
- Very simple, plug-and-play rule that is easy to implement on top of existing sparse-activation baselines.
- The paper reports GFLOP reductions but does not clearly explain whether WINA’s gating is used to avoid weight loads or only to mask post-matmul activations; without a truly sparse kernel and latency measurements, it is unclear how much real speedup WINA provides over TEAL/CATS in memory-bound, batch-1 inference.
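The distinction raised in the review point above, between masking activations and skipping weight loads, can be illustrated with a small sketch. Everything here is hypothetical: the function names are invented for illustration, and the returned `weights_read` count is only a crude proxy for memory traffic, not a latency measurement and not a description of how WINA, TEAL, or CATS are implemented.

```python
def masked_matvec(W, x, active):
    """Zero the inactive inputs, then run a dense product: every weight is read."""
    active_set = set(active)
    x_masked = [xi if i in active_set else 0.0 for i, xi in enumerate(x)]
    out = [sum(w * xi for w, xi in zip(row, x_masked)) for row in W]
    weights_read = len(W) * len(x)
    return out, weights_read

def gathered_matvec(W, x, active):
    """Read only the columns of W whose inputs are active."""
    out = [sum(row[j] * x[j] for j in active) for row in W]
    weights_read = len(W) * len(active)
    return out, weights_read

# Hypothetical 2x3 layer with inputs 0 and 1 active.
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, -2.0, 0.5]
active = [0, 1]

masked_out, masked_reads = masked_matvec(W, x, active)
gathered_out, gathered_reads = gathered_matvec(W, x, active)
```

Both paths return the same output, but only the gathered version touches fewer weights. Whether that turns into a wall-clock speedup depends on a sparse kernel and on measured latency, which is the evidence the reviewer asks for.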
The problem addressed is highly relevant, as reducing LLM inference cost without sacrificing output quality is an important challenge. The method is theoretically grounded, as incorporating weight norms provides a principled sparsification strategy with provable error bounds. The empirical evaluation is extensive, covering multiple LLMs, a range of tasks, and both low and high sparsity levels, and the method is compared to strong baselines including TEAL, R-Sparse, and CATS. The approach is prac
The main contribution is a relatively straightforward extension of existing sparse activation methods, which could be considered incremental, though it is strengthened by solid theoretical and empirical support. The paper could benefit from a discussion of potential limitations, such as scenarios where weight-informed selection might be less effective or challenges when scaling to very large models beyond those tested.
- The proposed method introduces a simple yet effective training-free sparse activation mechanism that combines both hidden-state magnitudes and the column-wise L2-norm of weight matrices to guide neuron selection.
- The theoretical analysis is rigorous and well structured, providing provably optimal approximation error bounds under clear and interpretable assumptions (column-wise orthogonality and monotonic activation).
- The experiments are comprehensive, covering multiple model architectures,
- The models in the experiments are small dense LLMs. Large-scale or MoE architectures (e.g., DeepSeek-V3, Llama4, GPT-OSS) which are more common in product deployment workloads are not tested. It’s unclear whether WINA’s activation gating would maintain efficiency with expert routing sparsity in these larger models.
- The evaluation focuses on theoretical FLOPs reduction but lacks real-world inference measurements such as latency or throughput on inference frameworks. Without kernel-level or ru
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
