Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei

TL;DR
Q-Sparse introduces a sparsity technique for large language models that maintains performance while significantly improving inference efficiency through top-K activation sparsification, with a block variant (Block Q-Sparse) for batch training and inference.
Contribution
It presents a novel sparsification method for LLMs that achieves full activation sparsity, enabling more efficient inference without sacrificing accuracy.
Findings
Q-Sparse achieves comparable results to baseline LLMs with higher inference efficiency.
An inference-optimal scaling law for sparsely-activated LLMs is developed.
Q-Sparse is effective across training-from-scratch, continue-training, and fine-tuning scenarios.
Abstract
We introduce Q-Sparse, a simple yet effective approach to training sparsely-activated large language models (LLMs). Q-Sparse enables full sparsity of activations in LLMs, which can bring significant efficiency gains in inference. This is achieved by applying top-K sparsification to the activations and the straight-through estimator to the training. We also introduce Block Q-Sparse for batch training and inference. The key results from this work are: (1) Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time; (2) we present an inference-optimal scaling law for sparsely-activated LLMs; (3) Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning; (4) Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of…
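The two ingredients named in the abstract, top-K sparsification of activations and the straight-through estimator, can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: it assumes the top-K entries are selected by absolute magnitude, and the function names (`topk_sparsify`, `backward_masked`, `backward_ste`) are hypothetical. It contrasts the exact gradient of a masked activation, which is zero for dropped entries, with the straight-through gradient, which passes through unchanged.

```python
def topk_sparsify(x, k):
    """Keep the k entries of x with the largest magnitude; zero the rest.

    Assumption: selection is by absolute value. Ties go to the lower index.
    """
    if not 0 <= k <= len(x):
        raise ValueError("k must be between 0 and len(x)")
    # Indices ordered by descending |x_i| (sorted() is stable, so ties keep index order).
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    keep = set(order[:k])
    mask = [1.0 if i in keep else 0.0 for i in range(len(x))]
    y = [xi * mi for xi, mi in zip(x, mask)]
    return y, mask


def backward_masked(grad_y, mask):
    """Exact gradient of y = x * mask with the mask held fixed."""
    return [g * m for g, m in zip(grad_y, mask)]


def backward_ste(grad_y, mask):
    """Straight-through estimator: treat sparsification as the identity
    in the backward pass, so dropped entries still receive gradient."""
    return list(grad_y)


x = [0.5, -2.0, 0.1, 1.5]
y, mask = topk_sparsify(x, k=2)
grad_y = [1.0, 1.0, 1.0, 1.0]

print(y)                              # [0.0, -2.0, 0.0, 1.5]
print(backward_masked(grad_y, mask))  # [0.0, 1.0, 0.0, 1.0]
print(backward_ste(grad_y, mask))     # [1.0, 1.0, 1.0, 1.0]
```

In this toy example, half of the entries are zeroed in the forward pass, yet with the straight-through variant every input position still receives a gradient signal during training.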
Peer Reviews
Decision · Submitted to ICLR 2025
1. The paper is well-written and easy to follow. 2. The extension of the scaling law to sparsely-activated LLMs in Section 3 is well-motivated and innovative. 3. Experimental evaluations are comprehensive, including both dense and quantized LLMs, reflecting realistic deployment settings.
1. MoE models are also a type of sparsely-activated model. It would be valuable to include a comparison between MoE models and the proposed sparsely-activated models in terms of accuracy and efficiency. 2. A concern lies in the actual inference speedup of the proposed method, especially during the decoding step in batched inference. In batched settings, the columns activated may vary across cases, and in the worst case, it may still be necessary to load all weights during each decoding step. Sin
1. The sparsely-activated models with around 40% sparsity ratio can perform comparably to the dense baselines with the same model size and training tokens. 2. The performance gap between the sparsely-activated models and the dense baselines decreases as the number of parameters goes up.
1. The paper states that "Q-Sparse enables full sparsity of activations in LLMs which can bring significant efficiency gains in inference", but there is neither discussion nor an experiment measuring execution time or peak GPU memory usage. 2. Evaluation on math (e.g., GSM8K, MATH) and code (e.g., HumanEval, MBPP) would be required to ensure that the performance of sparsely-activated models can match the performance of the dense baselines with the same model size and training toke
The authors present valuable work analyzing the impact of sparsity on LLMs.
This paper lacks methodological clarity regarding its novelty, and its structure is unclear. Moreover, the methodological novelty is limited, since the proposed sparsity method may be regarded as a set of tricks applied to different components of the LLM.
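The batched-decoding concern raised in the reviews above can be made concrete with a small experiment. The sketch below is illustrative only: it assumes per-sequence top-K selection by magnitude, uses random Gaussian activations rather than real model activations, and the helper names (`active_columns`, `union_fraction`) are hypothetical. It measures what fraction of weight columns must be loaded when each sequence in a batch selects its own top-K activations.

```python
import random


def active_columns(x, k):
    """Indices of the k largest-magnitude activations of one sequence."""
    order = sorted(range(len(x)), key=lambda i: -abs(x[i]))
    return set(order[:k])


def union_fraction(batch, k):
    """Fraction of weight columns touched by at least one sequence in the batch."""
    needed = set()
    for x in batch:
        needed |= active_columns(x, k)
    return len(needed) / len(batch[0])


rng = random.Random(0)  # fixed seed so runs are repeatable
dim, k = 64, 16         # 25% of activations kept per sequence
for batch_size in (1, 4, 16):
    batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch_size)]
    print(batch_size, union_fraction(batch, k))
```

With a single sequence the fraction is exactly k / dim. As the batch grows and sequences select different columns, the union can approach 1.0, which is the worst case the reviewer describes. Sharing the selection across a block of inputs is one way to bound this; the abstract mentions Block Q-Sparse for batch training and inference, though its details are not given in this summary.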
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
