The Super Weight in Large Language Models
Mengxia Yu, De Wang, Qi Shan, Colorado J Reed, Alvin Wan

TL;DR
This paper shows that a tiny subset of parameters, called super weights, is crucial to LLM performance: removing or quantizing them significantly impairs the model, which motivates new methods for identifying and preserving them.
Contribution
The paper introduces a data-free method for identifying super weights in LLMs, demonstrates their critical role, and proposes improved quantization techniques that preserve these weights.
Findings
Removing a single super weight can increase perplexity by three orders of magnitude.
Preserving super weights improves quantization accuracy.
Super weights induce large activation outliers.
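The quantization finding can be illustrated with a minimal sketch. It assumes a plain symmetric round-to-nearest scheme and a made-up list of weights containing one large outlier; the function names are hypothetical and this is not the paper's exact procedure. It only shows why holding the outlier out of the scale computation can help the remaining weights.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax     # one scale for the whole group
    if scale == 0:
        return list(weights)
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def rtn_preserving_outlier(weights, bits=4):
    """Hold out the largest-magnitude weight, quantize the rest, restore it."""
    idx = max(range(len(weights)), key=lambda i: abs(weights[i]))
    rest = weights[:idx] + weights[idx + 1:]
    dequantized = rtn_quantize(rest, bits)
    dequantized.insert(idx, weights[idx])           # outlier kept in full precision
    return dequantized


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Illustrative values only (an assumption, not taken from any model).
weights = [0.02, -0.05, 0.11, -0.08, 0.04, -1.9, 0.07, -0.03]

naive = rtn_quantize(weights)
preserved = rtn_preserving_outlier(weights)

err_naive = mean_abs_error(weights, naive)
err_preserved = mean_abs_error(weights, preserved)
print(f"naive RTN error:      {err_naive:.4f}")
print(f"outlier-preserved:    {err_preserved:.4f}")
```

In the naive case the single outlier sets the quantization scale, so every small weight rounds to zero. Holding the outlier out lets the scale fit the remaining weights, at the cost of storing one value separately.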
Abstract
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest…
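The link between a super weight and a super activation can be sketched on a toy linear layer. The matrix, the input, and the helper names below are all hypothetical, and the spike-matching rule is a simplification rather than the authors' implementation. It assumes that a single forward pass exposes a spike in both the layer's input and output, and that the spike channels give the coordinates of the candidate weight.

```python
def matvec(W, x):
    """Compute y = W x for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def argmax_abs(values):
    """Index of the largest-magnitude entry."""
    return max(range(len(values)), key=lambda i: abs(values[i]))


# Toy 4x4 layer with made-up values; one entry is far larger than the rest.
W = [
    [0.02, -0.01, 0.03, 0.01],
    [0.01, 0.02, -0.02, 0.03],
    [-0.03, 0.01, 2.50, -0.02],   # planted outlier at row 2, column 2
    [0.02, -0.02, 0.01, 0.01],
]
x = [0.5, -0.3, 4.0, 0.2]         # one fixed input with a spike in channel 2

# Single forward pass: locate the spikes in the output and the input.
y = matvec(W, x)
row = argmax_abs(y)               # output channel carrying the large activation
col = argmax_abs(x)               # input channel carrying the large activation
candidate = (row, col)

# Zero the candidate weight and rerun the forward pass.
W_pruned = [r[:] for r in W]
W_pruned[row][col] = 0.0
y_pruned = matvec(W_pruned, x)

print("candidate weight at", candidate)
print("largest |activation| before pruning:", max(abs(v) for v in y))
print("largest |activation| after pruning: ", max(abs(v) for v in y_pruned))
```

Zeroing the one candidate entry removes the large output activation while leaving every other weight untouched, which is the toy analogue of a single parameter dominating the layer's output.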
Peer Reviews
Decision · Submitted to ICLR 2025
The paper is well-written and effectively illustrates the importance of superweights and superactivations. I appreciate the discussion on the percolation of superactivations across the network and the identification of superweights across layers (Figure 3). Additionally, I find the potential implications of superweight upscaling presented in Figure 6 quite interesting.
While I appreciate the analysis presented in this paper, I am struggling to see the novelty of this work. I may be misunderstanding, but from what I gather, superweights and superactivations have already been discussed in prior analyses of LLMs. Additionally, it seems that methods like AWQ and SqueezeLLM inherently focus on superactivations. Furthermore, compared to other weight quantization techniques, the proposed method does not appear to offer significant improvements.
Novel discovery about the importance of a handful of neurons: The identification and analysis of super weights and super activations as critical outliers, and their positive influence on the model's performance, is noteworthy and interesting. Quantization proposals: The authors went one step further and proposed a super weight-aware quantization method to make the best use of these super weights/activations. Data free quantization proposal with on par performance compared to SmoothQuant is also a wort
Though the discovery is quite interesting, the improvements of the proposed methods over existing baselines are quite marginal. In general, such super weights might be a natural phenomenon in any machine learning model. How can one say this is relevant only to LLMs? The work seems to be based largely on empirical observations (which is not my concern), but more discussion, intuition, or explanation of how and why these super weights form would be useful.
The authors conducted experimental explorations on the so-called "super weights."
1. The necessity of "super weights" is unclear, as outliers are already identified based on the threshold. Increasing the threshold will naturally reduce the number of outliers with very large weights. Given the known importance of outliers in LLMs, emphasizing "super weights" (outliers at a higher threshold) does not appear novel.
2. Figure 1 is misleading. According to the authors' definition, "super weights" are a subset of outliers. However, the figure suggests -1.9 is a typical outlier wit
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Pruning
