The Super Weight in Large Language Models

Mengxia Yu; De Wang; Qi Shan; Colorado J Reed; Alvin Wan

arXiv:2411.07191·cs.CL·July 8, 2025

TL;DR

This paper shows that a tiny subset of parameters, called super weights, is crucial to LLM performance: removing or quantizing them severely degrades the model. The authors propose methods for identifying and preserving these parameters.

Contribution

The paper introduces a data-free method for identifying super weights in LLMs, demonstrates their critical role, and proposes quantization techniques that improve accuracy by preserving these weights.

Findings

01

Removing a single super weight drastically increases perplexity.

02

Preserving super weights improves quantization accuracy.

03

Super weights induce large activation outliers.
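
The third finding hints at how such a weight might be located. Below is a minimal sketch, assuming a single toy linear layer y = W x: it reads a weight coordinate off the largest input activation and the largest output activation, then zeroes that weight to see how much the output spike shrinks. The function name `locate_candidate_super_weight` and the toy matrix are hypothetical, and this page does not spell out the authors' actual procedure, so treat it as an illustration rather than their method.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def argmax_abs(values):
    """Index of the entry with the largest magnitude."""
    return max(range(len(values)), key=lambda i: abs(values[i]))


def locate_candidate_super_weight(W, x):
    """Return (row, col) of the weight linking the largest input
    activation to the largest output activation of y = W x.

    Illustrative heuristic only, not the authors' confirmed method.
    """
    y = matvec(W, x)
    row = argmax_abs(y)  # output channel with the biggest spike
    col = argmax_abs(x)  # input channel with the biggest spike
    return row, col


# Toy layer (assumed values): every weight is small except W[2][1].
W = [
    [0.01, -0.02, 0.03],
    [0.02, 0.01, -0.01],
    [-0.01, 5.00, 0.02],
]
x = [0.5, 3.0, -0.4]  # input whose channel 1 is already large

row, col = locate_candidate_super_weight(W, x)
y_before = matvec(W, x)

# Zero the candidate weight and recompute the layer output.
W_pruned = [r[:] for r in W]
W_pruned[row][col] = 0.0
y_after = matvec(W_pruned, x)
```

In this toy setting, removing the one flagged weight collapses the large output activation while leaving the other outputs untouched, which mirrors the link the finding draws between a single weight and an activation outlier.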

Abstract

Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest…
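
The quantization claim can be illustrated on a toy example. The sketch below assumes symmetric 4-bit round-to-nearest quantization over a short weight vector with one large outlier, and compares quantizing everything against holding the outlier out in full precision. The helper names and the weight values are hypothetical, and the abstract is truncated here, so this is not the authors' exact scheme.

```python
def rtn_quantize(weights, bits=4):
    """Symmetric round-to-nearest: map to integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def rtn_preserving_outlier(weights, bits=4):
    """Quantize every weight except the largest-magnitude one,
    which is kept in full precision."""
    idx = max(range(len(weights)), key=lambda i: abs(weights[i]))
    rest = weights[:idx] + weights[idx + 1:]
    dequantized = rtn_quantize(rest, bits)
    dequantized.insert(idx, weights[idx])
    return dequantized


def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Assumed toy weights: index 5 is the outlier.
weights = [0.12, -0.07, 0.31, -0.25, 0.05, 4.80, -0.18, 0.22]

err_plain = mean_abs_error(weights, rtn_quantize(weights))
err_kept = mean_abs_error(weights, rtn_preserving_outlier(weights))
```

With the outlier included, the quantization scale is so coarse that every other weight rounds to zero; holding it out lets the remaining weights use the full 4-bit range, so `err_kept` comes out well below `err_plain`.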

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is well-written and effectively illustrates the importance of superweights and superactivations. I appreciate the discussion on the percolation of superactivations across the network and the identification of superweights across layers (Figure 3). Additionally, I find the potential implications of superweight upscaling presented in Figure 6 quite interesting.

Weaknesses

While I appreciate the analysis presented in this paper, I am struggling to see the novelty of this work. I may be misunderstanding, but from what I gather, superweights and superactivations have already been discussed in prior analyses of LLMs. Additionally, it seems that methods like AWQ and SqueezeLLM inherently focus on superactivations. Furthermore, compared to other weight quantization techniques, the proposed method does not appear to offer significant improvements.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Novel discovery about the importance of a handful of neurons: the identification and analysis of super weights and super activations as critical outliers, and of their positive influence on the model's performance, is noteworthy and interesting. Quantization proposals: the authors went one step further and proposed a super-weight-aware quantization method to make the best use of these super weights and activations. Data-free quantization proposal with on-par performance compared to SmoothQuant is also a wort

Weaknesses

Though the discovery is quite interesting, the improvements of the proposed methods over existing baselines are quite marginal. In general, such super weights might be a natural phenomenon in any machine learning model. How can one say this is relevant only to LLMs? The work seems to be based largely on empirical observations (which is not my concern), but more discussion, intuition, and explanation of how and why these super weights form would be useful.

Reviewer 03 · Rating 1 · Confidence 5

Strengths

The authors conducted experimental explorations on the so-called "super weights."

Weaknesses

1. The necessity of "super weights" is unclear, as outliers are already identified based on the threshold. Increasing the threshold will naturally reduce the number of outliers with very large weights. Given the known importance of outliers in LLMs, emphasizing "super weights" (outliers at a higher threshold) does not appear novel.

2. Figure 1 is misleading. According to the author's definition, "super weights" are a subset of outliers. However, the figure suggests -1.9 is a typical outlier wit

Code & Models

Repositories

mengxiayu/llmsuperweight
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Pruning