House of Cards: Massive Weights in LLMs

Jaehoon Oh; Seungjun Shin; Dokwan Oh

arXiv:2410.01866 · cs.LG · February 7, 2025

TL;DR

This paper investigates massive activations in large language models, traces their origin to the intermediate state of a feed-forward network module in an early layer, and introduces MacDrop, a dropout method that improves fine-tuning robustness by reducing reliance on massive weights.

Contribution

It reveals the origin of massive activations in intermediate states, analyzes the impact of massive weights, and proposes MacDrop, a novel dropout technique for more robust parameter-efficient fine-tuning.

Findings

01

Massive weights are crucial for LLM functionality.

02

Setting the massive weights to zero entirely disrupts model functionality, whereas zeroing all weights except the massive ones has only a relatively minor effect.

03

MacDrop improves performance and robustness during fine-tuning.

Abstract

Massive activations, which manifest in specific feature dimensions of hidden states, introduce a significant bias in large language models (LLMs), leading to an overemphasis on the corresponding token. In this paper, we identify that massive activations originate not from the hidden state but from the intermediate state of a feed-forward network module in an early layer. Expanding on the previous observation that massive activations occur only in specific feature dimensions, we dive deep into the weights that cause massive activations. Specifically, we define top-$k$ massive weights as the weights that contribute to the dimensions with the top-$k$ magnitudes in the intermediate state. When these massive weights are set to zero, the functionality of LLMs is entirely disrupted. However, when all weights except for massive weights are set to zero, it results in a relatively minor…
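
To make the top-$k$ definition in the abstract concrete, here is a minimal NumPy sketch on a toy feed-forward projection. It is not the authors' code. The matrix layout (columns of a hypothetical `w_up` feed the intermediate dimensions), the helper names, and the toy values are all assumptions for illustration. A real LLM feed-forward block also has gating and a nonlinearity, which are omitted here.

```python
import numpy as np


def top_k_massive_dims(intermediate, k):
    """Indices of the k largest-magnitude entries of a 1-D intermediate state."""
    order = np.argsort(-np.abs(intermediate))
    return order[:k]


def zero_massive_weights(w_up, dims):
    """Copy of w_up with the columns that feed `dims` set to zero."""
    out = w_up.copy()
    out[:, dims] = 0.0
    return out


def keep_only_massive_weights(w_up, dims):
    """Copy of w_up where every column except those feeding `dims` is zero."""
    out = np.zeros_like(w_up)
    out[:, dims] = w_up[:, dims]
    return out


# Toy setup (hypothetical sizes and values, chosen so two dimensions dominate).
d_model, d_ff = 8, 16
hidden = np.ones(d_model)
w_up = np.full((d_model, d_ff), 0.1)
w_up[:, 3] = 50.0    # produces a very large positive intermediate value
w_up[:, 11] = -30.0  # produces a very large negative intermediate value

intermediate = hidden @ w_up            # shape (d_ff,)
dims = top_k_massive_dims(intermediate, k=2)

zeroed = zero_massive_weights(w_up, dims)      # the "massive weights set to zero" case
kept = keep_only_massive_weights(w_up, dims)   # the "all other weights set to zero" case
```

In this toy, `dims` selects the two dominant intermediate dimensions. `zeroed` and `kept` correspond to the two ablations the abstract contrasts. The sketch only illustrates the selection and masking, not their effect on a real model.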

Taxonomy

Topics: Diverse Research and Applications

Methods: Sparse Evolutionary Training · Dropout