TL;DR
This paper challenges the belief that small weights in large language models are redundant, showing they encode crucial knowledge for difficult tasks and that pruning them irreversibly impairs performance.
Contribution
It introduces the Junk DNA Hypothesis and shows that, as more small-magnitude weights are pruned, the performance drop grows monotonically with task difficulty, whereas quantization does not show the same effect.
Findings
As more small-magnitude weights are pruned, the performance drop grows monotonically with task difficulty.
Small weights encode essential knowledge for challenging downstream tasks.
Quantization does not exhibit the same monotonic effect as pruning.
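To make the contrast between the two compression operations concrete, here is a minimal sketch on a toy weight vector. It is not the paper's experimental setup: the function names `magnitude_prune` and `uniform_quantize`, the toy values, and the choice of symmetric uniform rounding for quantization are all assumptions made for illustration. The paper's actual pruning granularity and quantization method may differ.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with the smallest |w|.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    k = int(len(weights) * sparsity)  # number of weights to remove
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def uniform_quantize(weights, bits):
    """Symmetric uniform quantization: snap each weight to a grid of
    integer multiples of `scale`, then map back to floats.

    Assumed scheme for illustration; the paper may use a different one.
    """
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    q_max = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    scale = max_abs / q_max               # grid spacing
    out = []
    for w in weights:
        q = round(w / scale)              # nearest grid index
        q = max(-q_max, min(q_max, q))    # clamp to the representable range
        out.append(q * scale)
    return out


weights = [0.9, -0.05, 0.4, 0.02, -0.7, 0.1]

pruned = magnitude_prune(weights, 0.5)
print(pruned)      # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]

quantized = uniform_quantize(weights, 8)
print(quantized)   # every entry perturbed slightly, none removed
```

In this toy case, pruning deletes the small-magnitude entries outright, while 8-bit quantization keeps every entry and only perturbs it by at most half a grid step. The sketch only shows what each operation does to the weights; it does not reproduce the task-difficulty results summarized above.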
Abstract
We present the Junk DNA Hypothesis by adopting a novel task-centric angle on the pre-trained weights of large language models (LLMs). It has been believed that weights in LLMs contain significant redundancy, leading to the notion that a considerable chunk of the parameters can be removed by pruning without compromising performance. Contrary to this belief, this paper presents a counter-argument: small-magnitude weights of pre-trained models encode vital knowledge essential for tackling difficult downstream tasks, manifested as a monotonic relationship between the performance drop on downstream tasks and their difficulty as more pre-trained weights are pruned by magnitude. Moreover, we reveal that removing these seemingly inconsequential weights can result in irreparable loss of knowledge and performance degradation in difficult tasks, even when downstream continual training…
