Exploring and Reshaping the Weight Distribution in LLM
Chunming Ye, Songzhou Li, Xu Xu

TL;DR
This paper investigates the weight distribution in large language models, revealing power-law characteristics, and proposes a method to reshape LoRA weights based on these insights, improving training effectiveness.
Contribution
The study uncovers power-law distribution patterns in layer weights and introduces a novel data generation and weight reshaping method for LoRA training.
Findings
Power-law distribution of cosine distances between layer weights.
A new data generator based on Gaussian and Pareto distributions.
Improved LoRA training performance without changing model structure.
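The second finding names Gaussian and Pareto distributions, but this summary does not say how the generator combines them. As a minimal sketch only, assuming a simple two-component mixture (a Gaussian bulk plus a signed Pareto tail), sampling could look like the following. The function name `sample_gaussian_pareto`, the mixing rule, and every parameter value are hypothetical illustrations, not the authors' generator.

```python
import numpy as np

def sample_gaussian_pareto(n, tail_fraction=0.1, shape=2.5, scale=1.0, seed=0):
    """Draw n values from a Gaussian bulk mixed with a signed Pareto tail.

    The mixing rule and all defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    bulk = rng.normal(loc=0.0, scale=1.0, size=n)
    # rng.pareto samples the Lomax (Pareto II) distribution; adding 1 and
    # multiplying by `scale` yields a classical Pareto with minimum `scale`.
    tail = (rng.pareto(shape, size=n) + 1.0) * scale
    signs = rng.choice([-1.0, 1.0], size=n)
    # Each position is taken from the tail with probability `tail_fraction`.
    use_tail = rng.random(n) < tail_fraction
    return np.where(use_tail, signs * tail, bulk), use_tail

samples, from_tail = sample_gaussian_pareto(10_000)
```

Here `from_tail` marks which samples came from the heavy-tailed component, which makes it easy to inspect the two parts of the mixture separately.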
Abstract
The performance of large language models is influenced by characteristics such as architecture, model size, and decoding method. Because layers differ in structure and function, the weights in different layers of a large model follow different distributions. This paper explores the correlations between different types of layers in terms of weight distribution and studies the potential impact of these correlations on LoRA training effectiveness. First, the study shows that the cosine distances between the weights of different layers in a model follow a power-law distribution. We extract the query-projection, down-projection, and other weight matrices from the self-attention and MLP layers, compute their singular values by singular value decomposition, and organize a certain number of singular values into matrices according to projection type. By analyzing…
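The extraction step described in the abstract can be sketched roughly as follows. This is one possible reading, not the authors' code: random matrices stand in for the projection weights of one projection type, the number of singular values kept (`k`) is arbitrary, and taking cosine distances between per-layer singular-value vectors is an assumption about how the comparison is made. The helper names are hypothetical.

```python
import numpy as np

def top_singular_values(weight, k):
    """Return the k largest singular values of a 2-D weight matrix."""
    # compute_uv=False returns only the singular values, in descending order.
    s = np.linalg.svd(weight, compute_uv=False)
    return s[:k]

def cosine_distance_matrix(rows):
    """Pairwise cosine distances (1 - cosine similarity) between rows."""
    norms = np.linalg.norm(rows, axis=1, keepdims=True)
    unit = rows / norms
    return 1.0 - unit @ unit.T

rng = np.random.default_rng(0)
num_layers, d_out, d_in, k = 6, 32, 16, 8

# Random stand-ins for one projection type (e.g. query projection) per layer.
q_proj_weights = [rng.normal(size=(d_out, d_in)) for _ in range(num_layers)]

# One row of singular values per layer, grouped by projection type.
sv_matrix = np.stack([top_singular_values(w, k) for w in q_proj_weights])

# Layer-to-layer cosine distances for this projection type.
dist = cosine_distance_matrix(sv_matrix)
```

With real model weights, the off-diagonal entries of `dist` would be the values whose distribution is examined for power-law behavior.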
Taxonomy
Topics: Topic Modeling · Big Data and Digital Economy · Natural Language Processing Techniques
