A Mean Field Ansatz for Zero-Shot Weight Transfer
Xingyuan Chen, Wenwei Kuang, Lei Deng, Wei Han, Bo Bai, Goncalo dos Reis

TL;DR
This paper introduces a mean field theoretical framework to explain zero-shot weight transfer in large language models, providing a new perspective that supports the empirical effectiveness of weight transfer across different model sizes.
Contribution
It proposes the row-column mean field ansatz to model weight distributions and offers a theoretical explanation for zero-shot weight transfer in neural networks.
Findings
The RC ansatz describes weight measure structures effectively.
Empirical validation on GPT-3 and Llama-3.1 supports the theory.
The mean field perspective explains the success of weight transfer.
Abstract
The pre-training cost of large language models (LLMs) is prohibitive. One cutting-edge approach to reducing this cost is zero-shot weight transfer, in some cases also known as model growth, which transfers the weights trained in a small model to a large model. However, the theory behind weight transfer is still poorly understood. In this paper, inspired by prior applications of mean field theory to neural network dynamics, we introduce a mean field ansatz to provide a theoretical explanation for weight transfer. Specifically, we propose the row-column (RC) ansatz under the mean field point of view, which describes the measure structure of the weights in the neural network (NN) and admits closed measure dynamics. Thus, the weights of NNs of different sizes admit a common distribution under proper assumptions, and weight transfer methods can be viewed as sampling methods. We…
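The "weight transfer as sampling" view in the abstract can be illustrated with a toy sketch. This is not the paper's RC ansatz or any of its transfer methods: it assumes a single linear layer with 1/n mean field scaling and grows it by resampling row and column indices of the small weight matrix with replacement. The function names (`mean_field_layer`, `grow_by_sampling`) and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_field_layer(W, x):
    """Linear layer with mean field scaling: y_i = (1/n) * sum_j W[i, j] * x[j]."""
    return W @ x / W.shape[1]


def grow_by_sampling(W_small, n_rows, n_cols, rng):
    """Build a larger weight matrix by sampling rows and columns of W_small.

    Assumption for illustration: rows and columns are drawn uniformly with
    replacement, so the large matrix is a sample from the empirical
    row/column distribution of the small one.
    """
    row_idx = rng.integers(0, W_small.shape[0], size=n_rows)
    col_idx = rng.integers(0, W_small.shape[1], size=n_cols)
    W_big = W_small[np.ix_(row_idx, col_idx)]
    return W_big, row_idx, col_idx


rng = np.random.default_rng(0)
W_small = rng.normal(size=(8, 16))   # stand-in for "trained" small-model weights
x_small = rng.normal(size=16)        # input to the small layer

# Grow the layer: few output rows, many input columns.
W_big, row_idx, col_idx = grow_by_sampling(W_small, n_rows=4, n_cols=40000, rng=rng)
x_big = x_small[col_idx]             # input units are duplicated consistently

y_small = mean_field_layer(W_small, x_small)
y_big = mean_field_layer(W_big, x_big)

# Each large-layer output is a Monte Carlo estimate of the small-layer
# output of the row it was copied from.
max_gap = np.max(np.abs(y_big - y_small[row_idx]))
print(max_gap)
```

Under the 1/n scaling, each output of the grown layer is an average over sampled columns, so it approaches the corresponding small-layer output as the number of sampled columns increases. That is the sense in which this toy treats growth as sampling from a common distribution.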
Taxonomy
Topics: Advanced SAR Imaging Techniques
Methods: Cosine Annealing · Weight Decay · Adam · Byte Pair Encoding · Softmax · Linear Layer · Attention Dropout · Dense Connections
