Deep Fusion: Efficient Network Training via Pre-trained Initializations
Hanna Mazzawi, Xavi Gonzalvo, Michael Wunder, Sammy Jerome, Benoit Dherin

TL;DR
Deep Fusion is a training method that initializes a large network from pre-trained smaller networks. It is paired with a theoretical framework for analyzing training dynamics when a network is grown mid-training. The authors report reduced training time and computational cost on NLP tasks.
Contribution
The paper presents Deep Fusion, a network training approach that leverages pre-trained initializations of smaller networks, and a theoretical framework based on backward error analysis for understanding the dynamics of network growth during training.
Findings
Deep Fusion accelerates training and reduces computational costs.
It maintains or improves performance compared to traditional methods.
The theoretical framework guides optimal training dynamics.
Abstract
In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks in the context of LLMs is the need for large amounts of computational resources and time. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. We present two notable contributions in this paper. First, we present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks. Second, we propose a theoretical framework using backward error analysis to illustrate the dynamics of mid-training network growth. Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but…
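The abstract does not spell out how smaller pre-trained networks are combined, so the following is only a minimal sketch of one common network-growing idea: placing two small linear layers on the diagonal of a wider layer so that the wider layer initially reproduces both. The function names (`matvec`, `fuse_linear`) and the weights are hypothetical, chosen for illustration, and are not taken from the paper.

```python
# Illustrative sketch only: block-diagonal "fusion" of two small linear layers.
# This is an assumed construction, not the paper's confirmed method.

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def fuse_linear(w_a, w_b):
    """Build a wider weight matrix with w_a and w_b on the diagonal, zeros elsewhere."""
    cols_a, cols_b = len(w_a[0]), len(w_b[0])
    top = [row + [0.0] * cols_b for row in w_a]
    bottom = [[0.0] * cols_a + row for row in w_b]
    return top + bottom

# Two small "pre-trained" layers (weights made up for the example).
w_a = [[1.0, 2.0], [0.0, -1.0]]
w_b = [[0.5, 0.5], [3.0, 1.0]]

x = [1.0, 2.0]
fused = fuse_linear(w_a, w_b)

# Feeding the same input to both halves reproduces each small layer's output,
# so the wider layer starts from the small layers' learned function.
y_fused = matvec(fused, x + x)
y_separate = matvec(w_a, x) + matvec(w_b, x)
assert y_fused == y_separate
```

The zero off-diagonal blocks mean the two halves do not interact at initialization; further training of the wider layer is then free to learn cross-connections between them.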
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Gated Linear Unit · Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Adafactor · SentencePiece · Layer Normalization · Residual Connection · Softmax
