Scaling Studies for Efficient Parameter Search and Parallelism for Large Language Model Pre-training
Michael Benington, Leo Phan, Chris Pierre Paul, Evan Shoemaker, Priyanka Ranade, Torstein Collett, Grant Hodgson Perez, Christopher Krieger

TL;DR
This paper investigates how different parallelism strategies, especially Microsoft DeepSpeed ZeRO stages, affect the efficiency of training large language models with up to 13 billion parameters.
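As background on why the ZeRO stage matters at this scale, the sketch below estimates per-GPU memory for model states only (parameters, gradients, optimizer states). It assumes the commonly cited accounting for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name and the GPU count of 64 are hypothetical choices for illustration, not figures from the paper, and activations and buffers are ignored.

```python
def zero_model_state_bytes(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for model states under mixed-precision Adam.

    Assumed accounting: 2 bytes/param (fp16 weights), 2 bytes/param
    (fp16 gradients), 12 bytes/param (fp32 weights, momentum, variance).
    Activations, temporary buffers, and fragmentation are not counted.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:  # stage 1 partitions optimizer states across GPUs
        optim /= num_gpus
    if stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if stage >= 3:  # stage 3 also partitions the parameters themselves
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical setup: 13 billion parameters spread over 64 GPUs.
    for stage in range(4):
        gib = zero_model_state_bytes(13e9, 64, stage) / 2**30
        print(f"ZeRO stage {stage}: {gib:.1f} GiB of model states per GPU")
```

Under these assumptions, higher stages trade extra communication for a smaller per-GPU footprint, which is the kind of trade-off a scaling study of this type would measure.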
Contribution
It provides a detailed analysis of parallelism techniques for large-scale LLM pre-training, focusing on optimizing data processing and resource utilization.
Findings
Quantified relationships between parallelism methods.
Evaluated efficiency of ZeRO stages in large model training.
Provided insights for scalable LLM pre-training strategies.
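A comparison across ZeRO stages is typically driven by the DeepSpeed JSON configuration file. The following is a minimal sketch of generating one configuration per stage. The helper name `make_zero_config` and the batch-size values are hypothetical and are not taken from the paper; the configuration keys themselves (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `fp16`, `zero_optimization.stage`) are standard DeepSpeed options.

```python
import json


def make_zero_config(stage, micro_batch_size=4, grad_accum_steps=1):
    """Build a minimal DeepSpeed-style config dict for one ZeRO stage.

    The batch-size defaults are placeholder values for illustration only.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": grad_accum_steps,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": stage},
    }


if __name__ == "__main__":
    # Emit one config per stage so each run differs only in the ZeRO setting.
    for stage in range(4):
        print(json.dumps(make_zero_config(stage), indent=2))
```

Holding every other setting fixed while varying only `stage` keeps the comparison between runs attributable to the ZeRO setting.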
Abstract
AI accelerator processing capabilities and memory constraints largely dictate the scale at which machine learning workloads (e.g., training and inference) can be executed within a desirable time frame. Training a state-of-the-art, transformer-based model today requires the use of GPU-accelerated high-performance computers with high-speed interconnects. As datasets and models continue to increase in size, computational requirements and memory demands for AI also continue to grow. These challenges have inspired the development of distributed algorithm and circuit-based optimization techniques that make it possible to progressively scale models in multi-node environments, efficiently minimize neural network cost functions for faster convergence, and fit more parameters into a fixed set of available resources. In our research project, we focus on parallel and distributed machine learning…
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Natural Language Processing Techniques
