Performance of Small Language Model Pretraining on FABRIC: An Empirical Study
Praveen Rao

TL;DR
This study evaluates pretraining techniques for small language models on commodity GPU clusters, analyzing parallelism strategies and network effects to optimize training performance and resource usage.
Contribution
It provides a systematic approach for selecting pretraining methods for small LLMs considering hardware and network constraints, based on extensive empirical testing.
Findings
Alpa's execution plans outperform others in distributed settings.
Network latency significantly impacts pretraining efficiency.
Optimized parallelism strategies reduce training time and resource consumption.
Abstract
Large language models (LLMs) require enormous computing power to pretrain on massive datasets. When limited datasets are available, smaller-sized LLMs are a better choice to pretrain (on user-specified datasets), following the scaling laws of LLMs. Using pretrained models, vector embeddings can be generated for raw data and stored using vector databases to support modern AI applications and semantic search. In this work, we investigate the performance of pretraining techniques for smaller-sized LLMs on an experimental testbed (with commodity GPUs) available to academic users at no charge. We consider data parallelism, intra-operator parallelism, and inter-operator/pipeline parallelism, and their combinations for pretraining. We set up different GPU clusters with homogeneous and heterogeneous GPU hardware. Furthermore, we investigate the impact of network latency on pretraining…
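To make two of the parallelism strategies named in the abstract concrete, here is a minimal single-process sketch in plain Python. It is not the paper's code and does not use Alpa's API; all function names are hypothetical, and the "workers" are simulated with list slices. It assumes equal-sized shards, so that averaging per-worker gradients reproduces the full-batch gradient.

```python
def matmul(x, w):
    """Multiply a batch x (list of rows) by a weight matrix w (list of rows)."""
    return [[sum(xi * w[k][j] for k, xi in enumerate(row))
             for j in range(len(w[0]))] for row in x]

def grad_mse(x, y, w):
    """Gradient of mean squared error for a scalar linear model y ~ w * x."""
    return sum(2 * (w * xi - yi) * xi for xi, yi in zip(x, y)) / len(x)

def data_parallel_grad(x, y, w, num_workers):
    """Data parallelism: each simulated worker holds a full copy of w and an
    equal-sized shard of the batch; gradients are averaged (an all-reduce)."""
    shard = len(x) // num_workers
    grads = [grad_mse(x[i * shard:(i + 1) * shard],
                      y[i * shard:(i + 1) * shard], w)
             for i in range(num_workers)]
    return sum(grads) / num_workers

def intra_op_matmul(x, w, num_workers):
    """Intra-operator parallelism: each simulated worker holds a column slice
    of w and computes part of the output; parts are concatenated (an all-gather)."""
    cols = len(w[0]) // num_workers
    parts = [matmul(x, [row[i * cols:(i + 1) * cols] for row in w])
             for i in range(num_workers)]
    return [sum((p[r] for p in parts), []) for r in range(len(x))]
```

In both cases the sharded computation gives the same result as the unsharded one; what differs is which tensors are split and what must be communicated. That communication is where the network latency studied in this work would come into play.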
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
