BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

TL;DR
BOOST is a training framework for low-rank bottleneck large language models. It introduces bottleneck-aware tensor parallelism and combines it with online-RMSNorm, linear layer grouping, and low-rank activation checkpointing, reducing both training time and communication cost.
Contribution
The paper presents Bottleneck-aware Tensor Parallelism, a parallelism method designed specifically for low-rank bottleneck architectures, together with supporting optimizations that make training such large language models scalable and efficient.
Findings
Achieves 1.46-1.91× speedup over full-rank baselines.
Achieves 1.87-2.27× speedup over naive 3D parallelism.
Improves GPU utilization and reduces communication overhead.
Abstract
The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimum impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck…
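To make the abstract's claim of reduced computation and memory concrete, the sketch below compares a dense linear layer with a rank-r bottleneck factorization (a d_in × r projection followed by an r × d_out projection). The layer sizes, the rank, and the function names are hypothetical values chosen for illustration, not figures or code from the paper, and the FLOP counts cover only the forward matrix multiplies.

```python
# Illustrative cost model for one linear layer, assuming a plain two-factor
# bottleneck (d_in -> rank -> d_out). All sizes below are hypothetical.

def full_rank_cost(d_in, d_out, tokens):
    """Parameters and forward matmul FLOPs of a dense d_in x d_out layer."""
    params = d_in * d_out
    flops = 2 * tokens * d_in * d_out  # one multiply-add per weight per token
    return params, flops


def low_rank_cost(d_in, d_out, rank, tokens):
    """Parameters and forward matmul FLOPs of a rank-`rank` bottleneck layer."""
    params = rank * (d_in + d_out)  # down-projection plus up-projection
    flops = 2 * tokens * rank * (d_in + d_out)
    return params, flops


d_in = d_out = 4096  # hypothetical hidden size
rank = 512           # hypothetical bottleneck rank
tokens = 2048        # hypothetical tokens per batch

full_params, full_flops = full_rank_cost(d_in, d_out, tokens)
lr_params, lr_flops = low_rank_cost(d_in, d_out, rank, tokens)

param_ratio = full_params / lr_params
flop_ratio = full_flops / lr_flops
print(f"parameter reduction: {param_ratio:.1f}x, FLOP reduction: {flop_ratio:.1f}x")
```

Under these assumed sizes the bottleneck layer has a quarter of the parameters and matmul FLOPs of the dense layer. This model ignores communication, which is the cost the abstract says dominates when standard tensor parallelism is applied to such layers.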
