LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs
Erik Schultheis, Dan Alistarh

TL;DR
LLMQ is an efficient training framework for training language models of up to 32B parameters on affordable consumer GPUs. It combines 8-bit training with memory and communication optimizations, and its efficiency rivals that of production-scale systems running on much more expensive cloud-grade GPUs.
Contribution
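The paper introduces LLMQ, an end-to-end CUDA/C++ implementation for training medium-sized language models (3B to 32B parameters) on commodity GPUs. Its optimizations, including activation checkpointing, offloading, and copy-engine-based collectives, target the limited memory and slow communication of these devices without adding algorithmic approximations to the standard 8-bit training pipeline.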
Findings
Trains or fine-tunes a 7B model on a single 16GB mid-range gaming GPU
Trains a 32B model on a workstation equipped with 4 RTX 4090 GPUs
Maintains around 50% FLOP utilization while running a standard 8-bit training pipeline
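To make the FLOP utilization figure concrete, the sketch below shows one common way such a number can be estimated. It assumes the widely used approximation of 6 FLOPs per parameter per token (forward plus backward pass, attention ignored); whether the paper uses exactly this accounting is not stated in this summary. The function name, the throughput, and the peak rate are hypothetical placeholders, not measurements or vendor figures.

```python
def model_flops_utilization(n_params, tokens_per_second, peak_flops_per_second):
    """Estimate the fraction of peak throughput spent on model FLOPs.

    Assumes roughly 6 FLOPs per parameter per token: 2 for the forward
    pass and 4 for the backward pass. Attention FLOPs are ignored.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical inputs chosen only for illustration (not from the paper):
# a 7B-parameter model, 4000 tokens/s, and an assumed peak of 330e12 FLOP/s.
mfu = model_flops_utilization(
    n_params=7e9,
    tokens_per_second=4000,
    peak_flops_per_second=330e12,
)
print(f"Estimated utilization: {mfu:.2f}")
```

With these placeholder inputs the estimate comes out near one half, which is the regime the finding above describes.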
Abstract
We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
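A back-of-the-envelope estimate illustrates why memory is the central bottleneck on a 16GB card. The sketch below counts only weights, gradients, and optimizer state. The byte widths are assumptions for a typical mixed-precision Adam setup, not the layout LLMQ actually uses, which this summary does not specify.

```python
GIB = 1024 ** 3  # bytes per GiB


def training_state_gib(n_params, bytes_weights, bytes_grads, bytes_optimizer):
    """Memory for weights, gradients, and optimizer state in GiB.

    Activations and temporary buffers are excluded, so this is a lower
    bound on the real footprint.
    """
    total_bytes = n_params * (bytes_weights + bytes_grads + bytes_optimizer)
    return total_bytes / GIB


# Assumed layout (illustrative only): 2-byte weights, 2-byte gradients,
# and two 4-byte Adam moments per parameter.
full_state = training_state_gib(7e9, bytes_weights=2, bytes_grads=2, bytes_optimizer=8)

# For comparison: a single 1-byte copy of the weights and nothing else.
weights_8bit_only = training_state_gib(7e9, bytes_weights=1, bytes_grads=0, bytes_optimizer=0)

print(f"Full training state: {full_state:.1f} GiB")
print(f"8-bit weights only:  {weights_8bit_only:.1f} GiB")
```

Under these assumptions the full training state of a 7B model is several times larger than 16GB, while an 8-bit copy of the weights alone fits comfortably. This gap is what techniques such as offloading and activation checkpointing are meant to close.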
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Natural Language Processing Techniques
