LLMQ: Efficient Lower-Precision Pretraining for Consumer GPUs

Erik Schultheis; Dan Alistarh

arXiv:2512.15306 · cs.DC · December 18, 2025

TL;DR

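LLMQ is an end-to-end CUDA/C++ training framework that enables medium-sized language models (3B to 32B parameters) to be trained on affordable consumer GPUs. It combines a standard 8-bit training pipeline with activation checkpointing, offloading, and copy-engine based collectives, and reaches efficiency that rivals production-scale systems running on much more expensive cloud-grade GPUs.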

Contribution

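The paper introduces LLMQ, an end-to-end CUDA/C++ implementation for training language models on commodity GPUs. Its optimizations target the two main bottlenecks of these devices, low memory availability and slow communication, and it runs a standard 8-bit training pipeline without additional algorithmic approximations.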

Findings

01 Trains or fine-tunes a 7B model on a single 16GB mid-range gaming card

02 Trains or fine-tunes a 32B model on a workstation equipped with 4 RTX 4090 GPUs

03 Maintains FLOP utilization of around 50% while executing a standard 8-bit training pipeline
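As a rough illustration of what a FLOP utilization figure means, the sketch below uses the common rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. The function name and all numbers are hypothetical placeholders chosen for round arithmetic; they are not measurements from the paper, and the paper may compute utilization differently.

```python
def flop_utilization(n_params: float, tokens_per_second: float,
                     peak_flops_per_second: float) -> float:
    """Estimate FLOP utilization as achieved FLOP/s divided by peak FLOP/s.

    Assumption: training cost is approximated as 6 * n_params FLOPs per token
    (forward plus backward pass of a dense transformer, attention ignored).
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    return achieved_flops_per_second / peak_flops_per_second


# Placeholder inputs, not real hardware specs or measured throughput.
utilization = flop_utilization(
    n_params=5e9,                 # hypothetical model size
    tokens_per_second=1000.0,     # hypothetical training throughput
    peak_flops_per_second=6e13,   # hypothetical device peak
)
print(f"{utilization:.0%}")  # prints "50%"
```

The peak value should match the precision actually used for the matrix multiplications, since peak throughput differs between 8-bit and 16-bit arithmetic.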

Abstract

We present LLMQ, an end-to-end CUDA/C++ implementation for medium-sized language-model training, e.g. 3B to 32B parameters, on affordable, commodity GPUs. These devices are characterized by low memory availability and slow communication compared to datacentre-grade GPUs. Consequently, we showcase a range of optimizations that target these bottlenecks, including activation checkpointing, offloading, and copy-engine based collectives. LLMQ is able to train or fine-tune a 7B model on a single 16GB mid-range gaming card, or a 32B model on a workstation equipped with 4 RTX 4090s. This is achieved while executing a standard 8-bit training pipeline, without additional algorithmic approximations, and maintaining FLOP utilization of around 50%. The efficiency of LLMQ rivals that of production-scale systems on much more expensive cloud-grade GPUs.
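To see why memory is the bottleneck the abstract describes, a back-of-the-envelope estimate of persistent training state helps. The sketch below is an assumption-laden illustration: the bytes-per-parameter figures describe a generic mixed-precision setup with an Adam-style optimizer, not LLMQ's actual memory layout, and activations are ignored entirely.

```python
def training_state_gib(n_params: float, weight_bytes: float,
                       grad_bytes: float, optimizer_bytes: float) -> float:
    """Return the memory for weights, gradients, and optimizer state in GiB.

    All byte counts are per parameter. Activation memory is not included.
    """
    bytes_per_param = weight_bytes + grad_bytes + optimizer_bytes
    return n_params * bytes_per_param / 2**30


# Hypothetical layout: 16-bit weights, 16-bit gradients, and two 32-bit
# optimizer moments per parameter (12 bytes per parameter in total).
state = training_state_gib(7e9, weight_bytes=2, grad_bytes=2, optimizer_bytes=8)
print(f"7B model, persistent state: {state:.1f} GiB")
print("fits in 16 GiB without further measures:", state <= 16)
```

Under these assumptions the persistent state of a 7B model is several times larger than a 16GB card, before any activations are stored. This is the gap that techniques such as lower-precision storage, offloading, and activation checkpointing are meant to close.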

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Natural Language Processing Techniques