Exploiting Block Coordinate Descent for Cost-Effective LLM Model Training
Zeyu Liu, Yan Li, Yunquan Zhang, Boyang Zhang, Guoyong Jiang, Xin Zhang, Limin Xiao, Weifeng Zhang, Daning Cheng

TL;DR
This paper introduces a block coordinate descent-based training framework for large language models that significantly reduces costs and hardware requirements while maintaining or improving accuracy.
Contribution
The authors develop a cost-effective BCD-based training method enabling large-scale LLM training on consumer-grade GPUs with comparable or better performance.
Findings
Training cost of a 7B model reduced to 33% on A100/A800 and 2.6% on RTX 4090, relative to standard full-parameter training on identical hardware.
Enables training of large models on RTX 4090 without performance loss.
Achieves comparable or better accuracy with lower GPU consumption.
Abstract
Training large language models typically demands extensive GPU memory and substantial financial investment, which poses a barrier for many small- to medium-sized teams. In this paper, we propose a full-parameter pre-training and fine-tuning framework based on block coordinate descent (BCD), enhanced with engineering optimizations, to enable efficient training of large-scale models on cost-effective RTX 4090, A100 and A800 GPU clusters. Under identical hardware configurations, we reduce the training cost of a 7B model to 33% on A100/A800 and only 2.6% on RTX 4090, compared to standard full-parameter training. The framework also enables large models previously restricted to A100 clusters to be trained on RTX 4090 without degrading performance. BCD achieves comparable or better accuracy than full-parameter and fine-tuning methods in most cases, with lower GPU consumption and improved hardware…
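To make the core idea concrete, here is a minimal sketch of block coordinate descent on a toy linear regression problem: parameters are split into blocks, and each block is updated by a few gradient steps while all other blocks stay frozen. Everything in it is an assumption chosen for illustration (the data, the two-block partition, the step size, and the names `bcd_train` and `block_gradient`); it does not reproduce the paper's framework or its engineering optimizations.

```python
# Toy block coordinate descent (BCD): update one parameter block at a time
# while the remaining blocks stay frozen. Illustrative sketch only; this is
# not the paper's training framework.

def loss(w, xs, ys):
    """Mean squared error of the linear model y ~ w . x."""
    total = 0.0
    for x, y in zip(xs, ys):
        pred = sum(wi * xi for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / len(xs)


def block_gradient(w, xs, ys, block):
    """Gradient of the loss with respect to the coordinates in `block` only."""
    grad = {i: 0.0 for i in block}
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i in block:
            grad[i] += 2.0 * err * x[i] / len(xs)
    return grad


def bcd_train(xs, ys, blocks, sweeps=200, inner_steps=5, lr=0.5):
    """Cycle over the blocks; only the active block's coordinates change."""
    w = [0.0] * len(xs[0])
    for _ in range(sweeps):
        for block in blocks:
            for _ in range(inner_steps):
                grad = block_gradient(w, xs, ys, block)
                for i in block:
                    w[i] -= lr * grad[i]
    return w


# Hypothetical noise-free data generated from known weights.
TRUE_W = [1.0, -2.0, 0.5, 3.0]
XS = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
    [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1],
]
YS = [sum(t * xi for t, xi in zip(TRUE_W, x)) for x in XS]

# Assumed partition: two blocks of two coordinates each.
BLOCKS = [[0, 1], [2, 3]]

w = bcd_train(XS, YS, BLOCKS)
print("learned weights:", [round(v, 4) for v in w])
print("final loss:", loss(w, XS, YS))
```

In this sketch, only the active block needs a gradient at any given step. That is the property a BCD-based training scheme can exploit to lower memory use, since gradients and optimizer state are required for one block rather than for the full parameter set.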
