Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Jinrui Zhang; Chaodong Xiao; Aoqi Wu; Xindong Zhang; Lei Zhang

arXiv:2602.11543 · cs.CL · May 5, 2026

TL;DR

This paper introduces SPES, a memory-efficient decentralized framework for pretraining large mixture-of-experts language models across distributed GPUs, reducing memory use and communication costs while maintaining competitive performance.

Contribution

The authors propose a novel decentralized training method for MoE LLMs that trains only subsets of experts per node and introduces expert-merging warm-up, enabling training of large models with less memory and communication.
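The memory argument can be made concrete with a small counting sketch. It is not taken from the paper: it assumes experts are split evenly across nodes and that every node also trains the shared (non-expert) parameters, and all function and parameter names are hypothetical. It counts trainable parameters only, because gradients and optimizer state scale with that count.

```python
def full_trainable_params(shared_params: int, params_per_expert: int,
                          num_experts: int, num_moe_layers: int) -> int:
    """Trainable parameters when one node trains every expert."""
    return shared_params + num_moe_layers * num_experts * params_per_expert


def trainable_params_per_node(shared_params: int, params_per_expert: int,
                              num_experts: int, num_moe_layers: int,
                              num_nodes: int) -> int:
    """Trainable parameters on one node under an assumed even expert split.

    Assumption (not from the paper): each node trains all shared parameters
    plus num_experts / num_nodes experts in every MoE layer.
    """
    if num_nodes <= 0 or num_experts % num_nodes != 0:
        raise ValueError("num_experts must be divisible by num_nodes")
    experts_per_node = num_experts // num_nodes
    return shared_params + num_moe_layers * experts_per_node * params_per_expert


if __name__ == "__main__":
    # Toy numbers chosen only for illustration.
    full = full_trainable_params(100, 10, 8, 2)
    per_node = trainable_params_per_node(100, 10, 8, 2, num_nodes=4)
    print(f"full: {full}, per node: {per_node}")
```

With these toy numbers a node trains 140 parameters instead of 260. The saving grows with the share of the model that sits in expert weights.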

Findings

1. Trained a 2B-parameter MoE LLM on 16 GPUs with competitive results.
2. Successfully scaled to 7B and 9B models matching centralized baselines.
3. Achieved training over internet connections with reduced memory footprint.

Abstract

Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to…
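One way to picture the periodic synchronization step is the toy sketch below. It is an illustration under stated assumptions, not the SPES implementation: the round-robin expert assignment, the averaging of shared parameters, the `expert_<id>` key naming, and the helper names `assign_experts` and `sync` are all hypothetical. The expert-merging warm-up is not modeled here.

```python
from typing import Dict, List

# A node's parameters: name -> flat list of floats (toy stand-in for tensors).
Params = Dict[str, List[float]]


def assign_experts(num_experts: int, num_nodes: int) -> List[List[int]]:
    """Round-robin assignment of expert ids to nodes (an assumed scheme)."""
    return [list(range(node, num_experts, num_nodes)) for node in range(num_nodes)]


def sync(nodes: List[Params], owners: List[List[int]]) -> None:
    """Synchronize nodes in place.

    Assumptions (not from the paper): shared parameters are averaged across
    nodes, and each expert is copied from the single node that trains it,
    so no node ever sends experts it does not own.
    """
    num_nodes = len(nodes)

    # Shared (non-expert) parameters: element-wise mean over all nodes.
    shared_keys = [key for key in nodes[0] if not key.startswith("expert_")]
    for key in shared_keys:
        columns = zip(*(node[key] for node in nodes))
        mean = [sum(values) / num_nodes for values in columns]
        for node in nodes:
            node[key] = list(mean)

    # Expert parameters: the owner's copy replaces every other node's copy.
    for owner_id, expert_ids in enumerate(owners):
        for expert_id in expert_ids:
            key = f"expert_{expert_id}"
            source = nodes[owner_id][key]
            for node_id, node in enumerate(nodes):
                if node_id != owner_id:
                    node[key] = list(source)


if __name__ == "__main__":
    owners = assign_experts(num_experts=2, num_nodes=2)
    nodes = [
        {"shared": [1.0, 3.0], "expert_0": [1.0], "expert_1": [0.0]},
        {"shared": [3.0, 5.0], "expert_0": [0.0], "expert_1": [2.0]},
    ]
    sync(nodes, owners)
    print(nodes[0] == nodes[1])
```

In this toy setup each node sends only the shared parameters and the experts it owns, which is one way to avoid sending the full parameter set at every sync.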

Code & Models

Repositories

zjr2000/SPES (GitHub)
