Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism

Shengwei Li; Zhiquan Lai; Dongsheng Li; Yanqi Hao; Weijie Liu; Keshi Ge; Xiaoge Deng; Kai Lu

arXiv:2305.16121·cs.DC·July 1, 2025·1 cites

Open Access

TL;DR

Oases introduces an automated tensor model parallelism method with overlapped communication and computation, significantly improving large-scale model training efficiency on commodity servers.

Contribution

The paper presents Oases, a novel automated TMP approach with overlapped communication, including a fine-grained schedule and a cost-aware planner for optimal partitioning.

Findings

01 Achieves 1.01-1.48x speedup over state-of-the-art methods.

02 Up to 1.95x speedup over Megatron.

03 Effective on various models and commodity clusters.

Abstract

Deep learning is experiencing a rise in large-scale models. Training these models is costly, which has prompted researchers to train them on commodity servers that more researchers can access. The massive number of parameters necessitates model-parallel training methods. Existing studies focus on training with pipeline model parallelism. However, tensor model parallelism (TMP) becomes unavoidable as model size keeps increasing, and its frequent data-dependent communication and computation operations significantly reduce training efficiency. In this paper, we present Oases, an automated TMP method with overlapped communication to accelerate large-scale model training on commodity servers. Oases proposes a fine-grained training operation schedule to maximize the overlap of communication and computation that have data dependencies. Additionally, we design the…
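
The general idea of overlapping data-dependent communication and computation can be illustrated with a toy timing model. The sketch below is not the Oases schedule or planner; it is a minimal simulation under stated assumptions: each layer's computation and its following communication are split evenly across sub-batches, costs scale linearly with sub-batch size, and one compute stream and one communication stream run concurrently. The function names `serial_time` and `overlapped_time` and all cost values are hypothetical.

```python
def serial_time(comp, comm):
    """Total time when each communication blocks the next computation."""
    return sum(comp) + sum(comm)


def overlapped_time(comp, comm, num_chunks=2):
    """Total time when the batch is split into independent sub-batches.

    A sub-batch's communication runs on a separate stream, so another
    sub-batch can compute in the meantime. Within one sub-batch, the next
    layer still waits for that sub-batch's communication (data dependence).
    Assumption: cost scales linearly with sub-batch size (no fixed latency).
    """
    compute_free = 0.0              # time at which the compute stream is idle
    comm_free = 0.0                 # time at which the communication stream is idle
    ready = [0.0] * num_chunks      # time at which each sub-batch's input is available
    for layer_comp, layer_comm in zip(comp, comm):
        for c in range(num_chunks):
            start = max(compute_free, ready[c])
            compute_free = start + layer_comp / num_chunks
            comm_start = max(comm_free, compute_free)
            comm_free = comm_start + layer_comm / num_chunks
            ready[c] = comm_free
    return max(compute_free, comm_free)


if __name__ == "__main__":
    comp = [4, 4, 4]   # hypothetical per-layer compute cost
    comm = [2, 2, 2]   # hypothetical per-layer communication cost
    print("serial:    ", serial_time(comp, comm))
    print("overlapped:", overlapped_time(comp, comm, num_chunks=2))
```

In this toy model, one sub-batch reproduces the serial time, while two sub-batches hide most of the communication behind computation. Real systems also pay a fixed latency per message, which this sketch ignores.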

Taxonomy

Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Parallel Computing and Optimization Techniques