RTP: Rethinking Tensor Parallelism with Memory Deduplication
Cheng Luo, Tianle Zhong, Geoffrey Fox

TL;DR
RTP introduces a memory deduplication approach for distributed neural network training, enabling larger models with near-linear scalability and efficient memory use, while maintaining performance comparable to existing methods.
Contribution
The paper presents RTP, a novel tensor parallelism method that leverages memory deduplication, customized communication, and the Flyweight Pattern to optimize distributed training of large models.
Findings
Memory consumption close to optimal during training.
Supports larger models with near-linear scalability.
Achieves performance comparable to Distributed Data Parallel.
Abstract
In the evolving landscape of neural network models, one prominent challenge stands out: the significant memory overhead of training large models. To address this challenge, this study examines Rotated Tensor Parallelism (RTP), an approach that focuses on memory deduplication in distributed training environments. Its distinguishing features are a customized communication primitive and Flyweight Pattern initialization. RTP also overlaps partition computation with partition weight communication to optimize the training process. Our empirical evaluations underscore RTP's efficiency, showing that its memory consumption during distributed training is remarkably close to the optimal, that is, the memory overhead of a single machine distributed equally among multiple machines. The…
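The idea of keeping one weight partition per worker and moving partitions between workers can be illustrated with a small single-process simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract only says that RTP deduplicates memory and overlaps partition computation with partition weight communication. The ring rotation schedule, the column-wise sharding, and the names `matmul`, `rotated_forward`, `held`, and `parts` are hypothetical choices made for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rotated_forward(x_shards, w_shards):
    """Simulate n workers that each hold one weight shard and pass it around a ring.

    x_shards[i] is worker i's local input batch (rows of length d_in).
    w_shards[k] is the k-th column block of the full weight matrix.
    Returns, for each worker, its batch multiplied by the full weight.
    The ring schedule is an assumption, not taken from the paper.
    """
    n = len(w_shards)
    held = list(range(n))                   # held[i]: shard index currently on worker i
    parts = [[None] * n for _ in range(n)]  # parts[i][k]: x_shards[i] @ w_shards[k]
    for _ in range(n):
        for i in range(n):
            k = held[i]
            parts[i][k] = matmul(x_shards[i], w_shards[k])
        # Rotation step: worker i receives the shard that worker i - 1 just used.
        held = [held[(i - 1) % n] for i in range(n)]
    # Stitch each worker's column blocks back together in shard order.
    return [
        [sum((block[r] for block in parts[i]), []) for r in range(len(x_shards[i]))]
        for i in range(n)
    ]


# Two workers, a 2x4 weight split into two 2x2 column blocks.
w_full = [[1, 2, 3, 4], [5, 6, 7, 8]]
w_shards = [[row[0:2] for row in w_full], [row[2:4] for row in w_full]]
x_shards = [[[1, 0]], [[0, 1]]]
outputs = rotated_forward(x_shards, w_shards)
```

In this sketch each simulated worker stores only one shard at a time, so its weight memory is roughly 1/n of the full matrix, which matches the "optimal" described in the abstract. The simulation runs compute and rotation sequentially; a real system would presumably send the shard asynchronously while the local product is being computed.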
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques · Tensor decomposition and applications
