Tesseract: Parallelize the Tensor Parallelism Efficiently
Boxiang Wang, Qifan Xu, Zhengda Bian, Yang You

TL;DR
Tesseract introduces a tensor parallelism method that improves efficiency and scalability when training large deep learning models on limited GPU resources, by reducing communication overhead and lowering the memory required on each GPU.
Contribution
The paper proposes Tesseract, a tensor parallelism design that introduces an additional dimension into tensor parallelism. Compared with previous 1-D and 2-D methods, it reduces communication cost and lowers the memory required on each GPU, which improves scalability and efficiency.
Findings
Achieves 1.38x and 1.53x speedups over 1-D and 2-D methods, respectively.
Realizes up to 4.0x inference speedup and 3.4x throughput improvement.
Reduces communication overhead, enabling training of larger models with limited GPU resources.
Abstract
Together with the improvements in state-of-the-art accuracies of various tasks, deep learning models are getting significantly larger. However, it is extremely difficult to implement these large models because limited GPU memory makes it impossible to fit large models into a single GPU or even a GPU server. Besides, it is highly necessary to reduce the training time for large models. Previous methods like Megatron-LM implemented a 1-Dimensional distributed method to use GPUs to speed up the training. However, these methods have a high communication overhead and a low scaling efficiency on large-scale clusters. To solve these problems, we propose Tesseract, a highly scalable tensor parallelism with a novel design. It increases efficiency by reducing communication overhead and lowers the memory required for each GPU. By introducing the novel dimension into tensor parallelism, Tesseract…
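The abstract's point about per-GPU memory can be illustrated with a generic blocked matrix multiplication. The sketch below is not Tesseract's algorithm and makes no claim about its communication pattern. It assumes a hypothetical q x q grid of "devices" simulated as dictionary keys, with helper names (`block`, `grid_matmul`, `assemble`) invented for illustration. It shows only that each device can hold 1/q² of each matrix and the grid still reproduces the full product.

```python
# Illustrative only: "devices" are dictionary keys; there are no real GPUs
# and no real communication. All helper names are hypothetical.

def matmul(a, b):
    """Plain dense product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block(m, i, j, q):
    """Return block (i, j) of m when m is split into a q x q grid of equal blocks."""
    rows, cols = len(m) // q, len(m[0]) // q
    return [r[j * cols:(j + 1) * cols] for r in m[i * rows:(i + 1) * rows]]

def grid_matmul(a, b, q):
    """Compute a @ b block by block; device (i, j) owns output block (i, j)."""
    out = {}
    for i in range(q):
        for j in range(q):
            acc = None
            for k in range(q):
                # On real hardware, block A[i][k] and block B[k][j] would have
                # to be communicated to device (i, j) at this step.
                partial = matmul(block(a, i, k, q), block(b, k, j, q))
                acc = partial if acc is None else add(acc, partial)
            out[(i, j)] = acc
    return out

def assemble(blocks, q):
    """Stitch the per-device output blocks back into one matrix."""
    result = []
    for i in range(q):
        for r in range(len(blocks[(i, 0)])):
            row = []
            for j in range(q):
                row.extend(blocks[(i, j)][r])
            result.append(row)
    return result

A = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
B = [[float((r + c) % 3) for c in range(4)] for r in range(4)]
C_grid = assemble(grid_matmul(A, B, q=2), q=2)
assert C_grid == matmul(A, B)  # the blocked result matches the full product
```

With `q=2`, each simulated device stores a 2 x 2 block (4 elements) of a 16-element matrix. The cost of that saving is the block exchange marked in the inner loop, which is the communication overhead that tensor parallelism designs try to reduce.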
