TL;DR
This paper introduces CAAT-Net, a communication-aware architecture for tensor parallelism in large language models. It reduces tensor-parallel communication by up to 50% with no significant drop in pretraining accuracy on nearly all evaluated benchmarks, and it accelerates both training and inference.
Contribution
The paper proposes CAAT-Net, an approach that trains tensor-parallel LLMs without fully synchronizing activations. It demonstrates the bandwidth reduction on a 7B-parameter model and tests robustness and scalability on smaller 130M and 1.1B models.
Findings
Tensor-parallel communication can be reduced by up to 50%.
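Pretraining accuracy shows no significant drop across nearly all evaluated benchmarks when communication is reduced.
In some scenarios, validation loss even improves when communication is reduced.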
Training and inference are accelerated across various model sizes.
Abstract
Training and inference of Large Language Models (LLMs) with tensor-parallelism require substantial communication to synchronize activations. Our findings suggest that with a few minor adjustments to current practices, LLMs can be trained without fully synchronizing activations, reducing bandwidth demands. We name this "Communication-Aware Architecture for Tensor-parallelism" (CAAT-Net). We train a 7B-parameter CAAT-Net model and show that tensor-parallel communication can be reduced by up to 50% with no significant drop in pretraining accuracy across nearly all evaluated benchmarks. We also experiment with smaller 130M and 1.1B models to show the robustness and scalability of our method. We find that, in some scenarios, validation loss can even improve when reducing communication. Finally, we demonstrate how CAAT-Net accelerates both training and inference workloads across various…
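The abstract does not spell out how activations are left partially unsynchronized. As a rough illustration only, the sketch below assumes (this is an assumption, not something stated above) that a fixed fraction of hidden channels is summed across devices while the remaining channels stay device-local. The names `allreduce_elements`, `partial_allreduce`, and `sync_fraction` are hypothetical, and plain Python lists stand in for real collective communication.

```python
def allreduce_elements(tokens: int, hidden: int, sync_fraction: float = 1.0) -> int:
    """Count the activation elements one all-reduce would have to synchronize.

    sync_fraction is a hypothetical knob: 1.0 means every hidden channel is
    synchronized (standard tensor parallelism), 0.5 means half of them are.
    """
    if not 0.0 <= sync_fraction <= 1.0:
        raise ValueError("sync_fraction must be in [0, 1]")
    synced_channels = int(hidden * sync_fraction)
    return tokens * synced_channels


def partial_allreduce(shards: list[list[float]], sync_fraction: float) -> list[list[float]]:
    """Simulate a partial all-reduce over per-device activation vectors.

    Each entry of `shards` is one device's partial output for a single token.
    Assumption: the first k channels are summed across devices and the
    remaining channels are kept as each device computed them.
    """
    hidden = len(shards[0])
    k = int(hidden * sync_fraction)
    summed = [sum(shard[c] for shard in shards) for c in range(k)]
    return [summed + list(shard[k:]) for shard in shards]


# Two simulated devices, four hidden channels, half of the channels synchronized.
outputs = partial_allreduce([[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]], 0.5)
full = allreduce_elements(tokens=4, hidden=8, sync_fraction=1.0)
half = allreduce_elements(tokens=4, hidden=8, sync_fraction=0.5)
print(outputs)
print(half / full)
```

Under this assumption, the volume sent per all-reduce scales linearly with the synchronized fraction, so synchronizing half of the channels corresponds to the "up to 50%" communication reduction quoted in the abstract.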