T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair

arXiv:2401.16677 · cs.AR · January 31, 2024 · 1 citation

TL;DR

T3 introduces a hardware-software co-designed approach that transparently overlaps serialized communication with computation in large language model training, improving training efficiency as models scale across more devices.

Contribution

T3 proposes a hardware-software co-design that transparently overlaps serialized communication with the computation producing the communicated data and reduces the resource contention between them, improving training efficiency for large models.
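
Why fine-grained overlap helps, and why contention erodes the gain, can be seen with a back-of-the-envelope timing model. This is a minimal sketch, not the paper's performance model: it assumes the producer's output splits into equal chunks, that a chunk's communication can start as soon as that chunk is produced, and that contention is a single hypothetical `slowdown` factor. `serialized_time` and `overlapped_time` are illustrative names, and all numbers are in arbitrary time units.

```python
def serialized_time(compute: float, comm: float) -> float:
    """The whole producer finishes before any communication starts."""
    return compute + comm


def overlapped_time(compute: float, comm: float, chunks: int,
                    slowdown: float = 1.0) -> float:
    """Two-stage pipeline: each chunk is communicated once it is produced.

    slowdown (>= 1.0) is a hypothetical contention factor: for simplicity
    it stretches every chunk's compute and communication time, as if the
    two always shared compute and memory resources.
    """
    c = compute / chunks * slowdown  # per-chunk compute time
    m = comm / chunks * slowdown     # per-chunk communication time
    produced = 0.0  # time at which the current chunk finishes computing
    sent = 0.0      # time at which the previous chunk finishes sending
    for _ in range(chunks):
        produced += c                   # the producer never stalls
        sent = max(sent, produced) + m  # wait for the data and the link
    return sent


print(serialized_time(8, 4))                           # 12
print(overlapped_time(8, 4, chunks=4))                 # 9.0
print(overlapped_time(8, 4, chunks=4, slowdown=1.25))  # 11.25
```

With these made-up inputs, four chunks cut the time from 12 to 9 units, while a 25% contention slowdown leaves only 11.25. This is why reducing resource contention matters as much as enabling the overlap itself.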

Findings

01. Speeds up communication-heavy sublayers by 30% on average.
02. Reduces data movement by 22% on average.
03. Benefits persist in models with up to 500 billion parameters.

Abstract

Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap (and thus hide) this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy.…
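
The fine-grained interleaving described above can be pictured in software as follows, loosely following the "tracking & triggering" in the title. This is a minimal sketch under stated assumptions and not T3's hardware mechanism: `ChunkTracker` and `row_parallel_matvec` are hypothetical names, the devices are simulated in one process, and summing the partial rows stands in for the cross-device reduction that a row-parallel TP layer needs before its output can be used.

```python
from typing import Callable, List, Tuple

Matrix = List[List[float]]


class ChunkTracker:
    """Hypothetical software stand-in for a per-chunk write tracker.

    Counts stores into each output chunk and calls on_ready(chunk_id)
    exactly once, when that chunk has received every expected store.
    """

    def __init__(self, num_chunks: int, stores_per_chunk: int,
                 on_ready: Callable[[int], None]) -> None:
        self.expected = stores_per_chunk
        self.counts = [0] * num_chunks
        self.on_ready = on_ready

    def record_store(self, chunk_id: int) -> None:
        self.counts[chunk_id] += 1
        if self.counts[chunk_id] == self.expected:
            self.on_ready(chunk_id)  # trigger: this chunk is complete


def row_parallel_matvec(w: Matrix, x: List[float], num_devices: int,
                        chunk_rows: int) -> Tuple[List[float], List[str]]:
    """Computes y = W @ x with the input dimension sharded across devices.

    Each simulated device produces a partial y covering every output row.
    Summing the partials (a stand-in for the TP all-reduce) is triggered
    per chunk of rows, as soon as all devices have written that chunk.
    """
    rows, cols = len(w), len(x)
    assert cols % num_devices == 0 and rows % chunk_rows == 0
    shard = cols // num_devices
    partials = [[0.0] * rows for _ in range(num_devices)]
    y = [0.0] * rows
    log: List[str] = []

    def reduce_chunk(c: int) -> None:
        for r in range(c * chunk_rows, (c + 1) * chunk_rows):
            y[r] = sum(p[r] for p in partials)
        log.append(f"reduce chunk {c}")

    tracker = ChunkTracker(rows // chunk_rows, chunk_rows * num_devices,
                           reduce_chunk)
    for r in range(rows):  # the producer emits output rows in order
        log.append(f"compute row {r}")
        for d in range(num_devices):
            lo = d * shard
            partials[d][r] = sum(w[r][k] * x[k] for k in range(lo, lo + shard))
            tracker.record_store(r // chunk_rows)
    return y, log


w = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 0, 1, 0], [0, 1, 0, 1]]
y, log = row_parallel_matvec(w, [1.0, 1.0, 1.0, 1.0], num_devices=2,
                             chunk_rows=2)
print(y)    # [10.0, 26.0, 2.0, 2.0]
print(log)  # "reduce chunk 0" appears before "compute row 2"
```

The log shows chunk 0 being reduced before row 2 is computed. A serialized schedule would instead start every reduction only after the last row, which is the exposed communication the abstract attributes to TP.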
