T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair

TL;DR
T3 introduces a hardware-software co-designed approach to transparently overlap serialized communication with computation in large language model training, significantly improving efficiency and scaling performance.
Contribution
T3 proposes a novel hardware-software co-design that transparently overlaps communication and computation, reducing resource contention and improving training efficiency for large models.
Findings
Speeds up communication-heavy sublayers by 30% on average
Reduces data movement by 22% on average
Benefits persist in models with up to 500 billion parameters
Abstract
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy.…
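The fine-grained interleaving described in the abstract can be illustrated with a small software sketch. It is not T3's hardware mechanism, only a toy model under stated assumptions: a tensor-parallel layer whose shared (K) dimension is split across devices, so each device produces a partial output that must be summed (all-reduced) across devices. The baseline finishes the whole GEMM before reducing. The interleaved variant reduces each chunk of output rows as soon as it is produced, which is what would let a real system overlap the reduction of one chunk with the computation of the next. All function names here (`shard_k`, `serialized`, `interleaved`) are hypothetical, and the devices are simulated sequentially in one process.

```python
from typing import List, Tuple

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum; stands in for the reduction step of an all-reduce."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def shard_k(a: Matrix, b: Matrix, num_devices: int) -> List[Tuple[Matrix, Matrix]]:
    """Split the shared K dimension across simulated devices.

    Assumes K is divisible by num_devices (an assumption of this sketch).
    """
    step = len(b) // num_devices
    shards = []
    for d in range(num_devices):
        lo, hi = d * step, (d + 1) * step
        shards.append(([row[lo:hi] for row in a], b[lo:hi]))
    return shards


def serialized(a: Matrix, b: Matrix, num_devices: int) -> Matrix:
    """Baseline: each device finishes its full GEMM, then one reduction runs."""
    partials = [matmul(a_d, b_d) for a_d, b_d in shard_k(a, b, num_devices)]
    out = partials[0]
    for p in partials[1:]:
        out = add(out, p)
    return out


def interleaved(a: Matrix, b: Matrix, num_devices: int, chunk_rows: int):
    """Fine-grained variant: reduce each chunk of output rows once it exists."""
    shards = shard_k(a, b, num_devices)
    out: Matrix = []
    schedule = []  # records the order of steps, for illustration only
    for start in range(0, len(a), chunk_rows):
        # Every device computes only this chunk of its partial output.
        chunk_partials = [matmul(a_d[start:start + chunk_rows], b_d)
                          for a_d, b_d in shards]
        schedule.append(("compute", start))
        # The chunk is reduced immediately instead of waiting for the full GEMM.
        reduced = chunk_partials[0]
        for p in chunk_partials[1:]:
            reduced = add(reduced, p)
        schedule.append(("reduce", start))
        out.extend(reduced)
    return out, schedule


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [[1, 0], [0, 1], [2, 1], [1, 2]]
result, schedule = interleaved(a, b, num_devices=2, chunk_rows=2)
```

Both variants produce the same output as an unsharded multiply; only the order of the steps differs. In this single-process simulation the steps still run one after another, so it shows the dependency structure that permits overlap, not the resource contention the abstract mentions.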
Taxonomy
Topics: Modular Robots and Swarm Intelligence
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Residual Connection · Dropout · Layer Normalization · Multi-Head Attention · Adam · Softmax · Dense Connections
