Computation vs. Communication Scaling for Future Transformers on Future Hardware
Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D. Sinclair

TL;DR
This paper analyzes how compute and communication scale in future Transformer models and hardware, revealing that communication will become increasingly significant and challenging as models grow larger.
Contribution
It provides a comprehensive multi-axial analysis of compute versus communication scaling for future Transformers on evolving hardware, including empirical projections and cost reduction methods.
Findings
Compute generally outpaces communication as models scale.
Communication will constitute 40-75% of runtime in future models.
Communication that is hidden behind compute in current models may become exposed in larger future models.
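The first finding can be illustrated with a back-of-the-envelope model. The sketch below is not the paper's methodology: it assumes a standard Transformer layer with a 4x MLP expansion, Megatron-style tensor parallelism with two activation all-reduces per layer in the forward pass, and 2-byte (FP16) activations. All function names and parameters are hypothetical, chosen for illustration only.

```python
def layer_compute_flops(batch, seq, hidden):
    """Approximate forward-pass FLOPs for one Transformer layer (GEMMs only).

    Assumption: 4x MLP expansion. The QKV and output projections cost
    8*B*S*H^2 FLOPs, the MLP costs 16*B*S*H^2, and the attention
    score/context matmuls cost 4*B*S^2*H.
    """
    return 24 * batch * seq * hidden**2 + 4 * batch * seq**2 * hidden


def layer_allreduce_bytes(batch, seq, hidden, bytes_per_elem=2):
    """Activation bytes all-reduced per layer in the forward pass.

    Assumption: tensor parallelism with two all-reduces per layer, each
    over a B x S x H activation tensor.
    """
    return 2 * batch * seq * hidden * bytes_per_elem


def flops_per_comm_byte(batch, seq, hidden, tp_degree):
    """Per-device compute FLOPs for each byte of all-reduced activations."""
    per_device_flops = layer_compute_flops(batch, seq, hidden) / tp_degree
    return per_device_flops / layer_allreduce_bytes(batch, seq, hidden)


if __name__ == "__main__":
    # Hypothetical model sizes; sequence length and TP degree held fixed.
    for hidden in (1024, 4096, 16384):
        ratio = flops_per_comm_byte(batch=1, seq=2048, hidden=hidden, tp_degree=8)
        print(f"hidden={hidden:6d}  FLOPs per communicated byte = {ratio:.0f}")
```

Under these assumptions the ratio simplifies to (6H + S) / TP, so it grows with the hidden size H: compute scales roughly quadratically in H while the all-reduced data scales linearly. The same expression shows that raising the tensor-parallel degree divides the per-device compute without shrinking the all-reduced tensor, which erodes that edge.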
Abstract
Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how compute and communication will scale relative to one another as models scale and hardware evolves. A careful study that answers this question can better guide the design of future systems that can efficiently train future large models. To that end, this work provides a comprehensive multi-axial (algorithmic, empirical, hardware evolution) analysis of compute vs. communication (Comp-vs.-Comm) scaling for future Transformer models on future hardware. First, our algorithmic analysis shows that compute generally enjoys an edge over communication as models scale. However, since memory capacity scales slower than…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Advanced Memory and Neural Computing
