Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
Marco Kurzynski, Shaizeen Aga, Di Wu

TL;DR
This paper investigates how thermal imbalance in multi-GPU systems causes performance variation during LLM training, and proposes models and mitigation techniques to improve efficiency and reduce costs.
Contribution
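The paper introduces the Lit Silicon effect, in which thermally induced straggling coupled with concurrent computation and communication (C3) drives performance variation across the GPUs of a node. It analyzes the effect's impact on GPU performance and develops detection and power management strategies to mitigate thermal imbalance.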
Findings
Thermal imbalance causes node-level GPU performance variation.
Mitigation techniques can improve performance by up to 6%.
Power management solutions can save significant energy costs.
Abstract
GPU systems are increasingly powering modern datacenters at scale. Despite being highly performant, GPU systems can exhibit performance variation at the node and cluster levels. Such performance variation can significantly impact both high-performance computing and artificial intelligence workloads, such as cutting-edge large language models (LLMs). In this work, we analyze the performance of a single-node multi-GPU system running LLM training, and observe that the kernel-level performance variation is highly correlated with concurrent computation and communication (C3), a technique to overlap computation and communication across GPUs for performance gains. We then take a further step to reason that thermally induced straggling coupled with C3 impacts performance variation, which we coin the Lit Silicon effect. More specifically, Lit Silicon describes that in a multi-GPU node, thermal…
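To make the straggling idea concrete, here is a minimal sketch under a deliberately simplified model that is not taken from the paper: each GPU's kernel time is assumed to scale inversely with its clock, and a synchronizing collective ends the step only when the slowest GPU finishes. The function names and clock values are hypothetical and serve only as illustration.

```python
def step_time(work_units, clocks_mhz):
    """Per-step time when all GPUs synchronize at the end of the step.

    Assumption (not from the paper): kernel time is work / clock, and the
    step finishes only when the slowest GPU finishes.
    """
    per_gpu = [work_units / clock for clock in clocks_mhz]
    return max(per_gpu)


def slowdown_from_straggler(clocks_mhz):
    """Ratio of the actual step time to the step time if every GPU
    ran at the fastest clock observed in the node."""
    ideal = step_time(1.0, [max(clocks_mhz)] * len(clocks_mhz))
    actual = step_time(1.0, clocks_mhz)
    return actual / ideal


# Hypothetical clocks for an 8-GPU node: one hotter GPU throttles to 1900 MHz.
clocks = [2000, 2000, 2000, 2000, 2000, 2000, 2000, 1900]
print(f"node-level slowdown: {slowdown_from_straggler(clocks):.4f}x")
```

In this toy model a single throttled GPU sets the pace for the whole node, which is why thermal imbalance on one device can show up as node-level performance variation.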
