Code generation and runtime techniques for enabling data-efficient deep learning training on GPUs
Kun Wu

TL;DR
This dissertation introduces runtime and code generation techniques, integrated into PyTorch, that improve data efficiency in GPU-based deep learning training by addressing memory and data transfer bottlenecks for graph neural networks (GNNs) and large language models (LLMs).
Contribution
It proposes methods such as PyTorch-Direct, Hector IR, and SSDTrain to optimize data transfers and memory management, improving training efficiency for large models.
Findings
PyTorch-Direct improves PCIe data transfer efficiency.
Hector IR enables high-level abstractions for GNNs.
SSDTrain effectively offloads data in LLM training.
Abstract
As deep learning models scale, their training cost has surged significantly. Due to both hardware advancements and limitations in current software stacks, the need for data efficiency has risen. Data efficiency refers to the effective hiding of data access latency and the avoidance of unnecessary data movements. Major challenges arise from the growing disparity between GPU memory bandwidth and computational throughput, imminent GPU memory capacity limitations, and inefficiencies in the PyTorch software stack, including a lack of device-specific PCIe transfer optimizations and high-level domain-specific abstractions. To effectively mitigate these data inefficiencies for deep learning training, this dissertation analyzes data inefficiency in representative deep learning training tasks, specifically in graph neural networks (GNNs) and large language models (LLMs). It then proposes novel runtime and…
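To make the abstract's definition of data efficiency more concrete, the following is a minimal sketch of hiding data access latency by overlapping loading with compute. It is a generic, standard-library-only illustration and not the dissertation's implementation; the names prefetching_loader, slow_load, train_step, and run are hypothetical, and the sleep calls merely stand in for a slow transfer and a training step.

```python
import queue
import threading
import time


def prefetching_loader(load_batch, num_batches, depth=2):
    """Yield batches while a background thread loads the next ones.

    Hypothetical helper for illustration: load_batch(i) returns batch i,
    and depth bounds how many batches may be buffered ahead of the consumer.
    """
    buffer = queue.Queue(maxsize=depth)  # bounded so the loader cannot run far ahead
    sentinel = object()                  # marks the end of the stream

    def producer():
        try:
            for i in range(num_batches):
                buffer.put(load_batch(i))  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)           # always signal the end so the consumer cannot hang

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def slow_load(i):
    time.sleep(0.01)  # stand-in for a storage read or host-to-device transfer
    return [i] * 4


def train_step(batch):
    time.sleep(0.01)  # stand-in for compute on the batch
    return sum(batch)


def run(num_batches=8):
    # While train_step works on batch i, the producer is already loading batch i + 1.
    return [train_step(b) for b in prefetching_loader(slow_load, num_batches)]
```

In this sketch the bounded queue is the design choice that matters: it lets loading run ahead of compute by at most depth batches, so latency is overlapped without buffering an unbounded amount of data in memory.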
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Parallel Computing and Optimization Techniques
