Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models
Shang Wang, Peiming Yang, Yuxuan Zheng, Xin Li, Gennady Pekhimenko

TL;DR
This paper introduces HFTA, a framework extension that horizontally fuses multiple deep learning training jobs to improve hardware utilization and significantly increase training throughput on accelerators.
Contribution
The paper proposes HFTA, a novel method to fuse multiple DL training jobs horizontally, enhancing hardware efficiency and throughput for repetitive workloads.
Findings
HFTA achieves up to 15.1x higher training throughput than the standard practice of running each job on a separate accelerator.
HFTA effectively utilizes GPU and TPU resources.
Horizontally fusing same-type, same-shape operators across models is mathematically equivalent to other operators that are already well optimized.
Abstract
Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models has increased staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insights into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners…
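The equivalence in observation (ii) can be illustrated with a small sketch. This is not HFTA's API or implementation; the function names, shapes, and values below are hypothetical, chosen only to show that several identically shaped linear layers from independent jobs can be computed by one larger operator (here, a single block-diagonal matrix-vector product) with the same results.

```python
# Illustrative sketch only: names, shapes, and values are assumptions, not HFTA's API.

def matvec(weight, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def run_separately(weights, inputs):
    """Unfused baseline: one small operator per model (one per training job)."""
    return [matvec(w, x) for w, x in zip(weights, inputs)]

def fuse_weights(weights):
    """Place each model's weight on the diagonal of one block-diagonal matrix."""
    num_models = len(weights)
    n_out = len(weights[0])
    n_in = len(weights[0][0])
    fused = [[0.0] * (n_in * num_models) for _ in range(n_out * num_models)]
    for b, w in enumerate(weights):
        for i, row in enumerate(w):
            for j, value in enumerate(row):
                fused[b * n_out + i][b * n_in + j] = value
    return fused

def run_fused(weights, inputs):
    """One operator over the concatenated inputs; the result is split per model."""
    n_out = len(weights[0])
    fused_x = [v for x in inputs for v in x]
    fused_y = matvec(fuse_weights(weights), fused_x)
    return [fused_y[b * n_out:(b + 1) * n_out] for b in range(len(weights))]

# Two hypothetical models (e.g., two hyper-parameter trials) with same-shape layers.
weights = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, -1.0, 0.0], [2.0, 0.0, -2.0]],
]
inputs = [[1.0, 0.0, -1.0], [2.0, 4.0, 6.0]]

separate = run_separately(weights, inputs)
fused = run_fused(weights, inputs)
print(separate == fused)  # the fused operator reproduces each model's output
```

The dense block-diagonal form is used here only because it makes the equivalence easy to verify; it wastes work on the zero blocks. In practice, frameworks expose operators such as batched matrix multiplication or grouped convolution that compute the same result without materializing the zeros.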
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques
