The Case for Co-Designing Model Architectures with Hardware
Quentin Anthony, Jacob Hatef, Deepak Narayanan, Stella Biderman, Stas Bekman, Junqi Yin, Aamir Shafi, Hari Subramoni, Dhabaleswar Panda

TL;DR
This paper emphasizes the importance of co-designing deep learning model architectures with hardware considerations, demonstrating that optimized shapes can significantly boost GPU training throughput without sacrificing accuracy.
Contribution
It provides practical guidelines for designing transformer models optimized for GPU hardware, highlighting the impact of model shape on computational efficiency.
Findings
Optimized model shapes increase throughput by up to 39%.
Guidelines improve GPU training efficiency for transformer models.
Model accuracy is preserved despite shape optimizations.
Abstract
While GPUs are responsible for training the vast majority of state-of-the-art deep learning models, the implications of their architecture are often overlooked when designing new deep learning (DL) models. As a consequence, modifying a DL model to be more amenable to the target hardware can significantly improve the runtime performance of DL training and inference. In this paper, we provide a set of guidelines for users to maximize the runtime performance of their transformer models. These guidelines have been created by carefully considering the impact of various model hyperparameters controlling model shape on the efficiency of the underlying computation kernels executed on the GPU. We find the throughput of models with efficient model shapes is up to 39% higher while preserving accuracy compared to models with a similar number of parameters but with unoptimized shapes.
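To make the idea of an "efficient model shape" concrete, the sketch below checks whether a few transformer dimensions are aligned to a power-of-two multiple. It is an illustration only: the abstract does not state the exact rules, so the default alignment of 64, the choice of dimensions to inspect, and the names `largest_power_of_two_divisor` and `check_shape` are all assumptions made for this example, based on the general observation that GPU matrix-multiply kernels tend to run best on dimensions divisible by a large power of two.

```python
def largest_power_of_two_divisor(n: int) -> int:
    """Return the largest power of two that divides n (n must be positive)."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # n & -n isolates the lowest set bit, which is the largest 2^k dividing n.
    return n & -n


def check_shape(hidden_size: int, num_heads: int, vocab_size: int,
                alignment: int = 64) -> dict:
    """Report alignment of a few shape hyperparameters.

    The alignment threshold (64) is an assumed rule of thumb, not a value
    taken from the abstract.
    """
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    dims = {
        "hidden_size": hidden_size,
        "head_dim": hidden_size // num_heads,
        "vocab_size": vocab_size,
    }
    return {
        name: {
            "value": value,
            "pow2_divisor": largest_power_of_two_divisor(value),
            "aligned": value % alignment == 0,
        }
        for name, value in dims.items()
    }


if __name__ == "__main__":
    # Two hypothetical configurations with the same hidden size.
    for heads, vocab in [(32, 50257), (40, 50304)]:
        print(heads, vocab, check_shape(2560, heads, vocab))
```

With a hidden size of 2560, choosing 32 heads gives a head dimension of 80, which is divisible by 16 but not by 64, whereas 40 heads gives a head dimension of exactly 64. Whether such a change actually improves throughput depends on the GPU and kernels in use, so a check like this is a starting point for benchmarking rather than a guarantee.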
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques
Methods: Sparse Evolutionary Training
