DVFS-Aware DNN Inference on GPUs: Latency Modeling and Performance Analysis
Yunchu Han, Zhaojun Nan, Sheng Zhou, Zhisheng Niu

TL;DR
This paper presents a DVFS-aware latency model for GPU-based DNN inference, significantly improving accuracy over CPU-based models and enabling substantial reductions in inference time and energy consumption.
Contribution
The paper introduces a GPU-specific, DVFS-aware inference time model, validated through extensive experiments, that outperforms CPU-based models when used to optimize DNN inference.
Findings
Reduces inference time by at least 66% compared with optimization based on a CPU-based model.
Reduces energy consumption by at least 69% under the same comparison.
Improves partition policies for cooperative inference.
Abstract
The rapid development of deep neural networks (DNNs) is inherently accompanied by the problem of high computational costs. To tackle this challenge, dynamic voltage frequency scaling (DVFS) is emerging as a promising technology for balancing the latency and energy consumption of DNN inference by adjusting the computing frequency of processors. However, most existing models of DNN inference time are based on the CPU-DVFS technique, and directly applying the CPU-DVFS model to DNN inference on GPUs will lead to significant errors in optimizing latency and energy consumption. In this paper, we propose a DVFS-aware latency model to precisely characterize DNN inference time on GPUs. We first formulate the DNN inference time based on extensive experiment results for different devices and analyze the impact of fitting parameters. Then by dividing DNNs into multiple blocks and obtaining the…
Taxonomy
Topics: Brain Tumor Detection and Classification · Advanced Neural Network Applications · Machine Learning and ELM
