Enabling On-Device Smartphone GPU based Training: Lessons Learned
Anish Das, Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo

TL;DR
This paper investigates the feasibility of on-device DNN training on smartphones using GPUs, identifies key bottlenecks, and provides optimization insights and practical guidelines for future development.
Contribution
It offers the first detailed analysis of on-device training on smartphone GPUs, identifying the bottlenecks and proposing kernel optimizations to address them.
Findings
Using MNN's OpenCL backend, training on the GPU is slower than on the CPU; the authors attribute this to computation and memory bottlenecks.
Kernel optimizations doubled GPU performance (40-70 GFLOPs).
Memory bandwidth limits data movement, affecting training speed.
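The compute-versus-memory distinction in the findings above can be illustrated with a roofline-style estimate. The sketch below is not taken from the paper: the function names are hypothetical, the 30 GB/s bandwidth is an assumed placeholder rather than a measured value, and the 70 GFLOPs peak simply reuses the upper end of the range quoted above. It shows how a layer's arithmetic intensity (FLOPs per byte moved) decides whether bandwidth or compute caps its throughput.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline bound: throughput is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


def matmul_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n].

    Simplifying assumption: each matrix crosses the memory bus exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


PEAK_GFLOPS = 70.0    # upper end of the 40-70 GFLOPs range quoted above
BANDWIDTH_GBS = 30.0  # ASSUMPTION: illustrative value, not from the paper

# Batch size 1 versus batch size 64 for a hypothetical 512x512 dense layer.
for m, n, k in [(1, 512, 512), (64, 512, 512)]:
    intensity = matmul_intensity(m, n, k)
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    regime = "memory-bound" if bound < PEAK_GFLOPS else "compute-bound"
    print(f"m={m:3d}: {intensity:6.2f} FLOPs/byte, "
          f"bound {bound:5.1f} GFLOPs ({regime})")
```

Under these assumed numbers the small-batch case is limited by bandwidth and the larger batch by peak compute, which is one way to see why faster kernels alone do not remove a memory bottleneck.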
Abstract
Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing works have focused on reducing the computational and resource overheads of running Deep Neural Networks (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operations, i.e., training (forward and backward passes) on smartphone GPUs, has received little attention thus far. To this end, we conduct an initial analysis to examine the feasibility of on-device training on smartphones using mobile GPUs. We first employ the open-source mobile DL framework (MNN) and its OpenCL backend for running compute kernels on GPUs. Next, we observe that training on CPUs is much faster than on GPUs and identify two possible bottlenecks related to this observation: (i) computation and (ii) memory bottlenecks. To solve the computation bottleneck, we optimize the OpenCL…
