Accelerating Mobile Inference through Fine-Grained CPU-GPU Co-Execution
Zhuojin Li, Marco Paolieri, Leana Golubchik

TL;DR
This paper accelerates neural network inference on mobile devices by running tasks on the CPU and GPU together, using a lightweight synchronization mechanism and machine learning models that predict execution times. It reports speedups of up to 1.89x for linear layers and 1.75x for convolutional layers.
Contribution
The paper proposes a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM), together with machine learning models that accurately predict execution times, in order to optimize CPU-GPU collaborative inference on mobile devices.
Findings
Achieves up to 1.89x speedup for linear layers
Achieves up to 1.75x speedup for convolutional layers
Reported speedups are close to the maximum achievable on the mobile platforms evaluated
Abstract
Deploying deep neural networks on mobile devices is increasingly important but remains challenging due to limited computing resources. On the other hand, their unified memory architecture and narrower gap between CPU and GPU performance provide an opportunity to reduce inference latency by assigning tasks to both CPU and GPU. The main obstacles for such collaborative execution are the significant synchronization overhead required to combine partial results, and the difficulty of predicting execution times of tasks assigned to CPU and GPU (due to the dynamic selection of implementations and parallelism level). To overcome these obstacles, we propose both a lightweight synchronization mechanism based on OpenCL fine-grained shared virtual memory (SVM) and machine learning models to accurately predict execution times. Notably, these models capture the performance characteristics of GPU…
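To make the trade-off in the abstract concrete, here is a minimal sketch of how predicted execution times and a synchronization cost could be combined to choose a CPU/GPU split for one layer. It is not the authors' implementation: the function `best_split`, the linear cost functions `cpu_time_us` and `gpu_time_us`, and all numbers are hypothetical stand-ins for the learned predictors the paper describes.

```python
from typing import Callable, Tuple


def best_split(total_units: int,
               predict_cpu: Callable[[int], float],
               predict_gpu: Callable[[int], float],
               sync_overhead: float) -> Tuple[int, float]:
    """Return (units assigned to the CPU, predicted latency).

    `total_units` is the amount of divisible work in a layer (for example,
    output channels). The predictors map a unit count to a predicted time.
    Co-execution finishes when the slower device finishes, plus the cost of
    synchronizing the partial results.
    """
    # Baseline: run everything on the GPU, with no synchronization cost.
    best_k, best_latency = 0, predict_gpu(total_units)
    for k in range(1, total_units + 1):
        if k == total_units:
            # Everything on the CPU: again no synchronization cost.
            latency = predict_cpu(total_units)
        else:
            latency = max(predict_cpu(k), predict_gpu(total_units - k)) + sync_overhead
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k, best_latency


# Hypothetical linear cost models in microseconds (assumptions, not measurements).
def cpu_time_us(units: int) -> float:
    return 30.0 * units


def gpu_time_us(units: int) -> float:
    return 20.0 * units


cpu_units, latency_us = best_split(100, cpu_time_us, gpu_time_us, sync_overhead=100.0)
speedup = gpu_time_us(100) / latency_us  # relative to GPU-only execution
```

With these toy numbers the search assigns 40 of 100 units to the CPU, for a predicted latency of 1300 microseconds against 2000 for GPU-only execution. Raising `sync_overhead` far enough makes the search fall back to a single device, which illustrates why the abstract names synchronization overhead as a main obstacle to collaborative execution.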
