Taming Asynchronous CPU-GPU Coupling for Frequency-aware Latency Estimation on Mobile Edge
Jiesong Chen, Jun You, Zhidan Liu, Zhenjiang Li

TL;DR
This paper presents FLAME, a method for accurately estimating inference latency across CPU-GPU frequency combinations on mobile edge devices, addressing the challenges posed by asynchronous CPU-GPU coupling under Dynamic Voltage and Frequency Scaling (DVFS).
Contribution
FLAME introduces a layer-wise modeling approach that captures asynchronous interactions, enabling fast and precise latency profiling for diverse models including SLMs.
Findings
FLAME reduces profiling time from hours to minutes for DNNs.
It achieves small estimation errors across frequency ranges.
It outperforms the state of the art in deadline-aware DVFS applications.
Abstract
Precise estimation of model inference latency is crucial for time-critical mobile edge applications, enabling devices to calculate latency margins against deadlines and trade them for enhanced model performance or resource savings. However, the ubiquity of Dynamic Voltage and Frequency Scaling (DVFS) renders traditional static profiling invalid in real-world deployments, as inference latency fluctuates with varying processor (CPU and GPU) frequencies. While extensive profiling across frequency combinations is theoretically possible, it is prohibitively expensive, particularly for emerging Small Language Models (SLMs), where variable context lengths can stretch profiling time to days. We observe that simple analytic scaling fails to predict these fluctuations due to the complex asynchronous coupling between CPU (kernel launching) and GPU (execution). In this paper, we introduce FLAME to…
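The abstract's point about asynchronous coupling can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not FLAME's actual model: it assumes the CPU launches one kernel per layer in order, the GPU starts a kernel only once it has been launched and the previous kernel has finished, and each stage's time is its cycle count divided by the processor frequency. The function name `pipeline_latency` and all cycle counts and frequencies are hypothetical values chosen for illustration.

```python
def pipeline_latency(launch_cycles, exec_cycles, f_cpu, f_gpu):
    """Toy model of asynchronous CPU launch and GPU execution (illustrative only).

    launch_cycles[i]: CPU cycles to launch the kernel of layer i (hypothetical).
    exec_cycles[i]:   GPU cycles to execute the kernel of layer i (hypothetical).
    f_cpu, f_gpu:     frequencies in cycles per unit time.
    """
    cpu_done = 0.0  # time at which the CPU finishes launching the current kernel
    gpu_done = 0.0  # time at which the GPU finishes the current kernel
    for launch, execute in zip(launch_cycles, exec_cycles):
        cpu_done += launch / f_cpu
        # The GPU waits for both the launch and the previous kernel to complete.
        gpu_done = max(gpu_done, cpu_done) + execute / f_gpu
    return gpu_done


launch = [2.0, 2.0, 2.0]
execute = [1.0, 4.0, 1.0]

base = pipeline_latency(launch, execute, f_cpu=1.0, f_gpu=1.0)
fast_gpu = pipeline_latency(launch, execute, f_cpu=1.0, f_gpu=2.0)

# Two simple analytic predictions for the f_gpu=2.0 case:
gpu_only_scaling = base * (1.0 / 2.0)           # treat all latency as GPU time
serial_sum = sum(launch) / 1.0 + sum(execute) / 2.0  # assume no overlap at all

print(base, fast_gpu, gpu_only_scaling, serial_sum)
```

In this toy setting, doubling the GPU frequency lowers latency from 9.0 to 6.5, while GPU-only scaling predicts 4.5 and a no-overlap sum predicts 9.0. Neither simple formula matches, because the amount of CPU-GPU overlap itself changes with the frequency pair, which is the kind of per-layer interaction the abstract says static or analytic approaches miss.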
