FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations

Zhihao Shu; Md Musfiqur Rahman Sanim; Hangyu Zheng; Kunxiong Zhu; Miao Yin; Gagan Agrawal; Wei Niu

arXiv:2602.15379·cs.DC·February 18, 2026

Open Access

TL;DR

FlashMem is a memory streaming framework that enables efficient execution of large and multiple DNNs on mobile GPUs by dynamically streaming weights, significantly reducing memory usage and inference latency.

Contribution

FlashMem introduces a memory streaming approach that combines static scheduling of weight loads with dynamic on-demand loading, which the authors report outperforms weight preloading strategies on modern DNN workloads.

Findings

1. Achieves 2.0x to 8.4x memory reduction

2. Attains 1.7x to 75.0x speedup over existing frameworks

3. Supports large-scale and multi-DNN workloads on mobile GPUs

Abstract

The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily deploy a weight preloading strategy, where all model parameters are loaded into memory before execution on mobile GPUs. We posit that this approach is not adequate for modern DNN workloads that comprise very large model(s) and possibly execution of several distinct models in succession. In this work, we introduce FlashMem, a memory streaming framework designed to efficiently execute large-scale modern DNNs and multi-DNN workloads while minimizing memory consumption and reducing inference latency. Instead of fully preloading weights, FlashMem statically determines model loading schedules and dynamically streams them on demand, leveraging 2.5D texture…
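The contrast the abstract draws between weight preloading and scheduled streaming can be illustrated with a toy memory model. The sketch below is not FlashMem's algorithm: the function names, the fixed `lookahead` window, and the layer sizes are all hypothetical, and it ignores texture memory layout and load/compute overlap. It only shows why keeping a small window of layers resident lowers peak weight memory compared with loading everything up front.

```python
from typing import List


def peak_memory_preload(layer_sizes: List[int]) -> int:
    """Preloading: every layer's weights are resident before execution starts."""
    return sum(layer_sizes)


def build_schedule(num_layers: int, lookahead: int) -> List[List[int]]:
    """Hypothetical static schedule: entry i lists the layers to load just
    before layer i runs. Step 0 loads layers 0..lookahead; every later step
    loads the single layer that is `lookahead` positions ahead, if any."""
    schedule: List[List[int]] = []
    for i in range(num_layers):
        if i == 0:
            schedule.append(list(range(min(lookahead + 1, num_layers))))
        else:
            nxt = i + lookahead
            schedule.append([nxt] if nxt < num_layers else [])
    return schedule


def peak_memory_streaming(layer_sizes: List[int], lookahead: int) -> int:
    """Replay the schedule, freeing each layer's weights once it has executed,
    and return the largest total size that was resident at any step."""
    schedule = build_schedule(len(layer_sizes), lookahead)
    resident = set()
    peak = 0
    for i, loads in enumerate(schedule):
        resident.update(loads)
        assert i in resident, "schedule must load a layer before it runs"
        peak = max(peak, sum(layer_sizes[j] for j in resident))
        resident.discard(i)  # weights of layer i are no longer needed
    return peak


if __name__ == "__main__":
    sizes = [4, 2, 6, 3]  # made-up per-layer weight sizes, arbitrary units
    print("preload peak:  ", peak_memory_preload(sizes))
    print("streaming peak:", peak_memory_streaming(sizes, lookahead=1))
```

In this toy model a larger `lookahead` trades memory for slack: with the window covering the whole model, streaming degenerates to preloading.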

Taxonomy

Topics: Advanced Neural Network Applications · IoT and Edge/Fog Computing · Big Data and Digital Economy