FluidML: Fast and Memory Efficient Inference Optimization

Jinjie Liu; Hang Qiu

arXiv:2411.09242·cs.LG·November 15, 2024

TL;DR

FluidML is a versatile framework that optimizes inference speed and memory usage for large machine learning models on edge devices, enabling more efficient deployment in resource-constrained environments.

Contribution

FluidML introduces a generic runtime memory management and optimization framework that transforms the model execution blueprint to reduce both end-to-end inference latency and peak memory usage for complex models across platforms.

Findings

01. Up to 25.38% reduction in inference latency

02. Up to 41.47% reduction in peak memory usage

03. Applicable across different hardware platforms
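
The percentages above are relative reductions against a baseline. As a minimal sketch of that arithmetic, assuming the usual definition (baseline minus optimized, divided by baseline), the helper below computes the figure. The function name and the measurements are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, optimized: float) -> float:
    """Return the percentage by which `optimized` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimized) / baseline * 100.0


# Hypothetical measurements for illustration only (not from the paper).
latency_pct = relative_reduction(baseline=200.0, optimized=150.0)  # milliseconds
memory_pct = relative_reduction(baseline=800.0, optimized=600.0)   # MiB

print(f"latency reduction: {latency_pct:.2f}%")
print(f"peak memory reduction: {memory_pct:.2f}%")
```

Note that "up to" figures report the best case across the evaluated models and platforms, so a single run of this calculation corresponds to one model on one platform.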

Abstract

Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices are not keeping pace with the ever-growing number of parameters in these models. As models become bigger and more complicated, their novel and sophisticated structures make inference runtime optimization more challenging. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML can consistently reduce the end-to-end inference latency by up to 25.38% for popular language models and reduce peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML is of ~30K…
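
To make the link between the execution blueprint and peak memory concrete, here is a generic sketch of how the order in which operators run changes the peak size of live intermediate tensors. This is not FluidML's algorithm; the graph, tensor sizes, and the `peak_memory` function are all hypothetical, and the model assumes each tensor is freed right after its last consumer runs.

```python
from typing import Dict, List, Tuple

# Hypothetical graph format: op name -> (output size in bytes, input op names).
Graph = Dict[str, Tuple[int, List[str]]]


def peak_memory(graph: Graph, order: List[str]) -> int:
    """Peak bytes of live tensors when ops run in `order` (assumed topological)."""
    position = {name: i for i, name in enumerate(order)}
    # A tensor stays live until the last op that consumes it has run.
    last_use = dict(position)
    for name in order:
        for src in graph[name][1]:
            last_use[src] = max(last_use[src], position[name])

    live = 0
    peak = 0
    for step, name in enumerate(order):
        live += graph[name][0]  # allocate this op's output
        peak = max(peak, live)
        # Free every tensor whose last use is the current step.
        for done in order[: step + 1]:
            if last_use[done] == step:
                live -= graph[done][0]
    return peak


# Two independent branches that merge; sizes are made up for illustration.
graph: Graph = {
    "a1": (10, []),
    "a2": (1, ["a1"]),
    "b1": (10, []),
    "b2": (1, ["b1"]),
    "out": (1, ["a2", "b2"]),
}

interleaved = peak_memory(graph, ["a1", "b1", "a2", "b2", "out"])
branch_first = peak_memory(graph, ["a1", "a2", "b1", "b2", "out"])
print(interleaved, branch_first)
```

In this toy graph, finishing one branch before starting the other lets the large `a1` buffer be released before `b1` is allocated, so the same computation needs less memory at its peak. A real runtime would also have to account for weights, in-place operators, and allocator fragmentation, which this sketch ignores.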

Taxonomy

Topics: Data Stream Mining Techniques · Time Series Analysis and Forecasting · Parallel Computing and Optimization Techniques