FluidML: Fast and Memory Efficient Inference Optimization
Jinjie Liu, Hang Qiu

TL;DR
FluidML is a versatile framework that optimizes inference speed and memory usage for large machine learning models on edge devices, enabling more efficient deployment in resource-constrained environments.
Contribution
FluidML introduces a flexible runtime memory management and optimization framework that significantly improves inference efficiency for complex models on various platforms.
Findings
Up to 25.38% reduction in inference latency
Up to 41.47% reduction in peak memory usage
Applicable across different hardware platforms
Abstract
Machine learning models deployed on edge devices have enabled numerous exciting new applications, such as humanoid robots, AR glasses, and autonomous vehicles. However, the computing resources available on these edge devices have not kept pace with the ever-growing number of parameters in these models. As models become larger and more complex, their novel and sophisticated structures make inference runtime optimization harder. We present FluidML, a generic runtime memory management and optimization framework that can flexibly transform the model execution blueprint to achieve faster and more memory-efficient inference. Evaluations across different platforms show that FluidML consistently reduces end-to-end inference latency, by up to 25.38% for popular language models, and reduces peak memory usage by up to 41.47%, compared to state-of-the-art approaches. FluidML is of ~30K…
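To make the idea of transforming the execution blueprint more concrete, the sketch below shows why the order in which operators run can change peak memory even when the graph itself is unchanged. This is not FluidML's algorithm, which the abstract does not describe. The toy graph, the tensor sizes, and the names `peak_memory`, `DEPS`, `SIZE`, `ORDER_BREADTH`, and `ORDER_DEPTH` are all hypothetical, and the memory model is an assumption: each output is allocated when its operator runs and freed right after its last consumer finishes.

```python
def peak_memory(order, deps, size):
    """Peak live memory for one execution order (toy model, not FluidML's)."""
    position = {op: i for i, op in enumerate(order)}

    # Reject orders that run an operator before one of its inputs exists.
    for op in order:
        for dep in deps[op]:
            if position[dep] >= position[op]:
                raise ValueError(f"{op} is scheduled before its input {dep}")

    # A tensor stays live until the last step that reads it
    # (or until the step that produces it, if nothing reads it).
    last_use = dict(position)
    for op in order:
        for dep in deps[op]:
            last_use[dep] = max(last_use[dep], position[op])

    live = peak = 0
    for step, op in enumerate(order):
        live += size[op]            # allocate this operator's output
        peak = max(peak, live)      # inputs and output coexist during the op
        for tensor in order[: step + 1]:
            if last_use[tensor] == step:
                live -= size[tensor]  # free tensors nobody reads again
    return peak


# Hypothetical graph: two branches fan out from `a` and merge at `f`.
DEPS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"], "f": ["d", "e"]}
# Hypothetical output sizes in arbitrary units.
SIZE = {"a": 4, "b": 8, "c": 8, "d": 1, "e": 1, "f": 1}

ORDER_BREADTH = ["a", "b", "c", "d", "e", "f"]  # both large tensors live at once
ORDER_DEPTH = ["a", "b", "d", "c", "e", "f"]    # finish one branch before the next

print(peak_memory(ORDER_BREADTH, DEPS, SIZE))  # 20
print(peak_memory(ORDER_DEPTH, DEPS, SIZE))    # 13
```

Both orders are valid topological orders of the same graph and compute the same result; only the second lets the large intermediate `b` be freed before `c` is allocated. The figures are properties of this toy graph only and say nothing about the results reported for FluidML.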
Taxonomy
Topics: Data Stream Mining Techniques · Time Series Analysis and Forecasting · Parallel Computing and Optimization Techniques
