Memory Planning for Deep Neural Networks
Maksim Levental

TL;DR
This paper introduces MemoMalloc, a memory allocation technique for DNN inference that avoids the mutex contention of the system memory allocator under multi-threading, reducing inference latency by up to 40% with only a moderate increase in peak memory usage.
Contribution
The paper presents MemoMalloc, a memory planning method for DNN inference that combines a runtime component, which records allocations, with a static analysis component, which builds an allocation plan, in order to reduce latency.
Findings
MemoMalloc reduces DNN inference latency by up to 40%.
It outperforms existing general-purpose memory allocators.
The approach trades a moderate increase in peak memory usage for lower latency.
Abstract
We study memory allocation patterns in DNNs during inference, in the context of large-scale systems. We observe that such memory allocation patterns, in the context of multi-threading, are subject to high latencies, due to mutex contention in the system memory allocator. Latencies incurred due to such mutex contention produce undesirable bottlenecks in user-facing services. Thus, we propose a "memorization" based technique, MemoMalloc, for optimizing overall latency, with only moderate increases in peak memory usage. Specifically, our technique consists of a runtime component, which captures all allocations and uniquely associates them with their high-level source operation, and a static analysis component, which constructs an efficient allocation "plan". We present an implementation of MemoMalloc in the PyTorch deep learning framework and evaluate…
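To make the idea of an allocation "plan" concrete, the following is a minimal sketch of offset planning over a recorded allocation trace. It is not the paper's algorithm or the MemoMalloc implementation: the `Alloc` record, the `plan_offsets` function, the greedy largest-first heuristic, and the example trace are all assumptions chosen for illustration. Each recorded buffer has a size and a lifetime, and the planner assigns it a fixed offset inside one pre-allocated slab so that buffers live at the same time never overlap in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alloc:
    """One recorded allocation (hypothetical record layout)."""
    op: str     # name of the high-level operation that requested the buffer
    size: int   # size in bytes
    start: int  # first step at which the buffer is live
    end: int    # last step at which the buffer is live (inclusive)


def overlaps_in_time(a: Alloc, b: Alloc) -> bool:
    """True if the two buffers are live during at least one common step."""
    return a.start <= b.end and b.start <= a.end


def plan_offsets(allocs):
    """Greedy largest-first planner (an illustrative heuristic).

    Places each buffer at the lowest offset that does not collide with any
    already-placed buffer whose lifetime overlaps it. Returns a mapping from
    op name to offset, plus the total slab size required.
    """
    placed = []  # list of (offset, Alloc)
    plan = {}
    for a in sorted(allocs, key=lambda x: x.size, reverse=True):
        # Memory intervals occupied by buffers that are live at the same time.
        busy = sorted(
            (off, off + p.size) for off, p in placed if overlaps_in_time(a, p)
        )
        offset = 0
        for lo, hi in busy:
            if offset + a.size <= lo:
                break  # the gap before this interval is large enough
            offset = max(offset, hi)
        placed.append((offset, a))
        plan[a.op] = offset
    peak = max((off + p.size for off, p in placed), default=0)
    return plan, peak


# Hypothetical trace of four intermediate buffers.
trace = [
    Alloc("conv1", 64, 0, 1),
    Alloc("relu1", 64, 1, 2),
    Alloc("conv2", 32, 2, 3),
    Alloc("fc", 16, 3, 4),
]
plan, peak = plan_offsets(trace)
```

With a plan like this computed ahead of time, inference can request one slab up front and hand out buffers by offset, so threads do not need to take the system allocator's lock for each intermediate tensor. In this example the slab is smaller than the sum of the buffer sizes because buffers with disjoint lifetimes share memory.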
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques
