Memory Planning for Deep Neural Networks

Maksim Levental

arXiv:2203.00448 · cs.LG · March 2, 2022

TL;DR

This paper introduces MemoMalloc, a memory planning technique for DNN inference that avoids mutex contention in the system memory allocator, reducing inference latency by up to 40% at the cost of a moderate increase in peak memory usage.

Contribution

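The paper presents MemoMalloc, a memory planning method for DNN inference that pairs a runtime component, which captures every allocation and associates it with its high-level source operation, with a static analysis component, which constructs an efficient allocation plan.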

Findings

01. MemoMalloc reduces DNN inference latency by up to 40%.

02. It outperforms existing general-purpose memory allocators.

03. The approach balances latency improvements with moderate memory increases.

Abstract

We study memory allocation patterns in DNNs during inference, in the context of large-scale systems. We observe that such memory allocation patterns, in the context of multi-threading, are subject to high latencies, due to mutex contention in the system memory allocator. Latencies incurred due to such mutex contention produce undesirable bottlenecks in user-facing services. Thus, we propose a "memorization" based technique, MemoMalloc, for optimizing overall latency, with only moderate increases in peak memory usage. Specifically, our technique consists of a runtime component, which captures all allocations and uniquely associates them with their high-level source operation, and a static analysis component, which constructs an efficient allocation "plan". We present an implementation of MemoMalloc in the PyTorch deep learning framework and evaluate…
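
To make the idea of an allocation "plan" concrete, the following is a minimal sketch and not the paper's implementation. It assumes a recorded trace in which each buffer has a size and a live interval, and it assigns each buffer a fixed offset inside one pre-allocated arena using a simple greedy-by-size heuristic. The names `Alloc`, `plan_offsets`, and the operation labels are hypothetical, and the heuristic is an illustrative choice; the abstract does not state which planning strategy MemoMalloc uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alloc:
    op: str     # hypothetical label for the operation that requested the buffer
    size: int   # buffer size in bytes
    start: int  # first step at which the buffer is live
    end: int    # last step at which the buffer is live (inclusive)


def overlaps_in_time(a: Alloc, b: Alloc) -> bool:
    """True if the two buffers are live during at least one common step."""
    return a.start <= b.end and b.start <= a.end


def plan_offsets(allocs):
    """Assign each buffer an offset in a single arena (greedy, largest first).

    Buffers whose lifetimes overlap never share addresses; buffers with
    disjoint lifetimes may reuse the same region.
    """
    placed = []  # list of (Alloc, offset) pairs already planned
    for a in sorted(allocs, key=lambda x: x.size, reverse=True):
        # Only buffers live at the same time constrain where `a` can go.
        conflicts = sorted(
            ((p, off) for p, off in placed if overlaps_in_time(a, p)),
            key=lambda t: t[1],
        )
        offset = 0
        for other, off in conflicts:
            if offset + a.size <= off:
                break  # `a` fits in the gap below this conflicting buffer
            offset = max(offset, off + other.size)
        placed.append((a, offset))
    plan = {a.op: off for a, off in placed}
    peak = max((off + a.size for a, off in placed), default=0)
    return plan, peak


# A made-up trace of four intermediate buffers, for illustration only.
trace = [
    Alloc("conv1.out", 4096, 0, 1),
    Alloc("relu1.out", 4096, 1, 2),
    Alloc("conv2.out", 2048, 2, 3),
    Alloc("fc.out", 1024, 3, 4),
]
plan, peak = plan_offsets(trace)
total = sum(a.size for a in trace)
print(plan)
print(f"arena size: {peak} bytes, sum of all buffers: {total} bytes")
```

Under these assumptions, inference would reserve one arena of `peak` bytes up front and hand each operation its planned offset, so no per-allocation call into the system allocator (and no lock acquisition there) is needed on the hot path.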

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Neural Network Applications · Parallel Computing and Optimization Techniques