ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han

TL;DR
ODMA is a dynamic memory allocation strategy designed for LPDDR accelerators that improves memory utilization and throughput in large language model serving by addressing distribution drift and heavy-tailed request patterns.
Contribution
ODMA introduces predictor-based adaptive bucket partitioning and a safety-pool mechanism tailored for LPDDR hardware, overcoming the limitations of static allocation.
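To make the two mechanisms concrete, here is a minimal sketch of how predicted generation lengths could map to contiguous buckets, with a shared pool absorbing mispredictions. It is not the authors' implementation. The class `BucketAllocator`, its method names, the quantile-based repartitioning rule, and the token-denominated pool are all assumptions made for illustration; the summary above does not specify how ODMA chooses bucket boundaries or sizes its safety pool.

```python
from bisect import bisect_left
from collections import deque


class BucketAllocator:
    """Toy model of predictor-driven bucket allocation with a safety pool.

    Hypothetical names and policies throughout; not ODMA's actual API.
    """

    def __init__(self, boundaries, safety_pool_tokens, window=1000):
        self.boundaries = sorted(boundaries)   # bucket upper bounds, in tokens
        self.safety_pool = safety_pool_tokens  # shared overflow budget, in tokens
        self.history = deque(maxlen=window)    # recent observed generation lengths

    def bucket_for(self, predicted_len):
        """Return the smallest bucket that covers the predicted length.

        Predictions beyond the largest bucket are clamped to it.
        """
        i = bisect_left(self.boundaries, predicted_len)
        return self.boundaries[min(i, len(self.boundaries) - 1)]

    def finish(self, reserved, actual_len):
        """Record the true length and charge any overrun to the safety pool.

        Returns the number of tokens borrowed from the pool (0 if none).
        """
        self.history.append(actual_len)
        overflow = max(0, actual_len - reserved)
        if overflow > self.safety_pool:
            raise MemoryError("safety pool exhausted")
        self.safety_pool -= overflow
        return overflow

    def repartition(self, num_buckets):
        """Reset boundaries to empirical quantiles of recent lengths.

        Assumed policy: this is one simple way to follow distribution drift.
        """
        if not self.history:
            return
        data = sorted(self.history)
        n = len(data)
        cuts = set()
        for k in range(1, num_buckets + 1):
            idx = -(-k * n // num_buckets) - 1  # ceil(k*n/num_buckets) - 1
            cuts.add(data[idx])
        self.boundaries = sorted(cuts)
```

In this sketch a request reserves one contiguous region sized by `bucket_for`, so memory stays sequential, while the safety pool covers the heavy tail of requests that run longer than predicted. Calling `repartition` periodically lets the bucket sizes track the recent length distribution instead of staying fixed.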
Findings
ODMA increases KV-cache utilization by up to 19.25%.
ODMA improves throughput (TPS) by 23-27%.
ODMA improves prediction accuracy from 98.60% to 99.55% on Alpaca.
Abstract
Existing memory management techniques severely hinder efficient Large Language Model serving on accelerators constrained by poor random-access bandwidth. While static pre-allocation preserves memory contiguity, it incurs significant overhead due to worst-case provisioning. Conversely, fine-grained paging mitigates this overhead but relies on HBM's high random-access tolerance, making it unsuitable for LPDDR systems where non-sequential access rapidly degrades bandwidth. Furthermore, prior works typically assume static distributions and HBM characteristics, thereby failing to resolve the critical fragmentation and bandwidth constraints inherent to LPDDR hardware. We present ODMA, an on-demand memory allocation strategy tailored for random-access-constrained accelerators, such as the Cambricon MLU series. ODMA advances generation-length prediction by addressing two critical limitations in…
