MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

Haoran Wu; Zeyu Cao; Yao Lai; Binglei Lou; Jiayi Nie; Can Xiao; Timi Adeniran; Przemyslaw Forys; Kauser Johar; Catriona Wright; Junyi Liu; Kai Shi; Nicholas D. Lane; Rika Antonova; Jianyi Cheng; Timothy Jones; Aaron Zhao; Robert Mullins

arXiv:2604.16007·cs.AR·April 20, 2026

TL;DR

MemExplorer is a memory system synthesizer that optimizes heterogeneous memory architectures for agentic LLM inference workloads, improving energy and power efficiency in multi-device NPU systems.

Contribution

MemExplorer introduces a unified modeling approach and automated design space exploration for heterogeneous memory systems in next-generation NPUs.

Findings

01

Achieves up to 2.3x higher energy efficiency than a baseline NPU.

02

Delivers up to 3.23x higher energy efficiency than an H100 in the prefill setting.

03

Provides up to 2.72x higher power efficiency than an H100 in the decode setting.

Abstract

Emerging agentic LLM workloads are driving rapidly growing demand on both memory capacity and bandwidth, with different phases of inference (e.g., prefill and decode) imposing distinct requirements. Industry is responding by composing heterogeneous accelerators into single interconnected systems, as exemplified by NVIDIA's Vera Rubin platform, where each device brings its own memory architecture. This heterogeneity is further compounded by a widening landscape of available memory technologies: high-density on-chip SRAM, HBM, LPDDR, GDDR, and emerging options such as high-bandwidth flash (HBF), each offering different capacity, bandwidth, and power trade-offs. Identifying the right memory architecture for next-generation inference accelerators requires navigating a vast and rapidly evolving design space, in which the interplay between workload characteristics, NPU design dimensions,…
