NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference
Mingbo Hao, Changwei Yan, Haoyu Cui, Zhihao Yan, Yizhi Ding, Zhangrui Qian, Weiwei Shan

TL;DR
NVLLM is a 3D NAND-centric architecture that accelerates large language model inference on edge devices by executing feed-forward network (FFN) computation inside NAND flash and reading FFN weights at page granularity without passing them through DRAM.
Contribution
The paper presents NVLLM, a novel 3D NAND-based inference architecture that tightly integrates NAND storage with compute pipelines for efficient edge LLM inference.
Findings
Achieves up to 37.9× speedup over A800-based inference.
Provides up to 4.7× speedup over SSD-like designs.
Maintains only 2.7% CMOS area overhead.
Abstract
The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated…
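The abstract's statement that GEMM/GEMV operations are decomposed into dot-product primitives can be illustrated with a small software sketch. This is not the authors' hardware design: the names `PAGE_ELEMS`, `dot`, `gemv_by_pages`, and `gemm_by_gemv` are hypothetical, and the page size is an arbitrary illustrative value. The sketch only shows how a matrix-vector product might be split into page-sized partial dot products whose results are accumulated per output row.

```python
from typing import List, Sequence

# Hypothetical number of weights held in one NAND page (illustrative only).
PAGE_ELEMS = 4


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product primitive: the unit of work one PE lane might handle."""
    return sum(x * y for x, y in zip(a, b))


def gemv_by_pages(weights: List[List[float]], x: Sequence[float],
                  page_elems: int = PAGE_ELEMS) -> List[float]:
    """Compute y = W @ x by accumulating page-sized partial dot products."""
    y = [0.0] * len(weights)
    for row_idx, row in enumerate(weights):
        # Walk the weight row one "page" at a time, as if each slice
        # arrived from a separate page read.
        for start in range(0, len(row), page_elems):
            page = row[start:start + page_elems]
            y[row_idx] += dot(page, x[start:start + page_elems])
    return y


def gemm_by_gemv(weights: List[List[float]], xs: List[Sequence[float]],
                 page_elems: int = PAGE_ELEMS) -> List[List[float]]:
    """Treat a GEMM as one GEMV per input vector (one result row per input)."""
    return [gemv_by_pages(weights, x, page_elems) for x in xs]


if __name__ == "__main__":
    W = [[1.0, 2.0, 3.0, 4.0, 5.0],
         [0.0, 1.0, 0.0, 1.0, 0.0]]
    print(gemv_by_pages(W, [1.0, 1.0, 1.0, 1.0, 1.0]))
```

Because each partial dot product contributes to its output row by simple addition, the partial results can in principle be accumulated in any order (up to floating-point rounding), which is one plausible reason such a decomposition pairs well with out-of-order PE lanes.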
