NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao; Changwei Yan; Haoyu Cui; Zhihao Yan; Yizhi Ding; Zhangrui Qian; Weiwei Shan

arXiv:2604.25699·cs.AR·April 29, 2026

TL;DR

NVLLM is a 3D NAND-centric architecture that accelerates large language model inference on edge devices by executing feed-forward network computation inside NAND flash rather than moving weights through DRAM, reporting up to 37.9× speedup over A800-based inference and up to 4.7× over SSD-like designs.

Contribution

The paper presents NVLLM, a novel 3D NAND-based inference architecture that tightly integrates NAND storage with compute pipelines for efficient edge LLM inference.

Findings

01. Achieves up to 37.9× speedup over A800-based inference.

02. Provides up to 4.7× speedup over SSD-like designs.

03. Maintains only 2.7% CMOS area overhead.

Abstract

The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the Flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, operating directly on raw NAND reads with integrated…
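
The decomposition of GEMV into dot-product primitives described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the page size `PAGE_ELEMS`, the function names, and the use of a shuffled task list to stand in for out-of-order PE lanes are all assumptions made for illustration. It only shows why page-sized dot products can complete in any order and still produce the correct output vector.

```python
import random

# Hypothetical number of weights held in one NAND page (illustrative value only).
PAGE_ELEMS = 8


def dot(a, b):
    """Dot-product primitive: the only arithmetic kernel used below."""
    return sum(p * q for p, q in zip(a, b))


def split_into_tasks(weights, page_elems=PAGE_ELEMS):
    """Decompose y = W @ x into independent tasks, one per page-sized row chunk.

    Each task is (row index, column offset, weight chunk).
    """
    tasks = []
    for row, w_row in enumerate(weights):
        for start in range(0, len(w_row), page_elems):
            tasks.append((row, start, w_row[start:start + page_elems]))
    return tasks


def gemv_out_of_order(weights, x, page_elems=PAGE_ELEMS, seed=0):
    """Compute W @ x by accumulating page-level dot products in arbitrary order."""
    tasks = split_into_tasks(weights, page_elems)
    # Shuffling emulates pages arriving from different planes in any order.
    random.Random(seed).shuffle(tasks)
    y = [0] * len(weights)
    for row, start, chunk in tasks:
        # Each partial result is added to the output element for its row.
        y[row] += dot(chunk, x[start:start + len(chunk)])
    return y


def gemv_reference(weights, x):
    """Straightforward in-order GEMV used as a correctness check."""
    return [dot(w_row, x) for w_row in weights]


if __name__ == "__main__":
    W = [[(r * 7 + c) % 5 - 2 for c in range(20)] for r in range(3)]
    x = [c % 3 - 1 for c in range(20)]
    assert gemv_out_of_order(W, x) == gemv_reference(W, x)
```

The example uses integer weights so that accumulation order cannot change the result; with floating-point values, out-of-order accumulation may differ from the in-order result by rounding error.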
