SHIELD: A Segmented Hierarchical Memory Architecture for Energy-Efficient LLM Inference on Edge NPUs

Jintao Zhang; Xuanyao Fong

arXiv:2604.07396 · cs.AR · April 10, 2026

TL;DR

SHIELD is a segmented eDRAM architecture that cuts refresh energy for LLM inference on edge NPUs by exploiting how long each activation resides on chip and how sensitive each BF16 bit field is to errors.

Contribution

SHIELD introduces a lifecycle-aware segmented eDRAM design that separates the BF16 sign and exponent fields from the mantissa, disables refresh for transient QO mantissas, and relaxes refresh for persistent KV mantissas.

Findings

01

Reduces eDRAM refresh energy by 35%

02

Maintains accuracy on WikiText-2, PIQA, and ARC-Easy datasets

03

Applicable across multiple LLMs and inference scenarios

Abstract

Large Language Model (LLM) inference on edge Neural Processing Units (NPUs) is fundamentally constrained by limited on-chip memory capacity. Although high-density embedded DRAM (eDRAM) is attractive for storing activation workspaces, its periodic refresh consumes substantial energy. Prior work has primarily focused on reducing off-chip traffic or optimizing refresh for persistent Key-Value (KV) caches, while transient and error-resilient Query and Attention Output (QO) activations are largely overlooked. We propose SHIELD, a lifecycle-aware segmented eDRAM architecture that jointly exploits temporal residency and bit-level sensitivity in bfloat16 (BF16) activations. SHIELD isolates the sign and exponent fields from the mantissa, disables refresh for transient QO mantissas, and applies relaxed refresh to persistent KV mantissas. Across multiple LLMs and inference scenarios, SHIELD…
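
The field segmentation described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`split_bf16`, `merge_bf16`, `refresh_mode`) and the policy table are hypothetical, and the "standard" refresh for sign and exponent segments is an assumption, since the abstract only states what happens to the mantissas. The sketch shows why the mantissa is the error-tolerant field: for a normal BF16 value, corrupting all 7 mantissa bits keeps the result within the same power-of-two interval, while flipping a single exponent bit can change the magnitude by many orders.

```python
import struct

def float_to_bf16_bits(x: float) -> int:
    """Return the BF16 bit pattern of x by truncating its float32 encoding.
    BF16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    Truncation (not rounding) is used here for simplicity."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bf16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x

def split_bf16(bits16: int) -> tuple[int, int]:
    """Split into a 9-bit sign+exponent segment and a 7-bit mantissa segment."""
    return bits16 >> 7, bits16 & 0x7F

def merge_bf16(sign_exp: int, mantissa: int) -> int:
    """Recombine the two segments into one 16-bit pattern."""
    return ((sign_exp & 0x1FF) << 7) | (mantissa & 0x7F)

# Hypothetical policy table. The mantissa rows follow the abstract;
# the sign_exp rows ("standard") are an assumption made for this sketch.
REFRESH_POLICY = {
    ("QO", "sign_exp"): "standard",
    ("QO", "mantissa"): "disabled",   # transient activations
    ("KV", "sign_exp"): "standard",
    ("KV", "mantissa"): "relaxed",    # persistent cache entries
}

def refresh_mode(tensor_kind: str, field: str) -> str:
    """Look up the refresh mode for a (tensor kind, bit field) segment."""
    return REFRESH_POLICY[(tensor_kind, field)]

if __name__ == "__main__":
    bits = float_to_bf16_bits(1.5)
    sign_exp, mantissa = split_bf16(bits)

    # Worst-case mantissa corruption: the value stays inside [1.0, 2.0).
    low = bf16_bits_to_float(merge_bf16(sign_exp, 0x00))
    high = bf16_bits_to_float(merge_bf16(sign_exp, 0x7F))
    print("mantissa corrupted:", low, high)

    # One flipped exponent bit changes the scale by a factor of 2**64.
    flipped = bf16_bits_to_float(merge_bf16(sign_exp ^ 0x40, mantissa))
    print("exponent bit flipped:", flipped)

    print("QO mantissa refresh:", refresh_mode("QO", "mantissa"))
```

Keeping the two segments in separate arrays is what lets a design apply a different refresh mode to each one; how SHIELD organizes those arrays physically is not specified in the text above.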
