AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Zhongkai Yu; Haotian Ye; Chenyang Zhou; Ohm Rishabh Venkatachalam; Zaifeng Pan; Zhengding Hu; Junsung Kim; Won Woo Ro; Po-An Tsai; Shuyi Pei; Yangwook Kang; Yufei Ding

arXiv:2604.26103 · cs.AR · May 1, 2026

TL;DR

AMMA is a multi-chiplet, memory-centric architecture designed to reduce the latency and energy consumption of long-context attention serving for large language models.

Contribution

AMMA introduces a memory-centric design built on HBM-PNM cubes, combined with a specialized microarchitecture, hybrid parallelism, and an optimized data flow, to improve performance over GPU-based systems.

Findings

01

Achieves 15.5× lower attention latency than an NVIDIA H100.

02

Reduces energy consumption by 6.9× compared to an NVIDIA H100.

03

Provides design guidance through a comprehensive design-space exploration.

Abstract

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still treat the GPU as the central hub for cross-device communication. Yet the GPU's compute-rich architecture is fundamentally mismatched with the memory-bound nature of decode-phase attention, inflating serving latency while wasting power and die area on idle compute units. The problem is compounded as reasoning and agentic workloads push context lengths toward one million tokens, making attention latency the primary user-facing bottleneck. To address these inefficiencies, we present AMMA, a multi-chiplet, memory-centric architecture for low-latency long-context attention. AMMA replaces GPU compute dies with HBM-PNM cubes, roughly doubling the available memory bandwidth to better serve…
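The abstract's claim that decode-phase attention is memory-bound can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not taken from the paper: the model shape (32 layers, 32 query heads, 8 KV heads, head dimension 128, FP16) and the 3.35 TB/s bandwidth figure are assumptions chosen for illustration, and the function names are hypothetical. It ignores memory capacity limits, multi-device sharding, and softmax cost.

```python
# Rough roofline estimate for single-sequence decode attention.
# All model dimensions and the bandwidth value are illustrative
# assumptions, not numbers reported in the paper.

def kv_cache_bytes(context_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem=2):
    """Bytes of K and V read in one decode step for one sequence."""
    # Leading factor 2 accounts for the separate K and V tensors.
    return 2 * context_len * num_layers * num_kv_heads * head_dim * bytes_per_elem


def decode_attention_flops(context_len, num_layers, num_q_heads, head_dim):
    """Approximate FLOPs for one decode step (QK^T plus PV)."""
    # Two matrix-vector products, each costing 2 FLOPs per multiply-add.
    return 2 * 2 * context_len * num_layers * num_q_heads * head_dim


def min_stream_time_s(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on latency if the step is limited only by memory bandwidth."""
    return num_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    ctx = 1_000_000                   # one-million-token context
    layers, q_heads, kv_heads, dim = 32, 32, 8, 128  # hypothetical model shape
    bandwidth = 3.35e12               # assumed peak memory bandwidth, bytes/s

    kv = kv_cache_bytes(ctx, layers, kv_heads, dim)
    flops = decode_attention_flops(ctx, layers, q_heads, dim)

    print(f"KV cache read per step: {kv / 1e9:.1f} GB")
    print(f"Arithmetic intensity:   {flops / kv:.1f} FLOPs/byte")
    print(f"Bandwidth-bound time:   {min_stream_time_s(kv, bandwidth) * 1e3:.1f} ms")
```

Under these assumptions, each decode step performs only a few FLOPs per byte of KV cache it reads, so the time to stream the cache, rather than compute throughput, sets the latency floor. That is the mismatch the abstract attributes to compute-rich GPUs.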
