RPU -- A Reasoning Processing Unit

Matthew Adiletta; Gu-Yeon Wei; David Brooks

arXiv:2602.18568 · cs.AR · February 25, 2026

Open Access

TL;DR

The paper introduces the RPU, a chiplet-based architecture optimized for memory-bandwidth-bound LLM inference, and reports significantly lower latency and higher throughput than existing GPU systems.

Contribution

The RPU architecture combines a capacity-optimized high-bandwidth memory, a scalable chiplet design, and a decoupled microarchitecture to address the memory wall in LLM inference.

Findings

01. RPU achieves up to 45.3x lower latency than H100.

02. RPU provides 18.6x higher throughput on Llama3-405B.

03. The reported gains come from simulation results.

Abstract

Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable…
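The abstract's point about low arithmetic intensity can be made concrete with a roofline-style estimate. The sketch below is not from the paper: it assumes a dense model whose weights are read once per decode step, ignores KV-cache and activation traffic, and uses hypothetical function names. Every hardware number in the usage lines is a round illustrative placeholder, not a specification of the RPU or of any GPU.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Assumes each parameter is read once per step and contributes about
    2 FLOPs (one multiply, one add) for every sequence in the batch.
    """
    return 2.0 * batch_size / bytes_per_param


def min_decode_step_seconds(num_params: float,
                            bytes_per_param: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: time to stream all weights once."""
    return num_params * bytes_per_param / mem_bandwidth_bytes_per_s


def is_bandwidth_bound(intensity: float,
                       peak_flops_per_s: float,
                       mem_bandwidth_bytes_per_s: float) -> bool:
    """True when the workload sits below the roofline ridge point."""
    ridge = peak_flops_per_s / mem_bandwidth_bytes_per_s  # FLOPs per byte
    return intensity < ridge


# Illustrative placeholders only (assumptions, not measured values):
NUM_PARAMS = 405e9        # parameter count implied by the name Llama3-405B
BYTES_PER_PARAM = 1.0     # assumed 8-bit weights
BANDWIDTH = 1e12          # hypothetical 1 TB/s aggregate memory bandwidth
PEAK_FLOPS = 1e15         # hypothetical 1 PFLOP/s peak compute

intensity = decode_arithmetic_intensity(batch_size=1, bytes_per_param=BYTES_PER_PARAM)
step_time = min_decode_step_seconds(NUM_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
bound = is_bandwidth_bound(intensity, PEAK_FLOPS, BANDWIDTH)
print(f"intensity={intensity} FLOPs/byte, step>={step_time:.3f}s, bandwidth_bound={bound}")
```

Under these assumptions, a batch-1 decode step does about 2 FLOPs per byte read, far below the hypothetical ridge point of 1000 FLOPs per byte, so adding compute does not shorten the step while adding memory bandwidth does. Long reasoning outputs repeat this step once per generated token, which is why the bound matters for latency.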

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy