RPU -- A Reasoning Processing Unit
Matthew Adiletta, Gu-Yeon Wei, David Brooks

TL;DR
The paper introduces the RPU, a chiplet-based architecture optimized for memory-bandwidth-bound LLM inference workloads, reporting significantly lower latency and higher throughput than existing GPU systems.
Contribution
The RPU architecture combines a capacity-optimized high-bandwidth memory, a scalable chiplet design, and a decoupled microarchitecture to address the memory wall in LLM inference.
Findings
RPU achieves up to 45.3x lower latency than H100.
RPU provides 18.6x higher throughput on Llama3-405B.
These figures come from simulation.
Abstract
Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for an optimized system architecture for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable…
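The abstract's point about low arithmetic intensity can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not from the paper: the function names and the accelerator figures (1e15 FLOP/s peak compute, 3e12 bytes/s memory bandwidth) are hypothetical values chosen for illustration. It also assumes dense weights stored in 2 bytes each and ignores KV-cache and activation traffic, so the step time is only a lower bound.

```python
def arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one decode step.

    Assumption: each weight takes part in one multiply-add (2 FLOPs) per
    sequence in the batch and is read from memory once per step.
    """
    return 2.0 * batch_size / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) above which compute becomes the limit."""
    return peak_flops / mem_bandwidth


def min_decode_step_seconds(n_params: float, bytes_per_param: float,
                            mem_bandwidth: float) -> float:
    """Lower bound on one decode step: every weight must be streamed once."""
    return n_params * bytes_per_param / mem_bandwidth


# Hypothetical accelerator, for illustration only (not figures from the paper).
PEAK_FLOPS = 1e15      # FLOP/s
MEM_BANDWIDTH = 3e12   # bytes/s

intensity = arithmetic_intensity(batch_size=1)
ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
bandwidth_bound = intensity < ridge

# A 405e9-parameter model at 2 bytes per parameter.
step_time = min_decode_step_seconds(405e9, 2.0, MEM_BANDWIDTH)

print(f"intensity = {intensity:.1f} FLOPs/byte, ridge = {ridge:.1f} FLOPs/byte")
print(f"bandwidth-bound: {bandwidth_bound}, min step time = {step_time:.2f} s")
```

Under these assumptions, single-sequence decoding sits far below the ridge point, so adding compute does not shorten the step; only more memory bandwidth (or a larger batch, at the cost of latency) does.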
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Big Data and Digital Economy
