Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

Khyati Kiyawat; Zhenxing Fan; Yasas Seneviratne; Morteza Baradaran; Akhil Shekar; Zihan Xia; Mingu Kang; Kevin Skadron

arXiv:2511.12286·cs.AR·November 18, 2025



TL;DR

Sangam introduces a chiplet-based DRAM-PIM accelerator with CXL integration that significantly accelerates large language model inference and reduces energy consumption by overcoming traditional in-memory processing limitations.

Contribution

The paper proposes a novel chiplet-based architecture that decouples logic and memory, enabling advanced processing capabilities within DRAM modules for LLM inference acceleration.

Findings

01 Achieves up to 4.22x speedup in query latency

02 Provides over 9x increase in decoding throughput

03 Offers significant energy savings compared to an H100 GPU

Abstract

Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The…
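
The two memory-bound claims in the abstract can be illustrated with a rough back-of-the-envelope sketch. It is not taken from the paper: the function names, the simple traffic model (each operand and the result cross the memory interface exactly once), the 16-bit element size, and all model dimensions are assumptions chosen for illustration.

```python
def gemm_operational_intensity(m, k, n, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) product.

    Assumption: each input matrix and the output are moved to or from
    memory exactly once, with no reuse lost to limited on-chip storage.
    m = 1 corresponds to GEMV; small m corresponds to a "flat" GEMM.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size: one key and one value vector per token, layer, and KV head."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    k = n = 4096  # hypothetical weight-matrix shape, not from the paper
    for m in (1, 8, 64):
        oi = gemm_operational_intensity(m, k, n)
        print(f"m={m:3d}  OI={oi:.2f} FLOP/byte")

    # Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
    for seq_len in (4096, 32768):
        gib = kv_cache_bytes(32, 8, 128, seq_len) / 2**30
        print(f"seq_len={seq_len:6d}  KV cache={gib:.2f} GiB")
```

Under this model, the operational intensity of a GEMV stays below 1 FLOP/byte regardless of matrix size, because the weight matrix dominates the traffic and is used only once, while the KV cache grows linearly with context length.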


Taxonomy

Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Network Packet Processing and Optimization