Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing
Khyati Kiyawat, Zhenxing Fan, Yasas Seneviratne, Morteza Baradaran, Akhil Shekar, Zihan Xia, Mingu Kang, Kevin Skadron

TL;DR
Sangam introduces a chiplet-based DRAM-PIM accelerator with CXL integration that significantly accelerates large language model inference and reduces energy consumption by overcoming traditional in-memory processing limitations.
Contribution
The paper proposes a chiplet-based architecture that decouples logic and memory into separate chiplets, enabling more capable processing elements within DRAM modules for LLM inference acceleration.
Findings
Achieves up to 4.22x lower query latency
Provides over 9x higher decoding throughput
Offers significant energy savings compared to an H100 GPU
Abstract
Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The…
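
To make the abstract's "low operational intensity" argument concrete, the following is a minimal sketch, not taken from the paper, that estimates FLOPs per byte for a GEMV, a flat GEMM, and a square GEMM, and the KV cache footprint as context length grows. The function names, the FP16 element size, the model dimensions, and the accelerator peak and bandwidth figures are all illustrative assumptions, and the model assumes each operand is moved between memory and compute exactly once.

```python
# Illustrative roofline-style arithmetic; all parameters are assumptions,
# not values reported in the paper.

def operational_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) matrix multiply.

    Assumes each operand and the result cross the memory interface once.
    """
    flops = 2 * m * k * n                               # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def is_memory_bound(oi: float, peak_flops: float, bandwidth_bytes: float) -> bool:
    """A kernel is memory-bound when its OI is below the ridge point."""
    ridge_point = peak_flops / bandwidth_bytes           # FLOPs per byte
    return oi < ridge_point


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


if __name__ == "__main__":
    hidden = 4096                                        # hypothetical layer width
    # Hypothetical accelerator: 1 PFLOP/s peak, 3 TB/s memory bandwidth.
    peak, bw = 1e15, 3e12

    for label, m in [("GEMV (1 token)", 1), ("flat GEMM (8 tokens)", 8),
                     ("square GEMM", hidden)]:
        oi = operational_intensity(m, hidden, hidden)
        print(f"{label}: OI = {oi:.2f} FLOP/byte, "
              f"memory-bound = {is_memory_bound(oi, peak, bw)}")

    # Hypothetical model shape: 32 layers, 32 KV heads, head dimension 128.
    for seq_len in (4096, 32768):
        gib = kv_cache_bytes(32, 32, 128, seq_len) / 1024**3
        print(f"context {seq_len}: KV cache = {gib:.0f} GiB")
```

Under these assumptions the GEMV lands at roughly 1 FLOP/byte and the eight-token flat GEMM just under 8, both far below the hypothetical ridge point of about 333 FLOP/byte, while the square GEMM exceeds 1,300. The KV cache estimate scales linearly with context length, which is the capacity and bandwidth pressure the abstract describes.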
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Network Packet Processing and Optimization
