PRIMAL: Processing-In-Memory Based Low-Rank Adaptation for LLM Inference Accelerator
Yue Jiet Chong, Yimin Wang, Zhen Wu, Xuanyao Fong

TL;DR
PRIMAL is a processing-in-memory accelerator for LLM inference with low-rank adaptation. It reports 1.5x the throughput and 25x the energy efficiency of an NVIDIA H100 for Llama-13B with LoRA rank 8 on the Q and V projections.
Contribution
PRIMAL introduces a PIM-based architecture that combines an SRAM reprogramming and power gating (SRPG) scheme with optimized spatial mapping and dataflow orchestration to accelerate LoRA-adapted LLM inference.
Findings
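1.5x throughput improvement over NVIDIA H100 (Llama-13B, LoRA rank 8 on Q,V)
25x energy efficiency gain over NVIDIA H100 in the same setting
Pipelined LoRA updates that overlap SRAM reprogramming with computation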
Abstract
This paper presents PRIMAL, a processing-in-memory (PIM) based large language model (LLM) inference accelerator with low-rank adaptation (LoRA). PRIMAL integrates heterogeneous PIM processing elements (PEs), interconnected by a 2D-mesh inter-PE computational network (IPCN). A novel SRAM reprogramming and power gating (SRPG) scheme enables pipelined LoRA updates and sub-linear power scaling by overlapping reconfiguration with computation and gating idle resources. PRIMAL employs optimized spatial mapping and dataflow orchestration to minimize communication overhead, and achieves 1.5x throughput and 25x energy efficiency over NVIDIA H100 with LoRA rank 8 (Q,V) on Llama-13B.
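To give a rough sense of scale for the "LoRA rank 8 (Q,V)" setting and of what overlapping reconfiguration with computation buys, the sketch below counts adapter parameters and compares a serial schedule with an overlapped one. It is an illustration under stated assumptions, not PRIMAL's actual mapping or timing model: the Llama-13B shape (40 layers, hidden size 5120) is taken from the public model description rather than this summary, the function names are hypothetical, and the latency numbers in the usage note are made up.

```python
# Assumed Llama-13B shape (not stated in this summary): 40 layers, hidden size 5120.
HIDDEN = 5120
LAYERS = 40
RANK = 8
ADAPTED = ("q_proj", "v_proj")  # the "(Q,V)" setting from the abstract


def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank),
    and the adapted weight is W + B @ A."""
    return rank * (d_in + d_out)


def total_lora_params():
    """Adapter parameters across all layers for the adapted projections."""
    per_layer = len(ADAPTED) * lora_params(HIDDEN, HIDDEN, RANK)
    return per_layer * LAYERS


def serial_latency(reprogram, compute):
    """Every reprogramming step finishes before its compute step starts."""
    return sum(reprogram) + sum(compute)


def overlapped_latency(reprogram, compute):
    """Hypothetical double-buffered schedule: reprogramming for step i+1
    runs while step i computes, so only the longer of the two is paid."""
    total = reprogram[0]  # the first reprogramming step cannot be hidden
    for i, c in enumerate(compute):
        next_reprogram = reprogram[i + 1] if i + 1 < len(reprogram) else 0
        total += max(c, next_reprogram)
    return total
```

Under these assumptions the adapters hold 6,553,600 parameters, about 0.31% of the Q and V weights they modify, which is why reprogramming only the LoRA portion between tasks is plausible. With illustrative per-step costs of 4 units for reprogramming and 10 for compute over three steps, the serial schedule takes 42 units and the overlapped one takes 34.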
Taxonomy
Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques
