PRIMAL: Processing-In-Memory Based Low-Rank Adaptation for LLM Inference Accelerator

Yue Jiet Chong; Yimin Wang; Zhen Wu; Xuanyao Fong

arXiv:2601.13628·cs.AR·January 21, 2026

TL;DR

PRIMAL is a processing-in-memory (PIM) accelerator for LLM inference with low-rank adaptation (LoRA). On Llama-13B with LoRA rank 8 (Q,V), it reports 1.5x the throughput and 25x the energy efficiency of an NVIDIA H100.

Contribution

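It introduces a PIM-based architecture of heterogeneous processing elements connected by a 2D-mesh network, together with an SRAM reprogramming and power gating (SRPG) scheme and optimized spatial mapping and dataflow, to accelerate LLM inference with LoRA.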

Findings

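01. 1.5x throughput over NVIDIA H100 (Llama-13B, LoRA rank 8 on Q,V)

02. 25x energy efficiency over NVIDIA H100 (same configuration)

03. Pipelined LoRA updates and sub-linear power scaling, obtained by overlapping SRAM reprogramming with computation and power gating idle resources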
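
The pipelined-update finding can be illustrated with a toy timing model. This is only a sketch under stated assumptions, not the paper's SRPG scheme: the function names, the stage durations, and the rule that reprogramming for the next stage runs concurrently with the current stage's computation are all hypothetical, and power gating is not modeled.

```python
def serial_time(compute, reprogram):
    """Total time when every reprogramming step blocks computation."""
    return sum(compute) + sum(reprogram)


def overlapped_time(compute, reprogram):
    """Total time when reprogramming for stage i+1 overlaps compute of stage i.

    Assumption: the first reprogramming step cannot be hidden, and a stage
    starts only after both the previous compute and its own reprogramming
    have finished.
    """
    total = reprogram[0]  # the first adapter load is fully exposed
    for i, c in enumerate(compute):
        # Reprogramming for the next stage (if any) runs during this compute.
        nxt = reprogram[i + 1] if i + 1 < len(reprogram) else 0
        total += max(c, nxt)
    return total


# Hypothetical durations in arbitrary time units.
compute = [4, 4, 4]
reprogram = [1, 1, 1]
t_serial = serial_time(compute, reprogram)
t_overlap = overlapped_time(compute, reprogram)
print(t_serial, t_overlap)
```

In this model, overlap hides every reprogramming step except the first, as long as each step is shorter than the computation it runs alongside.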

Abstract

This paper presents PRIMAL, a processing-in-memory (PIM) based large language model (LLM) inference accelerator with low-rank adaptation (LoRA). PRIMAL integrates heterogeneous PIM processing elements (PEs), interconnected by a 2D-mesh inter-PE computational network (IPCN). A novel SRAM reprogramming and power gating (SRPG) scheme enables pipelined LoRA updates and sub-linear power scaling by overlapping reconfiguration with computation and gating idle resources. PRIMAL employs optimized spatial mapping and dataflow orchestration to minimize communication overhead, and achieves $1.5\times$ the throughput and $25\times$ the energy efficiency of an NVIDIA H100 with LoRA rank 8 (Q,V) on Llama-13B.
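
For readers unfamiliar with LoRA, the standard formulation adds a low-rank update to a frozen weight matrix, computing y = Wx + (alpha / r) * B(Ax). The sketch below shows that computation in plain Python, plus a count of adapter parameters. It is illustrative only and says nothing about how PRIMAL maps these operations onto its PEs. The helper names, the toy matrices, and the `alpha` value are hypothetical. The hidden size of 5120 and the 40 layers used in the count are the commonly cited Llama-13B dimensions and are not stated in the abstract.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the number of rows of A."""
    r = len(a)
    scale = alpha / r
    base = matvec(w, x)            # frozen base projection
    low = matvec(b, matvec(a, x))  # low-rank path: project down, then up
    return [y + scale * d for y, d in zip(base, low)]


def lora_param_count(d_in, d_out, rank, n_matrices):
    """Adapter parameters: A is (rank x d_in) and B is (d_out x rank) per matrix."""
    return n_matrices * rank * (d_in + d_out)


# Toy rank-1 example on a 2x2 identity weight.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]    # shape (1, 2)
b = [[0.5], [0.0]]  # shape (2, 1)
y = lora_forward(w, a, b, [2.0, 3.0], alpha=1.0)
print(y)

# Rank 8 on Q and V in each of 40 layers with hidden size 5120 (assumed dims).
n_params = lora_param_count(5120, 5120, 8, n_matrices=2 * 40)
print(n_params)
```

Under these assumed dimensions, the rank-8 (Q,V) adapters total about 6.5 million parameters, far fewer than the base model's weights, so adapter updates involve only a small share of the stored parameters.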

Taxonomy

Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques