RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write
Yan-Cheng Guo, Tian-Sheuan Chang, and Jian-Wei Su

TL;DR
This paper introduces RCW-CIM, a digital CIM-based LLM accelerator that reduces weight update latency through a read-compute/write architecture, nonlinear operator fusion, and a new dataflow.
Contribution
It proposes a read-compute/write (RCW) architecture, nonlinear operator fusion, and a new dataflow that together reduce latency and DRAM access for LLM acceleration.
Findings
The RCW architecture reduces decoding computing latency by 21.59% on Llama2-7B.
Nonlinear operator fusion reduces latency by 69.17%.
DRAM access and CIM weight updates are reduced by 51.6% and 87.6%, respectively.
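Percentage reductions can be easier to compare as "times fewer" factors. The sketch below does only that arithmetic on the figures listed above. The helper name `factor_from_reduction` is hypothetical, and the factors should not be multiplied together, since each figure is measured on a different quantity.

```python
def factor_from_reduction(reduction_pct: float) -> float:
    """Convert 'reduced by P percent' into the equivalent 'N times lower' factor."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction must be in [0, 100)")
    # If the new value is (1 - P/100) of the old one, the old value is
    # 1 / (1 - P/100) times the new one.
    return 1.0 / (1.0 - reduction_pct / 100.0)


# Figures as reported in the Findings list above.
reported = {
    "decoding computing latency (RCW, Llama2-7B)": 21.59,
    "latency (nonlinear operator fusion)": 69.17,
    "DRAM access": 51.6,
    "CIM weight updates": 87.6,
}

for name, pct in reported.items():
    print(f"{name}: {pct}% lower is about {factor_from_reduction(pct):.2f}x")
```

For example, a 21.59% latency reduction corresponds to roughly a 1.28x speedup on that stage, while an 87.6% reduction in weight updates means about 8 times fewer updates.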
Abstract
Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while maintaining high precision for superior accuracy. However, existing CIM architectures often overlook weight update latency, which becomes critical because LLM weights are far larger than the capacity of a single CIM macro. To address this issue, this paper proposes a read-compute/write (RCW) architecture that effectively minimizes weight update latency, along with a nonlinear operator fusion that further mitigates dependency-induced latency. The proposed RCW reduces decoding computing latency by 21.59% on the Llama2-7B model. In addition, the nonlinear operator fusion mechanism achieves a 69.17% latency reduction through efficient partial accumulation and group-based approximation. Furthermore, a…
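To illustrate why weight update latency matters when the weights exceed one macro's capacity, here is a toy latency model. It is not the paper's microarchitecture: it is a generic pipeline estimate that assumes the weights are split into equal tiles and that writing the next tile can fully overlap with computing on the current one. The function names and cycle counts are hypothetical.

```python
def serialized_latency(n_tiles: int, t_write: int, t_compute: int) -> int:
    """Each tile is written into the macro, then computed on, with no overlap."""
    return n_tiles * (t_write + t_compute)


def overlapped_latency(n_tiles: int, t_write: int, t_compute: int) -> int:
    """Assumed ideal overlap: tile i+1 is written while tile i is computed."""
    if n_tiles == 0:
        return 0
    # The first write cannot be hidden; after that, each step costs the slower
    # of the two activities; the last compute has no write left to overlap.
    return t_write + (n_tiles - 1) * max(t_write, t_compute) + t_compute


if __name__ == "__main__":
    # Hypothetical cycle counts chosen only to show the trend.
    n, w, c = 4, 10, 30
    base = serialized_latency(n, w, c)
    hidden = overlapped_latency(n, w, c)
    print(f"serialized: {base} cycles, overlapped: {hidden} cycles")
```

In this model the benefit of overlap grows with the number of tiles and is largest when write and compute times are similar; when one of them dominates, the total is bounded by the slower activity.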
