FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

Zihao Xuan; Jia Chen; Yewen Li; Wei Xuan; Hegan Chen; Xiao Huo; Fengbin Tu

arXiv:2604.25317·cs.AR·April 29, 2026

TL;DR

FusionCIM is a compute-in-memory (CIM) accelerator architecture that combines operator fusion with a QO-stationary dataflow for large language model inference, reporting up to 3.86x energy savings and a 1.98x speedup over prior CIM-based designs on LLaMA-3.

Contribution

The paper presents a hybrid CIM pipeline that fuses the attention matrix multiplications, a QO-stationary dataflow that improves on-chip data reuse, and a pattern-aware online-softmax mechanism that reduces exponential rescaling overhead.

Findings

01 Achieves up to 3.86x energy savings over prior SOTA CIM-based designs.

02 Realizes a 1.98x speedup on the LLaMA-3 model.

03 Attains 29.4 TOPS/W energy efficiency at the system level.

Abstract

In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps the QK^T computation onto inner-product-based CIM (IP-CIM) and the PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86x energy savings and a 1.98x speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W…
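
To make the "exponential rescaling overhead" in item (3) concrete, the following is a minimal sketch of the standard online-softmax recurrence used in fused attention, for a single query row. It is not the paper's pattern-aware mechanism, which the abstract does not describe in detail. The function names and the `rescales` counter are hypothetical, introduced only for illustration.

```python
import math

def naive_attention(q, K, V):
    """Reference: softmax(q . K^T) V for one query row, computed in two passes."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(V[0])
    return [sum(w * v[d] for w, v in zip(weights, V)) / total for d in range(dim)]

def online_softmax_attention(q, K, V):
    """Single pass over K/V rows with a running max and running sum.

    Returns the attention output and the number of times the partial
    results had to be rescaled (hypothetical counter, for illustration).
    """
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    rescales = 0
    for k, v in zip(K, V):
        # Attention score for this key: one inner product q . k.
        s = sum(qi * ki for qi, ki in zip(q, k))
        if s > running_max:
            if running_sum > 0.0:
                # The max changed, so earlier partial results are rescaled.
                scale = math.exp(running_max - s)
                running_sum *= scale
                acc = [a * scale for a in acc]
                rescales += 1
            running_max = s
        p = math.exp(s - running_max)
        running_sum += p
        # Accumulate this key's contribution: p scales the value row v.
        acc = [a + p * vd for a, vd in zip(acc, v)]
    return [a / running_sum for a in acc], rescales
```

In this sketch a rescale happens only when a new score exceeds the running maximum, so the cost depends on the order and distribution of the scores. That dependence is presumably what a pattern-aware scheme would exploit, though how FusionCIM does so is not stated in the abstract.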
