Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration
Md Zesun Ahmed Mia, Jiahui Duan, Kai Ni, Abhronil Sengupta

TL;DR
TrilinearCIM is a Double-Gate FeFET-based compute-in-memory architecture that computes Transformer attention in memory without dynamic reprogramming, achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM.
Contribution
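It introduces a three-operand multiply-accumulate primitive, realized through DG-FeFET back-gate modulation, that computes attention in memory and removes the dynamic ferroelectric reprogramming that conventional CIM architectures need for attention operands.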
Findings
Outperforms conventional CIM on seven of nine GLUE tasks (BERT-base).
Achieves up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM.
Performs complete Transformer attention in NVM cores without dynamic reprogramming.
Abstract
Self-attention in Transformers generates dynamic operands that force conventional Compute-in-Memory (CIM) accelerators into costly non-volatile memory (NVM) reprogramming cycles, degrading throughput and stressing device endurance. Existing solutions either reduce but retain NVM writes through matrix decomposition or sparsity, or move attention computation to digital CMOS at the expense of NVM density. We present TrilinearCIM, a Double-Gate FeFET (DG-FeFET)-based architecture that uses back-gate modulation to realize a three-operand multiply-accumulate primitive for in-memory attention computation without dynamic ferroelectric reprogramming. Evaluated on BERT-base (GLUE) and ViT-base (ImageNet and CIFAR), TrilinearCIM outperforms conventional CIM on seven of nine GLUE tasks while achieving up to 46.6% energy reduction and 20.4% latency improvement over conventional FeFET CIM at 37.3%…
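To make the idea of a three-operand multiply-accumulate concrete, here is a minimal numerical sketch, not the paper's circuit mapping. It assumes one plausible formulation: the attention score between two tokens, q·k with q = W_Q^T x_s and k = W_K^T x_t, equals the bilinear form x_s^T M x_t with M = W_Q W_K^T. In that form M is static (it could stay programmed in the array) while both token vectors are dynamic inputs, so no weight rewrite is needed per token pair. The function names (`trilinear_mac`, `fused_weights`, `project`) and the fused-matrix mapping are illustrative assumptions, and device effects such as quantization, noise, and back-gate nonlinearity are ignored.

```python
import random


def trilinear_mac(a, w, b):
    """Three-operand MAC: sum over i, j of a[i] * w[i][j] * b[j].

    `w` plays the role of the static stored operand; `a` and `b` are the
    two dynamic operands (hypothetical mapping, for illustration only).
    """
    return sum(a[i] * w[i][j] * b[j]
               for i in range(len(a))
               for j in range(len(b)))


def project(w, x):
    """Return W^T x for W of shape (d, h) and x of length d."""
    return [sum(w[i][k] * x[i] for i in range(len(x)))
            for k in range(len(w[0]))]


def fused_weights(wq, wk):
    """Precompute the static matrix M = W_Q W_K^T of shape (d, d)."""
    d, h = len(wq), len(wq[0])
    return [[sum(wq[i][k] * wk[j][k] for k in range(h)) for j in range(d)]
            for i in range(d)]


def attention_score_two_step(x_s, x_t, wq, wk):
    """Reference path: form q and k explicitly, then take their dot product."""
    q = project(wq, x_s)
    k = project(wk, x_t)
    return sum(qi * ki for qi, ki in zip(q, k))


rng = random.Random(0)
d, h = 4, 3  # toy embedding and head dimensions (assumed values)
wq = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(d)]
wk = [[rng.uniform(-1, 1) for _ in range(h)] for _ in range(d)]
x_s = [rng.uniform(-1, 1) for _ in range(d)]
x_t = [rng.uniform(-1, 1) for _ in range(d)]

m = fused_weights(wq, wk)                      # static, computed once
score_trilinear = trilinear_mac(x_s, m, x_t)   # two dynamic inputs, one stored
score_reference = attention_score_two_step(x_s, x_t, wq, wk)

# Both paths agree up to floating-point rounding.
assert abs(score_trilinear - score_reference) < 1e-9
```

In the two-step path, the intermediate vector k is itself a dynamic operand that a conventional two-operand crossbar would have to store before multiplying by q; the trilinear form sidesteps that by keeping only the token-independent matrix in memory.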
