Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models
Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci

TL;DR
This paper introduces an analog in-memory computing architecture using gain cells for self-attention in Transformers, significantly reducing latency and energy consumption for large language models.
Contribution
It proposes a novel in-memory computing architecture with gain cells and an initialization algorithm that enables efficient, low-power Transformer inference without retraining.
Findings
Reduces attention latency by up to two orders of magnitude.
Cuts energy consumption by up to five orders of magnitude.
Achieves text processing performance comparable to GPT-2 without retraining.
Abstract
Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text processing performance comparable to GPT-2 without…
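The caching idea the abstract describes can be illustrated in software. The following is a minimal sketch, assuming standard scaled dot-product attention with softmax on plain Python lists; it does not model the gain-cell circuits or their non-idealities, and the names `KVCache`, `attend`, and `decode_step` are hypothetical, chosen only for this illustration. At each generation step, the new token's key and value projections are appended to the cache, and the query is compared against every stored key, so earlier projections are never recomputed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class KVCache:
    """Hypothetical cache holding one key and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def attend(query, cache):
    """Scaled dot-product attention of one query against all cached tokens."""
    d = len(query)
    # One dot product per cached key; this is the step the paper's
    # architecture performs in parallel in the analog domain.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in cache.keys
    ]
    weights = softmax(scores)
    dim_v = len(cache.values[0])
    # Weighted sum of the cached value vectors.
    return [
        sum(w * v[j] for w, v in zip(weights, cache.values))
        for j in range(dim_v)
    ]


def decode_step(query, key, value, cache):
    """Store the new token's projections, then attend over the whole cache."""
    cache.append(key, value)
    return attend(query, cache)
```

In this sketch the cache grows by one entry per generated token, and every step reads all stored keys and values. That read traffic is the data movement the abstract identifies as the latency and energy bottleneck.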
Taxonomy
Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Dense Connections · Dropout · Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Attention Dropout · Layer Normalization · Byte Pair Encoding
