Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models

Nathan Leroux; Paul-Philipp Manea; Chirag Sudarshan; Jan Finkbeiner; Sebastian Siegel; John Paul Strachan; Emre Neftci

arXiv:2409.19315·cs.NE·November 26, 2024

TL;DR

This paper introduces an analog in-memory computing architecture that uses gain cells to compute self-attention in Transformers, reducing attention latency by up to two orders of magnitude and energy consumption by up to five orders of magnitude for large language models.

Contribution

It proposes a novel in-memory computing architecture with gain cells and an initialization algorithm that enables efficient, low-power Transformer inference without retraining.

Findings

1. Reduces attention latency by up to two orders of magnitude.
2. Cuts energy consumption by up to five orders of magnitude.
3. Achieves GPT-2 level performance without retraining.

Abstract

Transformer networks, driven by self-attention, are central to Large Language Models. In generative Transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, GPU-stored projections must be loaded into SRAM for each new generation step, causing latency and energy bottlenecks. We present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text processing performance comparable to GPT-2 without…
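
The caching scheme the abstract describes can be illustrated with a minimal sketch. This is ordinary digital softmax attention in pure Python, not the paper's analog gain-cell circuit; the names `KVCache`, `attend`, and `step` are hypothetical and chosen only for illustration. It shows why each generation step needs one write (the new token's key and value projections) followed by dot products against every stored token, which is the access pattern an in-memory design targets.

```python
import math


def attend(query, key_cache, value_cache):
    """Softmax attention of one query vector against all cached keys/values."""
    d = len(query)
    # One dot product per cached token: the part that can run in parallel.
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in key_cache
    ]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the cached value vectors.
    dim_v = len(value_cache[0])
    return [
        sum(w * v[j] for w, v in zip(weights, value_cache))
        for j in range(dim_v)
    ]


class KVCache:
    """Hypothetical cache: stores key/value projections of past tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        # Write the new token's projections once, instead of recomputing
        # the projections of all previous tokens at every step.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)


cache = KVCache()
first = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 3.0])
second = cache.step([0.0, 0.0], [0.0, 1.0], [4.0, 5.0])
```

In this sketch the cache grows by one entry per generated token, and every step reads the whole cache. On a GPU that read is the memory transfer the abstract identifies as the bottleneck.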

Code & Models

Repositories

NathanLeroux-git/GainCellAttention (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Advanced Neural Network Applications · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Dense Connections · Dropout · Discriminative Fine-Tuning · Cosine Annealing · Linear Layer · Attention Dropout · Layer Normalization · Byte Pair Encoding