Enabling Energy-Efficient Deployment of Large Language Models on Memristor Crossbar: A Synergy of Large and Small

Zhehui Wang; Tao Luo; Cheng Liu; Weichen Liu; Rick Siow Mong Goh; Weng-Fai Wong

arXiv:2410.15977·cs.AI·October 22, 2024

TL;DR

This paper introduces a novel memristor crossbar architecture that enables energy-efficient deployment of large language models like BERT_Large on a single chip, overcoming size and operation limitations with significant improvements in area and energy efficiency.

Contribution

The paper presents a new architecture that allows large language models to be efficiently deployed on memristor crossbars, addressing size, multi-head attention, and nonlinear operation challenges.

Findings

01 Achieves up to 39X reduction in area overhead compared to traditional memristor crossbars.

02 Realizes up to 18X energy savings over traditional memristor crossbars.

03 Demonstrates at least 68X reduction in area-delay product compared to TPU/GPU systems.

Abstract

Large language models (LLMs) have garnered substantial attention due to their promising applications in diverse domains. Nevertheless, the increasing size of LLMs comes with a significant surge in the computational requirements for training and deployment. Memristor crossbars have emerged as a promising solution, having demonstrated a small footprint and remarkably high energy efficiency in computer vision (CV) models. Memristors possess higher density compared to conventional memory technologies, making them highly suitable for effectively managing the extreme model size associated with LLMs. However, deploying LLMs on memristor crossbars faces three major challenges. Firstly, the size of LLMs increases rapidly, already surpassing the capabilities of state-of-the-art memristor chips. Secondly, LLMs often incorporate multi-head attention blocks, which involve non-weight stationary…
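The second challenge can be illustrated with a small sketch. It is not the paper's architecture. It assumes an idealized crossbar in which each output column sums input times stored value, and it ignores analog non-idealities and data converters. The names `Crossbar`, `linear_layer`, and `attention_scores` are hypothetical. The point it tries to show: a linear layer reuses one stored weight matrix for every input, while the attention score product needs the keys, which are derived from the input, to be written into the array for each new sequence.

```python
class Crossbar:
    """Idealized crossbar (assumption): stores a matrix, multiplies vectors by it."""

    def __init__(self):
        self.cells = None   # stored matrix, cells[row][col]
        self.writes = 0     # how many times the array was (re)programmed

    def program(self, matrix):
        # Writing the cells is the costly step on real devices.
        self.cells = [list(row) for row in matrix]
        self.writes += 1

    def multiply(self, vector):
        # Output j is the sum over rows i of vector[i] * cells[i][j].
        rows, cols = len(self.cells), len(self.cells[0])
        assert len(vector) == rows
        return [sum(vector[i] * self.cells[i][j] for i in range(rows))
                for j in range(cols)]


def linear_layer(xbar, tokens):
    # Weight-stationary: the matrix is already stored; only inputs change.
    return [xbar.multiply(t) for t in tokens]


def attention_scores(xbar, queries, keys):
    # Not weight-stationary: K^T depends on the input sequence,
    # so it must be written into the array before it can be used.
    k_transposed = [list(col) for col in zip(*keys)]
    xbar.program(k_transposed)
    return [xbar.multiply(q) for q in queries]


if __name__ == "__main__":
    weights = Crossbar()
    weights.program([[1, 0], [0, 2]])
    linear_layer(weights, [[1, 1], [2, 3]])
    linear_layer(weights, [[4, 5]])
    print("linear layer writes:", weights.writes)

    scores = Crossbar()
    attention_scores(scores, [[1, 0]], [[1, 2], [3, 4]])
    attention_scores(scores, [[0, 1]], [[5, 6]])
    print("attention writes:", scores.writes)
```

In this toy model the linear layer programs its array once regardless of how many sequences pass through, while the attention product reprograms on every sequence.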

Taxonomy

Methods: Linear Layer · Attention Is All You Need · Softmax · Multi-Head Attention