LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs
Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

TL;DR
LUT-LLM introduces a memory-based FPGA accelerator for large language models, leveraging table lookups and vector quantization to significantly improve inference speed and energy efficiency compared to GPUs.
Contribution
This work presents the first FPGA-based LLM inference method using memory-based computations with vector quantization, enabling scalable deployment of large models.
Findings
Achieves 1.10 to 3.29 times faster generation speed than GPUs.
Provides 3.05 to 6.60 times higher energy efficiency.
Reduces arithmetic operations by a factor of 4.
Abstract
The rapid development of large language models (LLMs) has greatly enhanced everyday applications. While many FPGA-based accelerators, with flexibility for fine-grained data control, exhibit superior speed and energy efficiency compared to GPUs, recent GPU-specific optimizations have diminished this advantage. When limited to arithmetic-based computation, FPGAs often underperform GPUs due to their comparatively fewer computational resources. To address this challenge, we exploit a key advantage of FPGAs over GPUs: abundant distributed on-chip memory embedded among computational units. We believe that shifting LLM inference from arithmetic-based to memory-based computations through table lookups can improve the efficiency on FPGAs to compete with GPUs. However, existing methods are inefficient or unable to scale and deploy language models due to algorithm and architecture design…
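To make the idea of replacing arithmetic with table lookups concrete, here is a minimal sketch of a generic vector-quantized matrix-vector product. It is not the LUT-LLM design: the function names (`build_tables`, `nearest`, `lookup_matvec`), the sub-vector length `v`, and the toy codebooks are all assumptions chosen for illustration. The input is split into sub-vectors, each sub-vector is snapped to its nearest codebook centroid, and the centroid's precomputed dot products with the weights are read from a table and accumulated.

```python
# Illustrative sketch only: a generic vector-quantization lookup scheme,
# not the accelerator described in the paper. All names are hypothetical.

def build_tables(weights, codebooks, v):
    """Precompute tables[g][k][j] = dot(codebooks[g][k], weight rows of group g)[j]."""
    n_out = len(weights[0])
    tables = []
    for g, codebook in enumerate(codebooks):
        rows = weights[g * v:(g + 1) * v]  # the v weight rows covered by group g
        tables.append([
            [sum(c[i] * rows[i][j] for i in range(v)) for j in range(n_out)]
            for c in codebook
        ])
    return tables

def nearest(codebook, sub):
    """Index of the centroid closest to sub (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], sub)),
    )

def lookup_matvec(x, codebooks, tables, v):
    """Approximate x @ W using one table read and one add per group and output."""
    n_out = len(tables[0][0])
    y = [0.0] * n_out
    for g, codebook in enumerate(codebooks):
        k = nearest(codebook, x[g * v:(g + 1) * v])  # quantize the sub-vector
        row = tables[g][k]                           # precomputed partial sums
        for j in range(n_out):
            y[j] += row[j]                           # accumulate, no multiply
    return y

# Toy example: a 4x2 weight matrix, sub-vectors of length 2, 2 centroids per group.
V = 2
WEIGHTS = [[1, 2], [3, 4], [5, 6], [7, 8]]
CODEBOOKS = [[[1, 0], [0, 1]], [[1, 1], [2, 0]]]
TABLES = build_tables(WEIGHTS, CODEBOOKS, V)

x = [0, 1, 2, 0]  # each sub-vector coincides with a centroid, so the result is exact
y = lookup_matvec(x, CODEBOOKS, TABLES, V)
exact = [sum(x[i] * WEIGHTS[i][j] for i in range(len(x))) for j in range(2)]
```

In this sketch the multiply-accumulate work over the weights moves into the offline `build_tables` step, so the online path reduces to a centroid search plus table reads and additions. The result is only exact when every sub-vector coincides with a centroid, as in the toy input; otherwise accuracy depends on how well the codebooks fit the activations.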
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Advanced Neural Network Applications
