LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models

Gunho Park, Baeseong Park, Minsub Kim, Sungjae Lee, Jeonghoon Kim, Beomseok Kwon, Se Jung Kwon, Byeongwook Kim, Youngjoo Lee, and Dongsoo Lee

arXiv:2206.09557·cs.DC·April 2, 2024·20 cites

TL;DR

LUT-GEMM introduces a novel LUT-based kernel for quantized matrix multiplication that accelerates large-scale language model inference by eliminating dequantization, achieving significant speed-ups on GPU.

Contribution

It presents LUT-GEMM, a new kernel for quantized matrix multiplication that reduces computational costs and improves inference speed in large language models.

Findings

01. Achieves 2.1× speed-up on OPT-175B with 3-bit quantization.

02. Eliminates resource-intensive dequantization process.

03. Demonstrates flexible trade-off between compression and accuracy.

Abstract

Recent advances in self-supervised learning and the Transformer architecture have significantly improved natural language processing (NLP), achieving remarkably low perplexity. However, the growing size of NLP models introduces a memory wall problem during the generation phase. To mitigate this issue, recent efforts have focused on quantizing model weights to sub-4-bit precision while preserving full precision for activations, resulting in practical speed-ups during inference on a single GPU. However, these improvements primarily stem from reduced memory movement, which necessitates a resource-intensive dequantization process rather than actual computational reduction. In this paper, we introduce LUT-GEMM, an efficient kernel for quantized matrix multiplication, which not only eliminates the resource-intensive dequantization process but also reduces computational costs compared to…
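The idea of replacing dequantization with table lookups can be illustrated with a small sketch. This is not the authors' GPU kernel. It assumes, beyond what the excerpt above states, that each weight is stored as a few planes of +1/-1 signs with one scale per row and plane, and that the activation vector is cut into groups of `MU` elements. The names `build_luts`, `pack_keys`, `lut_matvec`, and `dequant_matvec` are hypothetical and exist only for this illustration.

```python
import random

MU = 4  # assumed group width: each table covers MU activation elements


def build_luts(x, mu=MU):
    """For each mu-wide slice of x, tabulate all 2**mu signed partial sums."""
    luts = []
    for start in range(0, len(x), mu):
        chunk = x[start:start + mu]
        table = []
        for key in range(2 ** mu):
            # Bit j of key set means +x_j, clear means -x_j.
            table.append(sum(v if (key >> j) & 1 else -v
                             for j, v in enumerate(chunk)))
        luts.append(table)
    return luts


def pack_keys(sign_row, mu=MU):
    """Pack a row of +1/-1 signs into one integer key per mu-wide group."""
    keys = []
    for start in range(0, len(sign_row), mu):
        key = 0
        for j, s in enumerate(sign_row[start:start + mu]):
            if s > 0:
                key |= 1 << j
        keys.append(key)
    return keys


def lut_matvec(alphas, packed, luts):
    """Matrix-vector product using only table lookups and per-plane scales."""
    out = []
    for row_alphas, row_planes in zip(alphas, packed):
        acc = 0.0
        for a, keys in zip(row_alphas, row_planes):
            acc += a * sum(luts[g][k] for g, k in enumerate(keys))
        out.append(acc)
    return out


def dequant_matvec(alphas, signs, x):
    """Reference path: rebuild every weight, then take ordinary dot products."""
    out = []
    for row_alphas, row_planes in zip(alphas, signs):
        w = [sum(a * plane[j] for a, plane in zip(row_alphas, row_planes))
             for j in range(len(x))]
        out.append(sum(wj * xj for wj, xj in zip(w, x)))
    return out


# Toy problem: 3 output rows, 8 inputs, 3 sign planes per weight.
random.seed(0)
rows, cols, bits = 3, 8, 3
x = [random.uniform(-1, 1) for _ in range(cols)]
signs = [[[random.choice((-1, 1)) for _ in range(cols)]
          for _ in range(bits)] for _ in range(rows)]
alphas = [[random.uniform(0.1, 1.0) for _ in range(bits)]
          for _ in range(rows)]
packed = [[pack_keys(plane) for plane in row] for row in signs]

y_lut = lut_matvec(alphas, packed, build_luts(x))
y_ref = dequant_matvec(alphas, signs, x)
```

In this sketch the tables depend only on the activations, so they are built once and reused by every row and every sign plane. The lookup path never reconstructs a full-precision weight, which is the property the abstract attributes to LUT-GEMM.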

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Cosine Annealing · Softmax · Linear Warmup With Cosine Annealing · Attention Dropout · Dropout