Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Han Guo; William Brandon; Radostin Cholakov; Jonathan Ragan-Kelley; Eric P. Xing; Yoon Kim

arXiv:2407.10960·cs.LG·January 20, 2025

TL;DR

This paper introduces FLUTE, a flexible lookup table engine that fuses dequantization with matrix multiplication for LUT-quantized LLMs, speeding up inference and supporting quantization of models such as LLaMA3.

Contribution

The paper presents FLUTE, a flexible, optimized kernel for LUT-quantized LLMs, and demonstrates its effectiveness in speeding up inference and improving quantization methods.

Findings

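01. At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel is 2-4x faster than existing GEMM kernels.

02. Applied to quantized LLaMA3, FLUTE yields a 1.5-2x increase in end-to-end throughput.

03. The quantized models remain competitive with strong baselines in quantization quality.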

Abstract

The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup…
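The ideas in the abstract (non-uniform lookup table quantization, 3-bit packing, and dequantizing on the fly during the matmul) can be illustrated with a toy sketch in plain Python. This is not FLUTE's kernel: the table values and the names `LUT`, `quantize`, `pack`, `unpack`, and `lut_matvec` are hypothetical, and the offline weight restructuring and vectorized table lookups described in the paper are not reproduced here.

```python
"""Toy LUT-quantized matrix-vector product (illustrative only, not FLUTE)."""

BITS = 3               # a bit width that does not divide a byte evenly
LEVELS = 1 << BITS     # 8 table entries for 3-bit codes

# Hypothetical non-uniform lookup table, chosen by hand for this example.
LUT = [-1.0, -0.5, -0.25, -0.1, 0.1, 0.25, 0.5, 1.0]


def quantize(weights):
    """Map each weight to the index of its nearest table entry."""
    return [min(range(LEVELS), key=lambda i: abs(LUT[i] - w)) for w in weights]


def pack(codes):
    """Pack 3-bit codes back to back into bytes (little-endian bit order)."""
    acc = 0
    for pos, code in enumerate(codes):
        acc |= code << (pos * BITS)
    n_bytes = (len(codes) * BITS + 7) // 8
    return acc.to_bytes(n_bytes, "little")


def unpack(packed, count):
    """Recover `count` 3-bit codes; codes may straddle byte boundaries."""
    acc = int.from_bytes(packed, "little")
    mask = LEVELS - 1
    return [(acc >> (pos * BITS)) & mask for pos in range(count)]


def lut_matvec(packed_rows, cols, x):
    """Unpack, look up, and accumulate row by row.

    Only the packed codes are stored; no full-precision copy of the
    weight matrix is kept.
    """
    out = []
    for packed in packed_rows:
        codes = unpack(packed, cols)
        out.append(sum(LUT[c] * xi for c, xi in zip(codes, x)))
    return out


if __name__ == "__main__":
    row = [0.9, -0.45, 0.12, -0.08, 0.3, -1.2, 0.0, 0.55]
    packed = pack(quantize(row))
    # Eight 3-bit codes occupy 3 bytes, versus 16 bytes for eight 16-bit weights.
    print(len(packed), lut_matvec([packed], len(row), [1.0] * len(row)))
```

Because 3 does not divide 8, individual codes cross byte boundaries in the packed buffer, which is why unpacking needs shifts and masks. The abstract names the bit manipulation associated with unpacking as a cost that FLUTE's offline restructuring is meant to minimize.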

Code & Models

Repositories

hanguo97/flute · jax · Official

Videos

Fast Matrix Multiplications for Lookup Table-Quantized LLMs · underline

Taxonomy

Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Distributed and Parallel Computing Systems