Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim

TL;DR
This paper introduces FLUTE, a novel lookup table engine that accelerates matrix multiplication in LUT-quantized LLMs, significantly improving inference speed and enabling efficient quantization of models like LLaMA3.
Contribution
The paper presents FLUTE, a flexible, optimized kernel for LUT-quantized LLMs, and demonstrates its effectiveness in speeding up inference and improving quantization methods.
Findings
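At batch sizes below 32 and a quantization group size of 128, the FLUTE kernel can be 2-4x faster than existing GEMM kernels.
Quantizing LLaMA3 with a FLUTE-backed extension of lookup table-based NormalFloat quantization yields a 1.5-2x increase in end-to-end throughput.
Quantization quality remains competitive with strong baselines.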
Abstract
The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup…
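The weight format described in the abstract can be illustrated with a small NumPy reference. This is only a sketch of what LUT quantization at a non-evenly-divisible bit width means, not the FLUTE kernel: it does not perform the offline restructuring or the table vectorization and duplication mentioned above, and it dequantizes the whole matrix before the matmul instead of fusing the two steps. The function names (`pack_codes`, `unpack_codes`, `lut_matmul`), the least-significant-bit-first packing layout, and the random table values are assumptions made for illustration.

```python
import numpy as np

def pack_codes(codes, bits):
    """Pack integer codes (each < 2**bits) into a dense uint8 bitstream."""
    # Expand every code into its `bits` binary digits, least-significant first.
    bit_matrix = (codes[:, None] >> np.arange(bits)) & 1
    return np.packbits(bit_matrix.astype(np.uint8).ravel(), bitorder="little")

def unpack_codes(packed, bits, count):
    """Recover `count` integer codes from the packed bitstream."""
    flat = np.unpackbits(packed, bitorder="little")[: count * bits]
    bit_matrix = flat.reshape(count, bits).astype(np.int64)
    # Reassemble each code from its binary digits.
    return (bit_matrix << np.arange(bits)).sum(axis=1)

def lut_matmul(x, packed, lut, shape, bits):
    """Unpack codes, dequantize through the lookup table, then multiply."""
    rows, cols = shape
    codes = unpack_codes(packed, bits, rows * cols).reshape(rows, cols)
    weights = lut[codes]          # non-uniform dequantization: one table lookup per weight
    return x @ weights.T

# Illustrative setup: 3-bit codes index an 8-entry table of arbitrary values.
rng = np.random.default_rng(0)
bits, rows, cols = 3, 4, 16
lut = np.sort(rng.standard_normal(2 ** bits)).astype(np.float32)
codes = rng.integers(0, 2 ** bits, size=(rows, cols))
packed = pack_codes(codes.ravel(), bits)

x = rng.standard_normal((2, cols)).astype(np.float32)
y = lut_matmul(x, packed, lut, (rows, cols), bits)
reference = x @ lut[codes].T
```

Here 64 weights occupy 24 bytes instead of the 128 bytes they would need in 16-bit floating point, which is the reduction in memory movement the abstract refers to. Because 3 does not divide 8, individual codes straddle byte boundaries, which is why unpacking needs the extra bit manipulation that a real kernel has to keep cheap.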
Taxonomy
Topics: Scientific Computing and Data Management · Natural Language Processing Techniques · Distributed and Parallel Computing Systems
