QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Taesu Kim; Jongho Lee; Daehyun Ahn; Sarang Kim; Jiwoong Choi; Minkyu Kim; Hyungjun Kim

arXiv:2402.10076·cs.LG·February 16, 2024·2 cites

TL;DR

QUICK introduces optimized CUDA kernels for quantized LLM inference that avoid shared memory bank conflicts, achieving up to 1.91x speedup over AutoAWQ kernels and up to 1.94x throughput gain on various NVIDIA GPUs.

Contribution

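The paper presents an offline interleaving scheme for quantized weight matrices, together with CUDA kernels that skip the shared memory write-back after dequantization and thereby avoid bank conflicts, improving the inference efficiency of large language models.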

Findings

01

Up to 1.91x speedup over AutoAWQ kernels on large batches

02

Up to 1.94x throughput gain on LLMs across NVIDIA GPUs

03

Effective mitigation of shared memory bank conflicts in quantized matrix multiplication
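To make the bank-conflict finding concrete, here is a minimal sketch of how conflicts arise. It assumes the usual NVIDIA shared memory model of 32 banks, each 4 bytes wide, with consecutive 32-bit words mapped to consecutive banks. The function name `conflict_degree` and the access patterns are illustrative and do not come from the paper.

```python
from collections import defaultdict

NUM_BANKS = 32          # assumed: 32 shared memory banks
BANK_WIDTH_BYTES = 4    # assumed: each bank serves one 32-bit word per cycle


def conflict_degree(byte_addresses):
    """Return the worst-case number of distinct words that hit one bank.

    1 means conflict-free; N means the warp's access is serialized N ways.
    Threads reading the same word are broadcast, so they count only once.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max(len(words) for words in words_per_bank.values())


# Byte addresses touched by the 32 threads of one warp (illustrative patterns).
unit_stride = [4 * t for t in range(32)]      # consecutive 32-bit words
stride_two = [8 * t for t in range(32)]       # every other word
stride_32 = [128 * t for t in range(32)]      # all threads land in bank 0

print(conflict_degree(unit_stride))  # 1  (conflict-free)
print(conflict_degree(stride_two))   # 2  (2-way conflict)
print(conflict_degree(stride_32))    # 32 (fully serialized)
```

A write pattern like the strided cases above is the kind of access that serializes a warp; the kernels described here aim to avoid it on the dequantized-weight path.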

Abstract

We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate up to 1.91x speedup over existing kernels of AutoAWQ on larger batches and up to 1.94x throughput gain on representative LLM models on various NVIDIA GPU devices.
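The offline interleaving idea can be sketched with a toy example, under stated assumptions. The permutation `CONSUME_ORDER` below is hypothetical; the actual pattern used by QUICK depends on the kernel's data layout and is not reproduced here. The runtime gather stands in, by analogy, for the reordering step that the abstract says is skipped.

```python
def pack_int4(values):
    """Pack eight 4-bit values into one 32-bit word, lowest nibble first."""
    assert len(values) == 8
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < 16
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Inverse of pack_int4: extract eight 4-bit values."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# Hypothetical order in which the compute step wants the eight weights.
CONSUME_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def interleave_offline(weights):
    """Reorder each group of 8 so position i holds the i-th consumed weight."""
    out = []
    for base in range(0, len(weights), 8):
        group = weights[base:base + 8]
        out.extend(group[j] for j in CONSUME_ORDER)
    return out


weights = [3, 14, 7, 0, 9, 5, 12, 1]

# Without interleaving: unpack at runtime, then shuffle into consumption order.
packed_plain = pack_int4(weights)
runtime_gather = [unpack_int4(packed_plain)[j] for j in CONSUME_ORDER]

# With offline interleaving: unpack at runtime and use the values directly.
packed_interleaved = pack_int4(interleave_offline(weights))
runtime_direct = unpack_int4(packed_interleaved)

assert runtime_gather == runtime_direct
print(runtime_direct)  # [3, 7, 9, 12, 14, 0, 5, 1]
```

The design point is that the permutation cost is paid once, when the weights are stored, rather than on every inference call.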

Code & Models

Repositories

squeezebits/quick
PyTorch · Official

Taxonomy

Topics: Neural Networks and Applications · Natural Language Processing Techniques