QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim

TL;DR
QUICK introduces optimized CUDA kernels for quantized LLM inference that avoid shared memory bank conflicts, achieving up to 1.91x speedup over AutoAWQ kernels on various NVIDIA GPUs.
Contribution
The paper presents a novel interleaving technique for quantized weights and conflict-free kernels that improve inference efficiency of large language models.
Findings
Up to 1.91x speedup over AutoAWQ kernels on large batches
Up to 1.94x throughput gain on LLMs across NVIDIA GPUs
Effective mitigation of shared memory bank conflicts in quantized matrix multiplication
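To make the bank-conflict finding concrete, here is a minimal sketch of how a conflict arises. It assumes the usual NVIDIA shared memory layout of 32 banks, each 4 bytes wide, and models one warp-wide access as a list of byte addresses. The helper name `conflict_degree` is hypothetical and is not part of QUICK or AutoAWQ.

```python
NUM_BANKS = 32   # assumed: shared memory banks per SM on NVIDIA GPUs
BANK_WIDTH = 4   # assumed: bytes per bank word


def conflict_degree(byte_addresses):
    """Return the largest number of distinct words that map to one bank.

    1 means the access is conflict-free; n means the hardware would need
    roughly n serialized passes to serve the warp.
    """
    words_per_bank = {}
    for addr in byte_addresses:
        word = addr // BANK_WIDTH
        bank = word % NUM_BANKS
        # Threads reading the same word are broadcast, so count distinct words.
        words_per_bank.setdefault(bank, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# 32 threads touching consecutive 4-byte words: one word per bank.
unit_stride = [4 * t for t in range(32)]
# 32 threads with a stride of 32 words: every access lands in bank 0.
bad_stride = [4 * 32 * t for t in range(32)]

print(conflict_degree(unit_stride))  # 1
print(conflict_degree(bad_stride))   # 32
```

The second pattern is the kind of access a kernel wants to avoid when it writes dequantized weights back to shared memory in an order that does not match the bank layout.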
Abstract
We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate up to 1.91x speedup over existing kernels of AutoAWQ on larger batches and up to 1.94x throughput gain on representative LLM models on various NVIDIA GPU devices.
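The offline interleaving idea can be illustrated with a small sketch. It assumes 4-bit weights packed eight per 32-bit word, and it uses a made-up consumption order (`CONSUME_ORDER`) in place of the real Tensor Core fragment layout, which the abstract does not spell out. It is not the authors' kernel, only a demonstration of the principle: if weights are reordered before packing, a plain sequential unpack at runtime already yields them in the order the compute step wants, so no reshuffle through shared memory is needed.

```python
def pack_int4(values):
    """Pack eight 4-bit values into one 32-bit word, lowest nibble first."""
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Sequentially extract the eight 4-bit values from a packed word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# Hypothetical order in which one thread consumes its eight weights.
# The actual pattern used by QUICK depends on the MMA instruction layout.
CONSUME_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def interleave_offline(values, order=CONSUME_ORDER):
    """Reorder weights once, ahead of time, then pack them."""
    return pack_int4([values[i] for i in order])


weights = [3, 14, 7, 0, 9, 5, 12, 1]
packed = interleave_offline(weights)

# At runtime a straight unpack gives the weights in consumption order.
runtime_view = unpack_int4(packed)
print(runtime_view)  # [3, 7, 9, 12, 14, 0, 5, 1]
```

The reordering cost is paid once when the model is prepared, which is why the abstract describes the interleaving as happening offline.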
Taxonomy
Topics: Neural Networks and Applications · Natural Language Processing Techniques
