COMET: Towards Practical W4A4KV4 LLMs Serving
Lian Liu, Haimeng Ren, Long Cheng, Zhaohui Xu, Yudong Pan, Mengdi Wang, Xiaowei Li, Yinhe Han, Ying Wang

TL;DR
This paper introduces COMET, a novel framework enabling practical 4-bit weight-activation quantization for large language models, significantly improving inference speed and efficiency on modern GPUs.
Contribution
It presents a new mixed-precision quantization algorithm, optimized GPU kernels, and an inference framework supporting 4-bit activation and weight quantization for LLMs.
Findings
Achieves 2.88x kernel speedup over cuBLAS.
Realizes 2.02x throughput improvement over TensorRT-LLM.
Supports efficient LLM inference on a single A100 GPU.
Abstract
Quantization is a widely used compression technique to reduce the overhead of serving large language models (LLMs) on terminal devices and in cloud data centers. However, prevalent quantization methods, such as 8-bit weight-activation or 4-bit weight-only quantization, achieve limited performance improvements due to poor support for low-precision (e.g., 4-bit) activation. This work, for the first time, realizes practical W4A4KV4 serving for LLMs, fully utilizing the INT4 tensor cores on modern GPUs and reducing the memory bottleneck caused by the KV cache. Specifically, we propose a novel fine-grained mixed-precision quantization algorithm (FMPQ) that compresses most activations into 4-bit with negligible accuracy loss. To support mixed-precision matrix multiplication for W4A4 and W4A8, we develop a highly optimized W4Ax kernel. Our approach introduces a novel mixed-precision data…
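The idea of keeping most activations in 4-bit while falling back to 8-bit where needed can be illustrated with a small sketch. This is not the paper's FMPQ algorithm: the block size, the magnitude-threshold rule for choosing a bit width, and all function names here are assumptions made purely for illustration.

```python
def quantize_symmetric(values, bits):
    """Symmetric uniform quantization of floats to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate float values."""
    return [x * scale for x in q]


def mixed_precision_quantize(activations, block_size=4, outlier_threshold=8.0):
    """Quantize block by block (hypothetical rule, not the paper's):
    a block containing a large-magnitude value uses 8 bits, others use 4 bits."""
    blocks = []
    for start in range(0, len(activations), block_size):
        block = activations[start:start + block_size]
        has_outlier = max(abs(v) for v in block) > outlier_threshold
        bits = 8 if has_outlier else 4
        q, scale = quantize_symmetric(block, bits)
        blocks.append({"bits": bits, "q": q, "scale": scale})
    return blocks


# Toy activation vector: the second block contains one outlier (25.0).
acts = [0.1, -0.5, 0.3, 0.7, 0.2, 25.0, -0.4, 0.6]
blocks = mixed_precision_quantize(acts)
bit_widths = [b["bits"] for b in blocks]
restored = [x for b in blocks for x in dequantize(b["q"], b["scale"])]
max_error = max(abs(a - r) for a, r in zip(acts, restored))
print(bit_widths, max_error)
```

In this toy input only the block holding the outlier is promoted to 8 bits, so the remaining values stay in 4-bit form. A real kernel would also need a packed data layout so that 4-bit and 8-bit blocks can be multiplied efficiently, which this sketch does not attempt.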
Taxonomy
Topics: Advancements in Battery Materials · MXene and MAX Phase Materials · Recycling and Waste Management Techniques
Methods: LLaMA
