Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms
Yaozheng Zhang, Wei Wang, Jie Kong, Jiehan Zhou, Xianwei Zhang, Huanqing Cui, Han Bao, Yuhai Liu

TL;DR
Opt4GPTQ is an optimization framework that improves the inference efficiency of 4-bit GPTQ-quantized large language models on heterogeneous hardware platforms, with reported throughput gains of up to 84.42%.
Contribution
This paper introduces Opt4GPTQ, a platform-aware optimization method that combines shared-memory buffering, vectorized memory loading, and inline assembly for efficient LLM inference.
Findings
Achieves up to 84.42% throughput improvement.
Maintains original model accuracy after optimization.
Demonstrates effectiveness across various models and hardware platforms.
Abstract
The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ quantized LLM inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardware-native vector half-precision addition and fused multiply-accumulate instructions.…
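To make concrete the kind of data a 4-bit quantized kernel moves and accumulates, here is a minimal Python sketch. It assumes a common layout in which eight unsigned 4-bit weights are packed into one 32-bit word (lowest nibble first) and dequantized as scale * (q - zero). The function names, the packing order, and the per-word scale and zero point are illustrative assumptions, not details taken from the paper, and the sketch does not reproduce SMB-Opt, VML-Opt, or ILA-Opt themselves.

```python
from typing import List


def pack_int4(values: List[int]) -> int:
    """Pack eight unsigned 4-bit values into one 32-bit word, lowest nibble first.

    The nibble order is an assumption made for this illustration.
    """
    assert len(values) == 8 and all(0 <= v < 16 for v in values)
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word: int) -> List[int]:
    """Recover the eight 4-bit values from a packed 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequant_dot(word: int, scale: float, zero: int, activations: List[float]) -> float:
    """Dequantize eight packed weights and accumulate their dot product
    with eight activations, one multiply-accumulate per weight."""
    acc = 0.0
    for q, a in zip(unpack_int4(word), activations):
        acc += scale * (q - zero) * a
    return acc


# Hypothetical example values.
packed = pack_int4([0, 1, 2, 3, 4, 5, 6, 7])
result = dequant_dot(packed, scale=0.5, zero=4, activations=[1.0] * 8)
```

Because one 32-bit word carries eight weights, how the packed words are fetched and how the multiply-accumulate steps are issued both matter for throughput, which is the general area the memory-loading and instruction-level strategies described in the abstract address.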
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Machine Learning in Materials Science
