Opt4GPTQ: Co-Optimizing Memory and Computation for 4-bit GPTQ Quantized LLM Inference on Heterogeneous Platforms

Yaozheng Zhang; Wei Wang; Jie Kong; Jiehan Zhou; Xianwei Zhang; Huanqing Cui; Han Bao; Yuhai Liu

arXiv:2511.19438 · cs.DC · February 6, 2026


Open Access

TL;DR

Opt4GPTQ is an optimization framework that improves the inference efficiency of 4-bit GPTQ quantized large language models on heterogeneous hardware platforms, with throughput improvements of up to 84.42%.

Contribution

This paper introduces Opt4GPTQ, a platform-aware optimization method that combines shared memory buffering, vectorized memory loading, and inline assembly for efficient inference of 4-bit GPTQ quantized LLMs.

Findings

1. Achieves up to 84.42% throughput improvement.
2. Maintains original model accuracy after optimization.
3. Demonstrates effectiveness across various models and hardware platforms.

Abstract

The increasing adoption of large language models (LLMs) on heterogeneous computing platforms poses significant challenges to achieving high inference efficiency. To address these efficiency bottlenecks across diverse platforms, this paper proposes Opt4GPTQ, a practical optimization method designed for 4-bit GPTQ quantized LLM inference on heterogeneous AI accelerators. Built upon the vLLM serving system, Opt4GPTQ integrates three platform-level optimization strategies: Shared Memory Buffering Optimization (SMB-Opt), which caches frequently accessed data in shared memory and employs single-threaded writes; Vectorized Memory Loading Optimization (VML-Opt), which utilizes vectorized memory operations for efficient data loading; and Inline Assembly Optimization (ILA-Opt), which directly leverages hardware-native vector half-precision addition and fused multiply-accumulate instructions.…
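
To make the idea behind vectorized loading of 4-bit weights more concrete, the following is a minimal host-side sketch in Python, not the paper's kernel code. It assumes the common convention that eight 4-bit values share one 32-bit word and uses a simple lowest-nibble-first order; real GPTQ kernels may use a different nibble layout. The function names (`pack_int4`, `unpack_int4`, `dequantize`, `load_vec4`) are hypothetical and chosen only for illustration. Reading four words in one call stands in for a single 128-bit vector load that fetches 32 quantized weights at once.

```python
import struct


def pack_int4(values):
    """Pack eight integers in 0..15 into one 32-bit word.

    Assumption: lowest nibble first. Actual kernels may reorder nibbles.
    """
    if len(values) != 8 or any(not 0 <= v <= 15 for v in values):
        raise ValueError("expected eight values in the range 0..15")
    word = 0
    for i, v in enumerate(values):
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Recover the eight 4-bit values stored in a 32-bit word."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


def dequantize(word, scale, zero):
    """Apply the usual affine mapping w = scale * (q - zero) to each nibble."""
    return [scale * (q - zero) for q in unpack_int4(word)]


def load_vec4(buffer, offset=0):
    """Read four consecutive little-endian 32-bit words (16 bytes) in one call.

    This mimics one 128-bit load replacing four separate 32-bit loads.
    """
    return struct.unpack_from("<4I", buffer, offset)


if __name__ == "__main__":
    word = pack_int4([0, 1, 2, 3, 4, 5, 6, 7])
    buf = struct.pack("<4I", word, word, word, word)
    for w in load_vec4(buf):
        print(dequantize(w, scale=0.5, zero=8))
```

On an accelerator, the corresponding step would issue one wide load per thread and then unpack and multiply-accumulate in registers; the sketch only shows the data layout and arithmetic, not the performance behavior.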

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Big Data and Digital Economy · Machine Learning in Materials Science