Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques

Jie Kong; Junxiang Zhang; Jiheng Xu; Yalong Li; Shouhua Zhang; Jiehan Zhou; Yuhai Liu; Peng Liang; Quan Zhang; Luohan Jiang

arXiv:2505.02351·cs.DC·July 11, 2025

TL;DR

Opt-GPTQ introduces an optimized attention mechanism combining grouping, sharing, and quantization techniques to significantly improve efficiency and scalability of large-scale models in deep learning.

Contribution

It proposes a novel combination of grouped query attention and quantization, optimizing attention mechanisms for better performance and resource utilization in large models.

Findings

01 Reduces computation time and memory usage

02 Enhances long-sequence processing capabilities

03 Improves model performance with optimized attention

Abstract

In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) that combines the Grouped Query Attention (GQA) mechanism with paging memory management. It optimizes the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors within each group. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM framework to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and…

Taxonomy

Topics: Blind Source Separation Techniques · Advanced Image Processing Techniques