MixPE: Quantization and Hardware Co-design for Efficient LLM Inference

Yu Zhang; Mingzi Wang; Lancheng Zou; Wulong Liu; Hui-Ling Zhen; Mingxuan Yuan; Bei Yu

arXiv:2411.16158·cs.LG·November 26, 2024

TL;DR

MixPE introduces a specialized hardware design that makes low-bit quantization more efficient for large language model inference by reducing dequantization overhead and replacing multipliers with shift & add operations, achieving a 2.6x speedup and a 1.4x energy reduction over state-of-the-art accelerators.

Contribution

The paper presents MixPE, a novel mixed-precision processing element that optimizes LLM inference by reducing dequantization overhead and replacing multipliers with shift & add operations.
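To make the shift & add idea concrete, here is a minimal Python sketch of multiplying an integer activation by a signed low-bit weight without a multiplier. It illustrates the general technique only; the function name `shift_add_multiply`, the 4-bit default, and the sign-magnitude handling are assumptions made for illustration and are not taken from the paper's hardware design.

```python
def shift_add_multiply(activation: int, weight: int, weight_bits: int = 4) -> int:
    """Multiply `activation` by a signed `weight_bits`-bit integer weight
    using only shifts and adds (hypothetical helper, not the paper's circuit)."""
    lo = -(1 << (weight_bits - 1))
    hi = (1 << (weight_bits - 1)) - 1
    if not lo <= weight <= hi:
        raise ValueError(f"weight {weight} does not fit in {weight_bits} signed bits")

    magnitude = abs(weight)
    acc = 0
    # One shifted copy of the activation is added per set bit of the weight,
    # so a 4-bit weight needs at most 4 shift-and-add steps.
    for bit in range(weight_bits):
        if (magnitude >> bit) & 1:
            acc += activation << bit
    return -acc if weight < 0 else acc


if __name__ == "__main__":
    print(shift_add_multiply(7, 5))    # 7 * 5
    print(shift_add_multiply(-3, -8))  # -3 * -8
```

The number of additions is bounded by the weight bit width, which is why this substitution is attractive when weights are quantized to very few bits.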

Findings

01. 2.6x speedup over state-of-the-art accelerators

02. 1.4x energy reduction

03. Effective low-bit quantization for LLM inference

Abstract

Transformer-based large language models (LLMs) have achieved remarkable success as model sizes continue to grow, yet their deployment remains challenging due to significant computational and memory demands. Quantization has emerged as a promising solution, and state-of-the-art quantization algorithms for LLMs introduce the need for mixed-precision matrix multiplication (mpGEMM), where lower-precision weights are multiplied with higher-precision activations. Despite its benefits, current hardware accelerators such as GPUs and TPUs lack native support for efficient mpGEMM, leading to inefficient dequantization operations in the main sequential loop. To address this limitation, we introduce MixPE, a specialized mixed-precision processing element designed for efficient low-bit quantization in LLM inference. MixPE leverages two key innovations to minimize dequantization overhead and unlock…
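The dequantization cost mentioned in the abstract can be illustrated with a small Python sketch of a mixed-precision dot product. The abstract is truncated above, so this is not a description of MixPE's two innovations; it only shows, under assumed names and an assumed uniform quantization scheme (`scale`, `zero_point`, one group of weights sharing both), how dequantizing inside the inner loop compares with applying the scale once after integer accumulation.

```python
def dot_dequant_in_loop(q_weights, activations, scale, zero_point):
    """Dequantize every weight inside the accumulation loop."""
    total = 0.0
    for q, a in zip(q_weights, activations):
        w = scale * (q - zero_point)  # per-element dequantization
        total += w * a
    return total


def dot_dequant_after_loop(q_weights, activations, scale, zero_point):
    """Accumulate in integers, then apply scale and zero point once.

    Uses the identity
        sum(scale * (q - z) * a) == scale * (sum(q * a) - z * sum(a)).
    """
    int_acc = 0
    act_sum = 0
    for q, a in zip(q_weights, activations):
        int_acc += q * a  # low-bit weight times integer activation
        act_sum += a
    return scale * (int_acc - zero_point * act_sum)


if __name__ == "__main__":
    # Hypothetical 4-bit unsigned weight codes and 8-bit integer activations.
    q_weights = [0, 15, 8, 3]
    activations = [10, -4, 7, 100]
    print(dot_dequant_in_loop(q_weights, activations, 0.5, 8))
    print(dot_dequant_after_loop(q_weights, activations, 0.5, 8))
```

Both functions compute the same value; the second keeps the loop body in integer arithmetic and pays for the scale only once per group of weights.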

Taxonomy

Topics: Natural Language Processing Techniques