FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

Yi Zhang; Fei Yang; Shuang Peng; Fangyu Wang; Aimin Pan

arXiv:2402.17985·cs.LG·February 29, 2024·1 cites

Open Access

TL;DR

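FlattenQuant is a per-tensor quantization method for large language models. It flattens the large channels of a tensor to reduce the tensor's maximum value, which allows low-bit per-tensor quantization and eases compute-bound inference at large batch sizes or long sequences, giving faster inference and lower memory usage.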

Contribution

The paper introduces FlattenQuant, a new per-tensor quantization technique that significantly improves inference speed and memory efficiency for LLMs with minimal accuracy loss.

Findings

01 Achieves up to 2× speedup in LLM inference

02 Reduces memory usage by 2.3×

03 Maintains negligible accuracy loss

Abstract

Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, inference latency and large GPU memory consumption restrict their deployment performance. Recently, there have been some efficient attempts to quantize LLMs, yet inference with a large batch size or a long sequence remains compute-bound. Fine-grained quantization methods have proven effective at low-bit quantization for LLMs, but they require the FP16 data type for linear layer computations, which is time-consuming with a large batch size or a long sequence. In this paper, we introduce a method called FlattenQuant, which significantly reduces the maximum value of a tensor by flattening its large channels, achieving low-bit per-tensor quantization with minimal accuracy loss. Our experiments show that…
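The flattening idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `flatten_channels` and `quantize_per_tensor`, the equal-split rule (dividing an outlier channel into `k` equal pieces), the threshold value, and the toy data are all assumptions made for illustration. The sketch only shows why splitting a large activation channel and repeating the matching weight row leaves the matrix product unchanged while lowering the tensor's maximum value, which is what makes a single per-tensor scale workable.

```python
import numpy as np

def flatten_channels(x, w, threshold):
    """Split every input channel of x whose max |value| exceeds threshold.

    x: (tokens, channels) activations; w: (channels, out_features) weights.
    Returns (x_flat, w_flat) such that x_flat @ w_flat equals x @ w.
    The equal-split rule used here is an illustrative assumption.
    """
    x_cols, w_rows = [], []
    for c in range(x.shape[1]):
        col = x[:, c]
        peak = np.abs(col).max()
        # Number of pieces needed so that each piece stays within the threshold.
        k = max(1, int(np.ceil(peak / threshold)))
        for _ in range(k):
            x_cols.append(col / k)  # each piece carries 1/k of the channel
            w_rows.append(w[c])     # the weight row is repeated unchanged
    return np.stack(x_cols, axis=1), np.stack(w_rows, axis=0)

def quantize_per_tensor(t, bits=8):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    q = np.clip(np.round(t / scale), -qmax, qmax).astype(np.int32)
    return q, scale

# Toy data (hypothetical): one activation channel is a large outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 3] *= 50.0
w = rng.normal(size=(8, 5))

x_flat, w_flat = flatten_channels(x, w, threshold=4.0)
q_x, x_scale = quantize_per_tensor(x_flat, bits=8)
```

After flattening, the largest magnitude in `x_flat` is bounded by the threshold, so the single scale returned by `quantize_per_tensor` no longer has to stretch to cover the outlier channel. The cost is a wider matrix: `x_flat` and `w_flat` gain one extra channel for every additional piece.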

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Natural Language Processing Techniques

Methods: Linear Layer