ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs
Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

TL;DR
ELUTQ introduces a novel hierarchical linear quantization method that improves low-bit weight quantization accuracy and efficiency for edge deployment of large language models, reducing hardware requirements and enabling fast inference.
Contribution
The paper proposes Hierarchical Linear Quantization (HLQ), a new quantization format that captures weight statistics better than uniform quantization and eliminates dequantization overhead, together with an optimized pipeline for quantizing and deploying large-scale models.
Findings
HLQ significantly improves low-bit quantization accuracy.
ELUTQ enables quantization of LLaMA 3.1-70B with limited hardware.
2-bit LLaMA 3.1-8B achieves 1.5x speedup over AWQ.
Abstract
Weight quantization effectively reduces memory consumption and enables the deployment of Large Language Models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which suffers from poor weight-distribution fitting and high dequantization overhead under low-bit settings. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights and to eliminate dequantization overhead using bit-serial LUT-based GEMM operations. HLQ significantly improves model accuracy under low-bit settings and achieves performance comparable to QAT methods without any retraining of the weights. Moreover, an optimized quantization pipeline is integrated into ELUTQ, enabling it to complete the quantization of LLaMA…
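To make the idea of bit-serial LUT-based computation concrete, here is a minimal NumPy sketch of a matrix-vector product that never dequantizes the weight codes. It is not the paper's HLQ format or kernel, which this summary does not describe. It assumes plain per-row uniform 2-bit quantization as a stand-in, and the function names `quantize_uniform` and `lut_gemv` and the group size `g` are hypothetical choices for illustration.

```python
import numpy as np

def quantize_uniform(w, bits=2):
    """Per-row asymmetric uniform quantization: w is approximated by scale * q + zero.
    Assumption: a generic baseline, not the HLQ format from the paper."""
    qmax = 2**bits - 1
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - wmin) / scale), 0, qmax).astype(np.uint8)
    return q, scale, wmin

def lut_gemv(q, scale, zero, x, bits=2, g=4):
    """Bit-serial LUT-based matrix-vector product (illustrative sketch).

    Activations are split into groups of g. For each group, a table stores
    the partial sum for every possible g-bit pattern. Each bit plane of the
    weight codes is then used only as a table index.
    """
    rows, cols = q.shape
    assert cols % g == 0, "column count must be a multiple of the group size"
    n_groups = cols // g

    # patterns[p, j] is bit j of the integer p, for p in [0, 2^g).
    patterns = ((np.arange(2**g)[:, None] >> np.arange(g)) & 1).astype(x.dtype)
    # lut[k, p] = sum of the activations in group k selected by pattern p.
    lut = x.reshape(n_groups, g) @ patterns.T

    bit_weights = 1 << np.arange(g)
    acc = np.zeros(rows, dtype=x.dtype)
    for b in range(bits):
        plane = (q >> b) & 1  # one bit plane of the weight codes
        # Pack each group of g bits into a table index.
        idx = (plane.reshape(rows, n_groups, g) * bit_weights).sum(axis=2)
        # Look up per-group partial sums, then weight the plane by 2^b.
        acc += (2**b) * lut[np.arange(n_groups), idx].sum(axis=1)

    # Apply the affine terms once per row: (scale * q + zero) @ x.
    return scale[:, 0] * acc + zero[:, 0] * x.sum()

# Usage: the LUT result should match the dequantize-then-multiply reference.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
q, scale, zero = quantize_uniform(w, bits=2)
reference = (scale * q + zero) @ x
print(np.allclose(lut_gemv(q, scale, zero, x), reference))
```

The table is built once per activation vector and reused by every weight row and every bit plane, so the inner loop consists of lookups and additions, with the scale and zero point applied once per row at the end. A real edge kernel would pack the bit planes ahead of time and keep the tables in registers or cache; those details are outside this sketch.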
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Natural Language Processing Techniques · Advanced Neural Network Applications
