Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)
Mahmoud Elgenedy

TL;DR
This paper introduces Power-of-Two Quantization-Aware-Training (PoT-QAT) for large language models, significantly reducing memory and computation requirements while maintaining performance, enabling efficient deployment on edge devices.
Contribution
The paper proposes a novel PoT quantization method combined with QAT to improve LLM efficiency, achieving substantial memory savings and faster inference with minimal performance loss.
Findings
Memory saving of approximately 87.5%
Inference speed increased by 3-10x
Perplexity of the PoT-quantized model improved by 66% after quantization-aware training
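The 87.5% memory figure above is consistent with shrinking each weight from a 32-bit float to a 4-bit code. The summary does not state the bit widths, so the 32 and 4 below are assumptions, and `memory_saving` is a hypothetical helper used only to show the arithmetic:

```python
def memory_saving(baseline_bits: int, quantized_bits: int) -> float:
    """Fraction of weight storage saved when each weight shrinks
    from baseline_bits to quantized_bits."""
    return 1.0 - quantized_bits / baseline_bits


# Assumed widths: 32-bit float baseline, 4-bit PoT code (sign plus exponent).
saving = memory_saving(32, 4)
print(f"{saving:.1%}")  # 87.5%
```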
Abstract
In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than a trillion in later versions. This raises a significant challenge for implementation, especially on edge devices. Unlike cloud computing, edge devices have very limited memory and processing power, which necessitates novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that restricts values to powers of two (PoT). This saves a large amount of memory, as only exponents need to be stored; more importantly, it significantly reduces processing power by replacing costly multiplication with low-cost bit shifting. To overcome the performance loss due to this strict quantization, we investigate Quantization Aware…
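The two ideas in the abstract (storing only exponents, and replacing multiplication with bit shifts) can be illustrated with a minimal sketch. This is not the paper's implementation. The function names, the exponent range of -7 to 0, rounding in the log domain, and the separate sign array are all assumptions made for illustration.

```python
import numpy as np


def pot_quantize(w, exp_min=-7, exp_max=0):
    """Round each weight to the nearest signed power of two (nearest in
    the log2 domain). Returns (sign, exponent) arrays; the exponent range
    is an assumption, not taken from the paper."""
    w = np.asarray(w, dtype=np.float64)
    sign = np.sign(w).astype(np.int8)          # -1, 0, or +1
    mag = np.abs(w)
    # Substitute a placeholder for zeros so log2 is defined; their sign
    # is 0, so they still dequantize to 0.
    safe = np.where(mag > 0, mag, 2.0 ** exp_min)
    exp = np.clip(np.round(np.log2(safe)), exp_min, exp_max).astype(np.int8)
    return sign, exp


def pot_dequantize(sign, exp):
    """Rebuild floating-point weights from the stored sign and exponent."""
    return sign * np.exp2(exp.astype(np.float64))


def shift_multiply(x_int: int, sign: int, exp: int) -> int:
    """Multiply an integer (fixed-point) activation by sign * 2**exp using
    only a shift. Right shifts floor the result, so negative exponents
    can lose low-order bits."""
    shifted = x_int << exp if exp >= 0 else x_int >> -exp
    return sign * shifted


sign, exp = pot_quantize([0.3, -0.06, 0.0, 1.7])
print(pot_dequantize(sign, exp).tolist())  # [0.25, -0.0625, 0.0, 1.0]
print(shift_multiply(96, -1, -3))          # -12, same as 96 * -0.125
```

Only `sign` and `exp` would need to be stored per weight, which is where the memory saving comes from. A real 4-bit format would also need a code for zero and a packing scheme, both omitted here.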
Taxonomy
Topics: Big Data and Digital Economy · Machine Learning and Data Classification · Natural Language Processing Techniques
