The Uniqueness of LLaMA3-70B Series with Per-Channel Quantization
Minghai Qin

TL;DR
This paper investigates why LLaMA3-70B models are uniquely vulnerable to W8A8 per-channel post-training quantization and proposes two strategies that mitigate the accuracy loss, restoring accuracy to FP16 levels.
Contribution
The paper identifies weight distribution as the key factor behind LLaMA3-70B's quantization vulnerability and introduces two novel quantization strategies to address this issue.
Findings
LLaMA3-70B shows significant accuracy degradation under W8A8 per-channel post-training quantization.
Other models, such as LLaMA2 and LLaMA3/3.1-8B, remain robust under W8A8.
The proposed strategies restore LLaMA3-70B accuracy to FP16 levels.
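To make the link between weight distribution and per-channel quantization concrete, here is a minimal sketch of symmetric per-channel INT8 weight quantization in plain Python. It is not the paper's code and does not reproduce either of its two strategies; the function names and the toy weights (including the single 5.0 outlier) are hypothetical, chosen only to show how one large weight in a channel stretches that channel's scale and raises the rounding error on the remaining small weights. Activation quantization (the A8 half of W8A8) is not modeled.

```python
# Illustrative sketch only: symmetric per-channel INT8 weight quantization.
# All names and the toy data are hypothetical, not taken from the paper.

def quantize_per_channel(weights, n_bits=8):
    """Quantize each row (one output channel) with its own scale."""
    qmax = 2 ** (n_bits - 1) - 1  # 127 for INT8
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        # Round to the nearest integer level and clamp to the signed range.
        q_rows.append([max(-qmax, min(qmax, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Map integer levels back to floats using each channel's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

def mean_abs_error(original, restored):
    """Mean absolute difference between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# Toy channel of 17 small weights in [-0.08, 0.08].
small = [0.01 * i for i in range(-8, 9)]
clean_channel = list(small)
# Same weights plus one hypothetical outlier that dominates the scale.
outlier_channel = small + [5.0]

q, scales = quantize_per_channel([clean_channel, outlier_channel])
restored = dequantize(q, scales)

# Compare the error on the shared small weights only.
err_clean = mean_abs_error(small, restored[0][:len(small)])
err_outlier = mean_abs_error(small, restored[1][:len(small)])
print(f"scale without outlier: {scales[0]:.6f}, error: {err_clean:.6f}")
print(f"scale with outlier:    {scales[1]:.6f}, error: {err_outlier:.6f}")
```

In this toy setup the outlier channel's scale is set by the single large weight, so the small weights collapse onto only a few integer levels and their reconstruction error grows accordingly.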
Abstract
We have observed a distinctive quantization-related behavior in the LLaMA3/3.1-70B models that is absent in both the LLaMA2-70B and LLaMA3/3.1/3.2-1B/3B/8B/405B models. Quantization is a crucial technique for deploying large language models (LLMs) efficiently. The impact of W8A8 post-training quantization on model accuracy, especially on the recently released LLaMA3/3.1 model series, remains contentious. In this paper, we explore three key questions: What makes the LLaMA3-70B model series uniquely vulnerable to quantization? Why is this the case? And how can the issue be addressed? We empirically investigate multiple LLMs featured on an open LLM leaderboard, discovering that the LLaMA3-70B model series have a unique accuracy degradation behavior with W8A8 per-channel post-training quantization. In contrast, other model series such as LLaMA2, LLaMA3/3.1-8B, LLaMA3.2, Qwen, Mixtral,…
Taxonomy
Topics: Medical Imaging Techniques and Applications
Methods: Linear Layer · Adam · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention · Byte Pair Encoding · Absolute Position Encodings
