Mitigating Quantization Errors Due to Activation Spikes in GLU-Based LLMs
Jaewoo Yang, Hayun Kim, Younghoon Kim

TL;DR
This paper identifies activation spikes, activations of excessive magnitude in GLU-based LLMs, that cause severe local quantization errors and degrade the quantized model. It proposes two empirical methods to mitigate these spikes, improving quantization accuracy for a range of modern LLMs.
Contribution
The paper reveals a systematic pattern of activation spikes in GLU variants and introduces two empirical methods, Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), to mitigate these spikes during quantization.
Findings
Activation spikes occur in the FFN of specific layers, mainly the early and late ones.
QFeM and QFeP significantly improve the accuracy of quantized GLU-based LLMs.
Both methods outperform existing mitigation techniques such as SmoothQuant.
Abstract
Modern large language models (LLMs) have established state-of-the-art performance through architectural improvements, but still require significant computational cost for inference. In an effort to reduce the inference cost, post-training quantization (PTQ) has become a popular approach, quantizing weights and activations to lower precision, such as INT8. In this paper, we reveal the challenges of activation quantization in GLU variants, which are widely used in the feed-forward network (FFN) of modern LLMs, such as the LLaMA family. The problem is that severe local quantization errors, caused by excessive magnitudes of activation in GLU variants, significantly degrade the performance of the quantized LLM. We denote these activations as activation spikes. Our further observations provide a systematic pattern of activation spikes: 1) The activation spikes occur in the FFN of specific layers,…
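To make the failure mode concrete, the sketch below shows why one excessively large activation can hurt INT8 activation quantization. It is an illustration under stated assumptions, not the paper's code: it assumes symmetric per-tensor quantization, synthetic Gaussian activations, and a hypothetical spike magnitude of 1000. The helper name `quantize_dequantize_int8` is invented for this example.

```python
import numpy as np

def quantize_dequantize_int8(x):
    """Symmetric per-tensor INT8 fake quantization (assumed scheme):
    round to the integer grid, then map back to float."""
    scale = np.max(np.abs(x)) / 127.0           # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)

# Synthetic stand-in for one layer's activations (not real model data).
acts = rng.normal(0.0, 1.0, size=4096).astype(np.float32)

# Same tensor with one hypothetical spike injected at position 0.
spiked = acts.copy()
spiked[0] = 1000.0

# Mean squared quantization error, measured on the non-spike entries only.
err_plain = np.mean((quantize_dequantize_int8(acts)[1:] - acts[1:]) ** 2)
err_spiked = np.mean((quantize_dequantize_int8(spiked)[1:] - spiked[1:]) ** 2)

print(f"MSE without spike: {err_plain:.2e}")
print(f"MSE with spike:    {err_spiked:.2e}")
```

Because the scale is set by the largest magnitude in the tensor, the spike stretches the quantization grid so far that most ordinary activations round to zero. This is the kind of local quantization error the abstract describes.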
Taxonomy
Topics: Advanced Data Storage Technologies · CCD and CMOS Imaging Sensors · Advancements in Photolithography Techniques
Methods: LLaMA
