Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models
Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Kyunggeun Lee, Jun Ma, Harris Teague

TL;DR
This paper introduces a novel fine-tuning method called KD-QAT for 4-bit quantized large language models, enhancing their performance on edge devices by analyzing and stabilizing gradient propagation during training.
Contribution
It provides new insights into the stability of knowledge distillation-based quantization and proposes ov-freeze, a simple technique to improve low-bit quantized LLM performance.
Findings
Less than 0.7% accuracy loss on Commonsense Reasoning benchmarks.
Stable training achieved with ov-freeze technique.
Near floating point performance at 4-bit quantization.
Abstract
Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision, respectively. However, their slow inference and high computation and memory requirements make it challenging to deploy them on edge devices. In this study, we propose a lightweight quantization-aware fine-tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight-quantized LLMs using commonly available datasets, targeting a popular language use case: on-device chat applications. To improve this fine-tuning paradigm, as our main contributions, we provide insights into the stability of KD-QAT by empirically studying gradient propagation during training, to better understand the vulnerabilities of KD-QAT-based approaches to low-bit quantization errors. Based on our insights, we propose ov-freeze, a simple technique to…
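The two ingredients named in the abstract, 4-bit weight quantization and knowledge distillation, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the symmetric per-row quantizer, the KL-divergence distillation loss, the toy single-layer "model", and the helper names `fake_quantize_4bit`, `kd_loss`, and `linear`. The sketch does not implement ov-freeze, whose details are not given in this summary.

```python
import math

def fake_quantize_4bit(weights):
    """Symmetric 4-bit fake quantization (illustrative assumption):
    snap each weight to one of 16 integer levels, then map back to floats."""
    qmin, qmax = -8, 7
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax  # one scale shared by the whole list
    out = []
    for w in weights:
        q = max(qmin, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the output distribution."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def linear(weight_rows, x):
    """Toy one-layer model: logits = W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

# Floating-point teacher weights (hypothetical values).
teacher_w = [[0.9, -0.35], [0.02, -0.7], [0.4, 0.15]]
# Student uses the same weights after 4-bit fake quantization, one scale per row.
student_w = [fake_quantize_4bit(row) for row in teacher_w]

x = [1.0, -2.0]
loss = kd_loss(linear(teacher_w, x), linear(student_w, x))
print("distillation loss between teacher and 4-bit student:", loss)
```

In a KD-QAT setup of this general shape, the loss would be minimized with respect to the student's underlying weights so that the quantized model's outputs track the floating-point teacher's; the training loop and gradient handling through the rounding step are omitted here.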
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Diffusion · Knowledge Distillation · Attentive Walk-Aggregating Graph Neural Network
