Oh! We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models

Kartikeya Bhardwaj; Nilesh Prasad Pandey; Sweta Priyadarshi; Kyunggeun Lee; Jun Ma; Harris Teague

arXiv:2403.18159·cs.LG·March 29, 2024·1 cites

TL;DR

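This paper proposes KD-QAT, a lightweight quantization-aware fine-tuning technique based on knowledge distillation, for 4-bit weight-quantized large language models. It studies gradient propagation during training to understand and stabilize the process, with on-device chat applications on edge devices as the target use case.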

Contribution

The paper provides insights into the stability of knowledge distillation-based quantization-aware fine-tuning (KD-QAT) and proposes ov-freeze, a simple technique to improve the performance of low-bit quantized LLMs.

Findings

01 Less than 0.7% accuracy loss on Commonsense Reasoning benchmarks.

02 Stable training achieved with the ov-freeze technique.

03 Near floating-point performance at 4-bit quantization.

Abstract

Large generative models such as large language models (LLMs) and diffusion models have revolutionized the fields of NLP and computer vision, respectively. However, their slow inference and high computation and memory requirements make it challenging to deploy them on edge devices. In this study, we propose a lightweight quantization-aware fine-tuning technique using knowledge distillation (KD-QAT) to improve the performance of 4-bit weight-quantized LLMs using commonly available datasets to realize a popular language use case, on-device chat applications. To improve this paradigm of fine-tuning, as main contributions, we provide insights into the stability of KD-QAT by empirically studying the gradient propagation during training to better understand the vulnerabilities of KD-QAT-based approaches to low-bit quantization errors. Based on our insights, we propose ov-freeze, a simple technique to…
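To make the two ingredients named in the abstract concrete, here is a minimal sketch of 4-bit weight "fake quantization" and a distillation loss between a floating-point teacher and a quantized student. Everything in it is an illustrative assumption rather than the authors' implementation: the symmetric per-tensor quantizer, the KL-divergence loss, and the helper names `fake_quantize_4bit`, `kd_loss`, and `linear` are hypothetical, and ov-freeze itself is not shown because this summary does not define it.

```python
import math

def fake_quantize_4bit(weights):
    """Assumed scheme: symmetric per-tensor quantization to the int4 grid [-8, 7],
    followed by dequantization back to floats."""
    qmin, qmax = -8, 7
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest grid point, clamp, then rescale.
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) over softened output distributions
    (one common choice of distillation loss, assumed here)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def linear(weight_rows, x):
    """Toy linear layer: one output logit per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

# Teacher uses float weights; the student uses the same weights after 4-bit fake quantization.
teacher_w = [[0.31, -0.72, 0.05], [-0.44, 0.13, 0.90], [0.27, 0.61, -0.38]]
flat = [w for row in teacher_w for w in row]
flat_q = fake_quantize_4bit(flat)
student_w = [flat_q[i:i + 3] for i in range(0, len(flat_q), 3)]

x = [1.0, -2.0, 0.5]
loss = kd_loss(linear(teacher_w, x), linear(student_w, x))
print(f"distillation loss from 4-bit weight error: {loss:.6f}")
```

In a quantization-aware fine-tuning loop, a loss of this kind would be minimized with respect to the student's parameters while the teacher stays fixed; the sketch only evaluates it once to show how quantization error in the weights surfaces as a mismatch in the output distributions.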

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Diffusion · Knowledge Distillation · Attentive Walk-Aggregating Graph Neural Network