Interactions Across Blocks in Post-Training Quantization of Large Language Models

Khasmamad Shabanovi; Lukas Wiest; Vladimir Golkov; Daniel Cremers; Thomas Pfeil

arXiv:2411.03934·cs.LG·November 7, 2024

Open Access

TL;DR

This paper investigates how interactions across blocks affect post-training quantization of large language models. It proposes two multi-block fine-tuning strategies and reports that whether they improve quantization quality depends on the model.

Contribution

It introduces two novel multi-block fine-tuning methods that consider inter-block interactions, moving beyond traditional single-block quantization approaches.

Findings

01

The multi-block fine-tuning methods show significant benefits for some models.

02

No impact is observed on other models.

03

The results highlight the importance of model-specific quantization strategies.

Abstract

Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent and ignoring the knowledge of subsequent substructures as well as the task loss. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized…
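The difference between a single-block (local) objective and an objective that spans several blocks can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the round-to-nearest quantizer, the linear-plus-ReLU `block`, the helper names, and the random data are all assumptions made for illustration, and no fine-tuning is performed. It only shows where the reconstruction error is measured in each case.

```python
import numpy as np

def quantize_weights(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def block(x, w):
    """Hypothetical stand-in for a transformer block: linear map followed by ReLU."""
    return np.maximum(x @ w, 0.0)

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))    # toy calibration inputs
w1 = rng.normal(size=(16, 16))   # full-precision weights of block 1
w2 = rng.normal(size=(16, 16))   # full-precision weights of block 2
w1_q, w2_q = quantize_weights(w1), quantize_weights(w2)

# Single-block objective: each block is compared against its full-precision
# counterpart on the same full-precision input, as if blocks were independent.
local_err_1 = mse(block(x, w1_q), block(x, w1))
h = block(x, w1)
local_err_2 = mse(block(h, w2_q), block(h, w2))

# Two-block objective: the error is measured at the output of block 2 with the
# quantized block 1 feeding it, so interactions between the blocks are included.
joint_err = mse(block(block(x, w1_q), w2_q), block(block(x, w1), w2))

print(local_err_1, local_err_2, joint_err)
```

In a fine-tuning setting, the quantized weights would be optimized to reduce one of these errors; the sketch only evaluates them for fixed round-to-nearest weights.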

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques