Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization

Seungwoo Son; Wonpyo Park; Woohyun Han; Kyuyeun Kim; Jaeho Lee

arXiv:2406.12016·cs.LG·October 7, 2024

TL;DR

This paper introduces CushionCache, a method that mitigates activation outliers in large language models by finding and tuning prompt token sequences, enabling more effective per-tensor activation quantization without extra overhead.

Contribution

The paper proposes CushionCache, a novel prefixing technique that reduces activation outliers, improving quantization performance for large language models.

Findings

01 Significantly improves W8A8 quantization accuracy.

02 Effectively reduces activation outliers across various models.

03 Seamlessly integrates with existing quantization methods.
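
To make the first two findings concrete, the sketch below illustrates why a single activation outlier hurts per-tensor 8-bit quantization (the "A8" in W8A8, i.e. 8-bit activations). It is a toy illustration in plain Python, not the paper's code: the helper names and the synthetic activation values are assumptions chosen for this example, and symmetric round-to-nearest quantization is assumed.

```python
def quantize_per_tensor(values, num_bits=8):
    """Symmetric per-tensor quantization: one scale shared by every value."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [x * scale for x in q]


def mean_abs_error(values, num_bits=8):
    """Average absolute round-trip error of quantize -> dequantize."""
    q, scale = quantize_per_tensor(values, num_bits)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Synthetic activations (assumed for illustration): 101 values in [-1, 1].
typical = [i / 50 - 1.0 for i in range(101)]
# The same tensor plus one large outlier, which dictates the shared scale.
with_outlier = typical + [500.0]

err_clean = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(f"mean abs error without outlier: {err_clean:.5f}")
print(f"mean abs error with outlier:    {err_outlier:.5f}")
```

In this toy tensor the outlier stretches the shared scale so far that every typical value rounds to zero, which is the failure mode that outlier mitigation aims to avoid.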

Abstract

Despite recent advances in LLM quantization, activation quantization remains challenging due to activation outliers. Conventional remedies, e.g., mixing precisions for different channels, introduce extra overhead and reduce the speedup. In this work, we develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens. Precisely, we propose a method to find a key-value cache, coined CushionCache, which mitigates outliers in subsequent tokens when inserted as a prefix. CushionCache works in two steps: First, we greedily search for a prompt token sequence that minimizes the maximum activation values in subsequent tokens. Then, we further tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly. The proposed method successfully addresses activation…
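
The first step described in the abstract, a greedy search over prompt tokens, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `greedy_prefix_search`, `toy_peak_activation`, the `SINK_STRENGTH` table, the stopping rule, and the calibration samples are all hypothetical. In practice the scoring function would run the model with the candidate prefix and measure the maximum activation magnitude on the following tokens. The second step (tuning the cache) is not shown.

```python
from typing import Callable, List, Sequence


def greedy_prefix_search(
    vocab: Sequence[int],
    calibration: Sequence[List[int]],
    peak_activation: Callable[[List[int], List[int]], float],
    max_len: int = 4,
    min_gain: float = 1e-6,
) -> List[int]:
    """Grow a prefix one token at a time, always taking the token that most
    lowers the average peak activation over the calibration samples."""

    def objective(prefix: List[int]) -> float:
        total = sum(peak_activation(prefix, sample) for sample in calibration)
        return total / len(calibration)

    prefix: List[int] = []
    best = objective(prefix)
    for _ in range(max_len):
        # Score every candidate extension and keep the lowest objective.
        score, token = min((objective(prefix + [t]), t) for t in vocab)
        if best - score < min_gain:  # no meaningful improvement: stop early
            break
        prefix.append(token)
        best = score
    return prefix


# Hypothetical stand-in for a model: each prefix token has a made-up
# "sink strength" that damps the peak activation of later tokens.
SINK_STRENGTH = {0: 0.0, 1: 0.5, 2: 3.0, 3: 1.0}


def toy_peak_activation(prefix: List[int], sample: List[int]) -> float:
    # Repeating a token gives no extra damping in this toy model.
    damping = 1.0 + sum(SINK_STRENGTH[t] for t in set(prefix))
    return 10.0 * max(sample) / damping


calibration_samples = [[5, 9, 7], [3, 4, 8]]
found_prefix = greedy_prefix_search(
    vocab=[0, 1, 2, 3],
    calibration=calibration_samples,
    peak_activation=toy_peak_activation,
)
print(found_prefix)
```

With the toy scoring function, the search picks tokens in order of decreasing sink strength and stops once no candidate lowers the objective further.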


Code & Models

Repositories

ruikangliu/IntactKV
pytorch

Videos

Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training