QET: Enhancing Quantized LLM Parameters and KV cache Compression through Element Substitution and Residual Clustering

Yanshu Wang; Wang Li; Zhaoqian Yao; Tong Yang

arXiv:2407.03637·cs.LG·September 9, 2024

TL;DR

This paper introduces QET, a matrix quantization method that reduces quantization error and improves compression of LLM parameters and KV caches through element substitution and residual clustering.

Contribution

We formulate the Quantization Error Minimization (QEM) problem for matrix quantization, design the QET algorithm to exploit the local orderliness of matrix elements, and propose optimizations that improve its accuracy and speed.

Findings

01

QET reduces MSE to 5.05% of that of the best existing method on LLM datasets.

02

QET achieves significant error reduction on K cache and V cache.

03

Optimizations improve both quantization accuracy and computational efficiency.

Abstract

Matrix quantization represents matrix elements in a more space-efficient form to reduce storage, and dequantization restores an approximation of the original matrix for use. We formulate the Quantization Error Minimization (QEM) problem as minimizing the distance between a matrix before and after quantization, under the condition that the quantized matrix occupies the same memory space. Matrix quantization is crucial in various applications, including Large Language Model (LLM) weight quantization, vector databases, KV cache quantization, graph compression, and image compression. Recent advancements in LLMs, such as GPT-4 and BERT, have highlighted the importance of matrix compression due to the large size of parameters and KV cache, which are stored as matrices. We propose Quantum Entanglement Trees (QET) to address the QEM problem by leveraging the local orderliness of matrix…
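
To make the QEM objective concrete, the following is a minimal sketch of the quantity being minimized: the mean squared error (MSE) between a matrix and its quantize-then-dequantize reconstruction at a fixed bit budget. It uses plain min/max uniform quantization as a stand-in baseline and is not the QET algorithm, whose details are not given on this page. The function names (`quantize_uniform`, `dequantize_uniform`, `mse`) and the random test matrix are assumptions made for illustration.

```python
import random


def quantize_uniform(matrix, bits):
    """Map each element to an integer code in [0, 2**bits - 1] via min/max scaling.

    Baseline uniform quantizer used only for illustration (not QET).
    """
    flat = [x for row in matrix for x in row]
    lo, hi = min(flat), max(flat)
    levels = (1 << bits) - 1
    # Guard against a constant matrix, where hi == lo would give a zero step.
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [[round((x - lo) / step) for x in row] for row in matrix]
    return codes, lo, step


def dequantize_uniform(codes, lo, step):
    """Reconstruct approximate values from integer codes."""
    return [[lo + c * step for c in row] for row in codes]


def mse(a, b):
    """Mean squared error between two equally shaped matrices."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 64x64 "weight" matrix drawn from a standard normal.
    weights = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(64)]
    for bits in (2, 4, 8):
        codes, lo, step = quantize_uniform(weights, bits)
        restored = dequantize_uniform(codes, lo, step)
        print(f"{bits}-bit codes: MSE = {mse(weights, restored):.6f}")
```

Under this framing, two quantizers are compared by their MSE at the same storage cost, so a method that reaches a lower MSE with the same number of bits per element (plus any side information such as `lo` and `step`) is the better solution to QEM.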

Taxonomy

Topics: Electromagnetic Simulation and Numerical Methods · Algorithms and Data Compression · Parallel Computing and Optimization Techniques

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · WordPiece · Residual Connection · Layer Normalization · Multi-Head Attention · Linear Warmup With Linear Decay · Weight Decay