Pyramid Vector Quantization for LLMs

Tycho F. A. van der Ouderaa; Maximilian L. Croci; Agrin Hilmkil; James Hensman

arXiv:2410.16926·cs.LG·December 5, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces Pyramid Vector Quantization (PVQ) for large language models, leveraging spherical geometry and Hessian information to achieve state-of-the-art compression with minimal accuracy loss.

Contribution

It develops a practical PVQ algorithm combined with scale and Hessian-based quantization, enabling efficient, high-quality model compression without explicit codebooks.

Findings

01

Quantized Llama-3 70B to 3.25 bits per weight while retaining 98% of the original model's accuracy on downstream tasks

02

Achieved state-of-the-art trade-off between compression and performance

03

Extended PVQ to incorporate Hessian information for error minimization

Abstract

Recent works on compression of large language models (LLMs) using quantization considered reparameterizing the architecture such that weights are distributed on the sphere. This demonstrably improves the ability to quantize by increasing the mathematical notion of coherence, resulting in fewer weight outliers without affecting the network output. In this work, we aim to further exploit this spherical geometry of the weights when performing quantization by considering Pyramid Vector Quantization (PVQ) for large language models. Arranging points evenly on the sphere is notoriously difficult, especially in high dimensions, and in case approximate solutions exist, representing points explicitly in a codebook is typically not feasible due to its additional memory cost. Instead, PVQ uses a fixed integer lattice on the sphere by projecting points onto the 1-sphere, which allows for…
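The integer-lattice idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pvq_quantize` and `pvq_count` are hypothetical, and the largest-remainder rounding used here is a simple heuristic assumed for illustration rather than the search procedure from the paper. The sketch scales a vector so that its absolute values sum to an integer K, rounds it to an integer vector with the same L1 norm, and counts the lattice points using the standard PVQ recurrence N(D, K) = N(D-1, K) + N(D-1, K-1) + N(D, K-1), which gives the bits needed to index a point without storing a codebook.

```python
import math
import numpy as np


def pvq_quantize(x, K):
    """Round x to an integer vector y with sum(|y|) == K (illustrative heuristic)."""
    x = np.asarray(x, dtype=np.float64)
    l1 = np.abs(x).sum()
    if l1 == 0:
        # Degenerate input: put all K pulses on the first coordinate.
        y = np.zeros(x.shape, dtype=np.int64)
        y[0] = K
        return y
    # Project |x| onto the L1 sphere of radius K.
    p = K * np.abs(x) / l1
    y = np.floor(p).astype(np.int64)
    # Hand the leftover pulses to the coordinates with the largest remainders.
    remaining = K - int(y.sum())
    if remaining > 0:
        order = np.argsort(-(p - y))
        y[order[:remaining]] += 1
    # Restore the signs of the original vector.
    return np.sign(x).astype(np.int64) * y


def pvq_count(D, K):
    """Number of integer points in D dimensions whose absolute values sum to K."""
    table = [[0] * (K + 1) for _ in range(D + 1)]
    for d in range(D + 1):
        table[d][0] = 1  # only the zero vector has L1 norm 0
    for d in range(1, D + 1):
        for k in range(1, K + 1):
            table[d][k] = table[d - 1][k] + table[d - 1][k - 1] + table[d][k - 1]
    return table[D][K]


if __name__ == "__main__":
    y = pvq_quantize([0.5, -0.3, 0.2], K=4)
    direction = y / np.linalg.norm(y)  # decoded point on the unit (L2) sphere
    bits_per_weight = math.log2(pvq_count(D=16, K=8)) / 16
    print(y, direction, bits_per_weight)
```

Because every lattice point can be enumerated from D and K alone, the index of a point is all that needs storing; the bits-per-weight figure printed above covers only that index and ignores any separately stored scale.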

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 8 · Confidence 3

Strengths

-

Weaknesses

- Please provide proper referencing for “coherence processing” on page 2.
- On page 3, \hat{W} is not well introduced. How is it different from W? Also, \hat{W} should be bold since it is a matrix.
- Move the legend in Fig. 2.
- Introduce all sets and parameters before their first appearance: S_D,k , P_D,K are not introduced in Section 2.4. Sets should be distinguishable from matrices; you could use calligraphic S and P for sets. Define “G” in Eq. 7.
- scalars better be represented b

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The topic of quantizing LLMs is timely and important. Quite a few commercial tools offer bit-level compression of LLMs, and it appears that theoretical treatment of the topic is lagging. Analyzing and discussing PVQ in the context of LLMs is a solid contribution because the method is used in some applications. The PVQ description and illustrations are clear and engaging. Performance comparison to other methods is generally relevant, although I have comments below.

Weaknesses

The discussion concerning the advantages of PVQ over other quantization techniques is misleading because the advantage of an implicit codebook generally exists in all practical VQ techniques. See standard references in the topic like: [1] Gray, Robert M., and David L. Neuhoff. "Quantization." IEEE Transactions on Information Theory 44, no. 6 (1998): 2325-2383. Additionally, there is no discussion on the motivation for using VQ compared to scalar quantization in LLMs. Here I'd expect the autho

Reviewer 03 · Rating 1 · Confidence 5

Strengths

The idea of using spherical code to compress weight and activation may be worth studying. If it can be shown convincingly that this approach indeed outperforms others, then it can make a worthwhile contribution to the area of LLM quantization/compression.

Weaknesses

I found the paper poorly written, lacking clarity, and missing some important baseline comparisons to be convincing. 1. Quantization can be used for different purposes in LLMs. It is done for compressing LLMs (either for storage and transportation across platforms, or for reducing memory usage by reusing weights), but it can also be done for approximate computation. Depending on the targeted usage scenario, the evaluation needs to be designed more carefully. It was never made clear what scenar

Taxonomy

Topics: Digital Filter Design and Implementation · Advanced Data Compression Techniques · Coding theory and cryptography