Getting Free Bits Back from Rotational Symmetries in LLMs

Jiajun He; Gergely Flamich; José Miguel Hernández-Lobato

arXiv:2410.01309·cs.IT·October 3, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a storage format based on bits-back coding that exploits rotational symmetries in Transformer weights, reducing the total bit usage of SliceGPT-pruned LLMs by 3-5% without affecting model performance within a certain numerical precision.

Contribution

It presents a novel encoding method that leverages rotational symmetries to store Transformer weights more efficiently than the usual array layout at the same floating-point precision.

Findings

01. Achieved a 3-5% reduction in total bit usage for LLMs
02. No impact on model performance within a certain numerical precision
03. Applicable across different model sizes and architectures

Abstract

Current methods for compressing neural network weights, such as decomposition, pruning, quantization, and channel simulation, often overlook the inherent symmetries within these networks and thus waste bits on encoding redundant information. In this paper, we propose a format based on bits-back coding for storing rotationally symmetric Transformer weights more efficiently than the usual array layout at the same floating-point precision. We evaluate our method on Large Language Models (LLMs) pruned by SliceGPT (Ashkboos et al., 2024) and achieve a 3-5% reduction in total bit usage for free across different model sizes and architectures without impacting model performance within a certain numerical precision.
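
To make the rotational symmetry mentioned in the abstract concrete, here is a minimal NumPy sketch. It is not the paper's coding scheme and does not implement bits-back coding; it only illustrates, under simplifying assumptions, why rotated weights are redundant. The assumptions are ours: a single linear layer preceded by an RMSNorm without a learned scale, and the hypothetical helper names `rms_norm` and `random_orthogonal`.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale (an assumption of this sketch):
    # divide each row by its root-mean-square.
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def random_orthogonal(d, rng):
    # The Q factor of a QR decomposition of a Gaussian matrix is orthogonal.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs so the diagonal of R is positive; Q stays orthogonal.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
d, d_out, n = 16, 8, 4
x = rng.standard_normal((n, d))      # toy hidden states
w = rng.standard_normal((d, d_out))  # toy weight matrix
q = random_orthogonal(d, rng)        # arbitrary rotation of the hidden space

# Original computation versus the same computation in rotated coordinates:
# the hidden states are rotated by Q and the weights absorb Q^T.
y_original = rms_norm(x) @ w
y_rotated = rms_norm(x @ q) @ (q.T @ w)

max_abs_diff = float(np.max(np.abs(y_original - y_rotated)))
print(max_abs_diff)  # agrees up to floating-point rounding
```

Because every orthogonal `q` yields the same outputs, a plain array layout spends bits distinguishing weight sets that compute the same function. An orthogonal matrix of size d has d(d-1)/2 degrees of freedom, which gives a rough sense of how much redundancy a symmetry-aware format can target.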

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The proposed method is novel and insightful.
2. The proposed method is well-motivated and has the potential to make a broader impact.
3. The experiment results are promising.

Weaknesses

1. The writing is not self-contained. Specifically, the paper relies heavily on bits-back coding, but it does not properly connect SliceGPT with previously proposed bits-back coding methods, so the actual algorithm is hard to understand.
2. It is questionable whether the method can be applied in the real world, given the compression/decompression and matrix decomposition procedures involved. The method could run slower than the vanilla Transformer model.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

* The paper proposes a method to compress the rotation matrix Q introduced by SliceGPT using the bits-back algorithm, effectively reducing the parameter overhead.
* It demonstrates that the rotation matrix Q can be encoded and decoded using the bits-back algorithm without requiring a calibration set, relying solely on the weight matrix.
* The study shows that while the actual compression rate of SliceGPT with the rotation matrix Q is approximately 9%, the proposed method can achieve a closer-to-

Weaknesses

* The paper lacks sufficient analysis and experimentation regarding the practical impact on latency and throughput during inference when decoding the rotation matrix Q using the proposed method.
* The proposed method is somewhat limited in scope, as it can only be applied after the implementation of SliceGPT, thereby restricting its applicability.
* The actual benefits of encoding the rotation matrix Q in terms of inference latency and throughput might be minimal. It is likely that during the pr

Reviewer 03 · Rating 8 · Confidence 2

Strengths

- The paper applies bits-back coding to neural network models, focusing mainly on large language model compression.
- The proposed method is computationally feasible since it runs without retraining.
- The technique is evaluated on models such as OPT and Llama-2, showing that performance, measured by perplexity, is not significantly affected.

Weaknesses

- The approach is inherently specific to SliceGPT pruning and the Transformer architecture, which may limit its use with other neural networks or pruning techniques.
- The methodology relies only on Transformer architectures, so applicability to lighter models suited to edge devices could be considered.

Taxonomy

Topics: Particle Accelerators and Free-Electron Lasers · Particle physics theoretical and experimental studies · Particle accelerators and beam dynamics

Methods: Linear Layer · Multi-Head Attention · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding