Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats
Anat Heilper, Doron Singer

TL;DR
This paper extends lossless compression techniques to low-precision neural network formats like FP8 and FP4, achieving significant size reductions and enabling efficient deployment of large models.
Contribution
It introduces a compression method for low-precision formats that separates the exponent and mantissa components and entropy-codes each independently, and demonstrates its effectiveness on model weights and K/V caches in LLMs.
Findings
Compression ratios up to 62% for BF16 and 83% for FP8
Effective compression of K/V cache tensors
Extension of ZipNN to lower-precision formats
Abstract
As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods, particularly those based on Huffman encoding of floating-point exponents, can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates and compresses the exponent and mantissa components independently using entropy coding. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache…
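To make the exponent/mantissa separation concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the common FP8 E4M3 bit layout (1 sign bit, 4 exponent bits, 3 mantissa bits), groups the sign bit with the mantissa as an illustrative choice, and uses empirical Shannon entropy as an idealized stand-in for a real entropy coder such as Huffman. The function names `split_e4m3`, `entropy_bits`, and `estimate_ratio` are hypothetical.

```python
import math
from collections import Counter


def split_e4m3(byte):
    """Split one FP8 byte into (sign, exponent, mantissa), assuming E4M3 layout."""
    sign = (byte >> 7) & 0x1
    exponent = (byte >> 3) & 0xF  # 4 exponent bits
    mantissa = byte & 0x7         # 3 mantissa bits
    return sign, exponent, mantissa


def entropy_bits(symbols):
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def estimate_ratio(raw_bytes):
    """Estimate compressed size / original size when the exponent stream and
    the sign+mantissa stream are entropy-coded independently.

    This is an idealized bound, not the output of an actual coder.
    """
    exponents = []
    sign_mantissas = []
    for b in raw_bytes:
        s, e, m = split_e4m3(b)
        exponents.append(e)
        sign_mantissas.append((s << 3) | m)  # assumption: sign travels with mantissa
    bits_per_value = entropy_bits(exponents) + entropy_bits(sign_mantissas)
    return bits_per_value / 8.0


# Synthetic example: every value shares exponent 7, while sign and mantissa
# take all 16 combinations equally often.
raw = [(s << 7) | (7 << 3) | m for s in range(2) for m in range(8)]
ratio = estimate_ratio(raw)
```

In this synthetic case the exponent stream carries no information and the sign+mantissa stream is incompressible, so the estimate is half the original size. Real weight tensors typically show a skewed rather than constant exponent distribution, which is the property exponent-focused coders exploit.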
