Huff-LLM: End-to-End Lossless Compression for Efficient LLM Inference

Patrick Yubeaton; Tareq Mahmoud; Shehab Naga; Pooria Taheri; Tianhua Xia; Arun George; Yasmein Khalil; Sai Qian Zhang; Siddharth Joshi; Chinmay Hegde; Siddharth Garg

arXiv:2502.00922 · cs.LG · February 4, 2025

TL;DR

Huff-LLM introduces an end-to-end lossless compression method for large language models, enabling efficient storage, reduced bandwidth, and improved inference latency and energy efficiency on edge devices.

Contribution

It presents a novel lossless compression technique for LLMs that preserves model behavior and enhances deployment efficiency across various hardware platforms.

Findings

01 Enables storage of larger models in memory

02 Reduces bandwidth for weight loading

03 Improves inference latency and energy efficiency

Abstract

As they become more capable, large language models (LLMs) have continued to increase rapidly in size. This has made it harder to run state-of-the-art LLMs on small edge devices. Standard approaches address this problem with compression techniques such as quantization or pruning. However, these techniques are lossy and have been shown to change model behavior in unpredictable ways. We propose Huff-LLM, an end-to-end, lossless model compression method that lets users store LLM weights in compressed format everywhere: cloud, disk, main memory, and even on-chip memory/buffers. This not only lets us load larger models in main memory, but also reduces the bandwidth required to load weights on chip and makes more efficient use of on-chip weight buffers. In addition to the memory savings achieved via compression, we also…
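To make "lossless" concrete, the following is a minimal sketch of entropy-coding weights so that they decode bit-for-bit. It is not the paper's implementation. The abstract does not say which coder is used or which bits are coded, so the sketch makes several assumptions: Huffman coding (suggested only by the method's name), FP16 weights, coding of the 5-bit exponent field alone, and synthetic Gaussian values standing in for real weights. The function names `build_huffman_code`, `encode`, `decode`, and `demo` are hypothetical.

```python
import heapq
import random
import struct
from collections import Counter


def build_huffman_code(symbols):
    """Return {symbol: bitstring} for a prefix-free Huffman code."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the codeword of every symbol into one bitstring."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode(); works because the code is prefix-free."""
    inverse = {c: s for s, c in code.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out


def demo(n=4096, seed=0):
    """Round-trip synthetic FP16 'weights' (assumption: not real model data)."""
    rng = random.Random(seed)
    # Raw 16-bit patterns of small Gaussian values packed as IEEE half floats.
    words = [
        struct.unpack("<H", struct.pack("<e", rng.gauss(0.0, 0.02)))[0]
        for _ in range(n)
    ]
    exponents = [(w >> 10) & 0x1F for w in words]  # 5-bit exponent field
    rest = [w & 0x83FF for w in words]             # sign + 10 mantissa bits, kept raw
    code = build_huffman_code(exponents)
    bits = encode(exponents, code)
    restored = [r | (e << 10) for r, e in zip(rest, decode(bits, code))]
    # Size estimate ignores the cost of storing the code table itself.
    compressed_bits = len(bits) + 11 * n
    return words == restored, compressed_bits / (16 * n)
```

Calling `demo()` returns a pair: whether the restored bit patterns equal the originals, and the compressed size as a fraction of the original 16 bits per weight. Only the exponent field is entropy-coded here because, for values clustered near zero, exponents repeat far more often than sign or mantissa bits; whether Huff-LLM makes the same choice is not stated in the abstract.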

Taxonomy

Topics: Algorithms and Data Compression · Handwritten Text Recognition Techniques · Natural Language Processing Techniques