EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices

Arnab Sanyal; Gourav Datta; Prithwish Mukherjee; Sandeep P. Chinchali; Michael Orshansky

arXiv:2505.02380·cs.LG·May 5, 2026

TL;DR

EntroLLM is an entropy-based weight compression method that combines mixed quantization with Huffman coding. It targets large language model inference on edge devices, reporting up to 30% storage savings over uint8 models, up to 65% over uint4 models, and 31.9-146.6% faster inference.

Contribution

The paper presents EntroLLM, a compression framework that enhances weight compressibility and inference efficiency without retraining, suitable for deployment on resource-constrained edge hardware.

Findings

1. Tensor-level quantization lowers weight entropy and increases compressibility, improving downstream Huffman encoding by 7x (8-bit) and 11.3x (4-bit) over state-of-the-art methods.

2. Up to 30% storage savings over uint8 models and up to 65% over uint4 models.

3. Achieves 31.9-146.6% faster inference on memory-limited edge devices like NVIDIA JETSON P3450.
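To put the storage figures in perspective, the sketch below converts a fractional saving into an average number of stored bits per weight. It assumes the saving applies uniformly to the weight payload and ignores codebooks, scales, and other metadata, which the summary does not break down. The helper name `effective_bits` is hypothetical and not taken from the paper.

```python
def effective_bits(baseline_bits: float, savings_fraction: float) -> float:
    """Average bits per weight left after saving `savings_fraction` of a fixed-width baseline."""
    if not 0.0 <= savings_fraction < 1.0:
        raise ValueError("savings_fraction must be in [0, 1)")
    return baseline_bits * (1.0 - savings_fraction)


# Best-case ("up to") figures quoted in the findings.
uint8_bits = effective_bits(8, 0.30)  # about 5.6 bits per weight
uint4_bits = effective_bits(4, 0.65)  # about 1.4 bits per weight

print(f"uint8 baseline: ~{uint8_bits:.1f} bits/weight after entropy coding")
print(f"uint4 baseline: ~{uint4_bits:.1f} bits/weight after entropy coding")
```

Because the reported savings are upper bounds, these values are best-case estimates rather than typical ones.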

Abstract

Large Language Models (LLMs) achieve strong performance across tasks, but face storage and compute challenges on edge devices. We propose EntroLLM, a compression framework combining mixed quantization and entropy coding to reduce storage while preserving accuracy. We use a combination of unsigned and asymmetric quantization. Tensor-level quantization produces an entropy-reducing effect, increasing weight compressibility, and improving downstream Huffman encoding by $7\times$ (8-bit) and $11.3\times$ (4-bit) over state-of-the-art methods. Huffman coding further reduces memory bandwidth demands, while a parallel decoding strategy enables efficient weight retrieval with minimal latency. Experiments on edge-scale LLMs (smolLM-1.7B, phi3-mini-4k, mistral-7B) show up to $30\%$ storage savings over uint8 and $65\%$ over uint4 models, with $31.9\%$ to $146.6\%$ faster inference on memory-limited…
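The pipeline the abstract describes (quantize a tensor, then entropy-code the integer weights) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic Gaussian weights, a single per-tensor asymmetric uint8 quantizer, and a textbook Huffman code, and it omits the mixed per-layer scheme and the parallel decoder. All function names are hypothetical.

```python
import heapq
import math
import random
from collections import Counter


def quantize_asymmetric(weights, bits=8):
    """Per-tensor asymmetric quantization: one scale and one offset for the whole tensor."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant tensor
    return [round((w - lo) / scale) for w in weights], scale, lo


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, tie_breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every merge adds one bit to each symbol below it
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def average_code_bits(symbols):
    """Average Huffman code length in bits per symbol."""
    lengths = huffman_code_lengths(symbols)
    counts = Counter(symbols)
    return sum(counts[s] * lengths[s] for s in counts) / len(symbols)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(20000)]  # synthetic stand-in for a layer
q, scale, zero = quantize_asymmetric(weights, bits=8)
H = entropy_bits(q)
L = average_code_bits(q)
print(f"entropy: {H:.2f} bits/weight, Huffman: {L:.2f} bits/weight, fixed width: 8")
```

Since bell-shaped weights leave most of the 256 levels rarely used, the Huffman average falls below the fixed 8-bit width. The gap between `L` and 8 is the storage saving that entropy coding contributes in this toy setting.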
