Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression

Feng Cheng; Cong Guo; Chiyue Wei; Junyao Zhang; Changchun Zhou; Edward Hanson; Jiaqi Zhang; Xiaoxiao Liu; Hai "Helen" Li; Yiran Chen

arXiv:2505.06901·cs.AR·May 13, 2025

TL;DR

Ecco is an entropy-aware cache compression method for LLMs that significantly reduces memory and latency overheads while maintaining accuracy, enabling more efficient deployment of large-scale models.

Contribution

Ecco introduces an entropy-based cache compression technique for LLMs that combines group-wise non-uniform quantization, pre-defined shared k-means patterns, and parallel Huffman decoding.
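As a rough illustration of how these pieces can fit together, the sketch below quantizes one group of values against a shared codebook and then measures how many bits a Huffman code would need for the resulting indices. It is not Ecco's implementation: the function names, the per-group max-magnitude scaling, and the 8-entry `CODEBOOK` (a stand-in for a k-means pattern) are all assumptions made for the example, and the parallel decoder is not shown.

```python
import heapq
from collections import Counter

# Hypothetical shared pattern (stand-in for a k-means codebook); 8 entries = 3-bit indices.
CODEBOOK = [-1.0, -0.5, -0.2, -0.05, 0.05, 0.2, 0.5, 1.0]


def quantize_group(values, codebook):
    """Scale a group by its max magnitude, then map each value to its nearest codebook entry."""
    scale = max(abs(v) for v in values) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda i: abs(v / scale - codebook[i]))
        for v in values
    ]
    return scale, indices


def dequantize_group(scale, indices, codebook):
    """Reconstruct approximate values from the group scale and codebook indices."""
    return [scale * codebook[i] for i in indices]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items are (weight, unique tiebreak, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]


# Made-up group of cache values: mostly small, with one outlier.
group = [0.02, -0.03, 0.9, 0.04, -0.01, 0.1, -0.4, 0.03]
scale, indices = quantize_group(group, CODEBOOK)
lengths = huffman_code_lengths(indices)
huffman_bits = sum(lengths[i] for i in indices)
fixed_bits = 3 * len(indices)  # cost of storing raw 3-bit indices
```

Because most values in the group fall on the same few codebook entries, the index stream is skewed and `huffman_bits` comes out smaller than `fixed_bits`. That skew is the kind of entropy characteristic an entropy coder can exploit on top of quantization.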

Findings

01. Achieves up to 2.9× speedup over state-of-the-art methods.

02. Increases memory capacity by nearly 4× while maintaining accuracy.

03. Reduces Huffman decoding latency by two orders of magnitude.

Abstract

Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained environments. Quantization techniques have emerged as a critical solution, reducing data precision to enhance memory and computational efficiency. However, existing methods often suffer from high runtime overheads and potential accuracy degradation. To address these challenges, we propose Ecco, an entropy-based cache compression technique tailored for LLMs. Ecco combines group-wise and non-uniform quantization with pre-defined shared k-means patterns and Huffman coding to exploit the inherent entropy characteristics of LLM cache data. Recognizing the inefficiencies of traditional Huffman coding in terms of parallelism and latency, we introduce a…

Taxonomy

Topics: Natural Language Processing Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy