UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective

Jing Xiong; Jianghan Shen; Fanghua Ye; Chaofan Tao; Zhongwei Wan; Jianqiao Lu; Xun Wu; Chuanyang Zheng; Zhijiang Guo; Min Yang; Lingpeng Kong; Ngai Wong

arXiv:2410.03090·cs.CL·September 25, 2025

TL;DR

UNComp is an uncertainty-aware framework that uses truncated matrix entropy to identify sparsity in large language models, enabling adaptive KV cache compression that shrinks the cache to 4.74% of its original size and improves throughput by 6.4x.

Contribution

This work presents UNComp, a framework that uses truncated matrix entropy as an uncertainty measure to detect sparsity patterns, which then guide adaptive rather than uniform compression in LLMs.

Findings

01  Reduces KV cache size to 4.74% of original

02  Achieves 6% prefill speedup

03  Improves throughput by 6.4x

Abstract

Deploying large language models (LLMs) for long-context inference remains challenging due to their substantial memory and computational demands. While techniques such as Key-Value (KV) cache compression are designed to reduce memory usage, they often neglect the structured sparsity inherent in the relationship between hidden states and their corresponding KV cache. In this work, we explore the role of uncertainty as a potential indicator of sparsity within LLMs. We propose UNComp, an uncertainty-aware framework that leverages truncated matrix entropy to identify areas of low information content, thereby revealing sparsity patterns that can be used for adaptive compression. Unlike traditional methods that apply uniform compression, UNComp dynamically adjusts its approach to compression, guided by uncertainty measures that reflect the importance of various model components. Our analysis…
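To make the idea of truncated matrix entropy concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the function name, the parameter `k`, the use of an uncentered Gram matrix, and the choice to renormalise over only the retained eigenvalues are all assumptions made for illustration. The intent is only to show how an entropy over the leading eigenvalues can separate low-information (nearly low-rank) representations from high-information ones.

```python
import numpy as np


def truncated_matrix_entropy(states: np.ndarray, k: int) -> float:
    """Entropy of the top-k eigenvalue spectrum of states^T @ states.

    states: (num_tokens, dim) array, e.g. hidden states or keys of one head.
    k: number of leading eigenvalues kept (hypothetical parameter name).
    """
    gram = states.T @ states                    # (dim, dim), symmetric PSD
    eigvals = np.linalg.eigvalsh(gram)[::-1]    # eigenvalues, sorted descending
    top = np.clip(eigvals[:k], 0.0, None)       # clip tiny negative round-off
    total = top.sum()
    if total <= 0.0:
        return 0.0                              # all-zero input carries no information
    p = top / total                             # assumption: renormalise over kept eigenvalues
    p = p[p > 0.0]                              # skip zeros so log is defined
    return float(-np.sum(p * np.log(p)))


# Illustrative inputs only: a rank-1 matrix versus a dense random matrix.
rng = np.random.default_rng(0)
low_rank = np.outer(rng.normal(size=64), rng.normal(size=16))
dense = rng.normal(size=(64, 16))
print(truncated_matrix_entropy(low_rank, k=8))
print(truncated_matrix_entropy(dense, k=8))
```

Under these assumptions, the rank-1 input should score close to zero and the dense input should score closer to the maximum of log(k). A compressor in this spirit could then assign smaller cache budgets to the components with lower scores, though how UNComp maps entropy to budgets is not specified in the text above.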

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Dense Connections · Feedforward Network · Grouped-query attention · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings