ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

Ruibo Fan; Xiangrui Yu; Xinglin Pan; Zeyu Li; Weile Luo; Qiang Wang; Wei Wang; Xiaowen Chu

arXiv:2603.17435·cs.DC·March 19, 2026·ASPLOS

Open Access

TL;DR

ZipServ is a lossless compression framework for LLM inference that reduces model size by up to 30% and accelerates GPU-based inference by co-designing the compression format and the computation kernels around the hardware.

Contribution

The paper introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE) and a fused decompression-GEMM kernel (ZipGEMM), which together enable lossless, hardware-aware acceleration of LLM inference.

Findings

01 Model size reduced by up to 30%

02 Achieves up to 2.21x kernel speedup

03 Speeds up end-to-end inference by 1.22x

Abstract

Lossless model compression holds tremendous promise for alleviating the memory and bandwidth bottlenecks in bit-exact Large Language Model (LLM) serving. However, existing approaches often result in substantial inference slowdowns due to fundamental design mismatches with GPU architectures: at the kernel level, variable-length bitstreams produced by traditional entropy codecs break SIMT parallelism; at the system level, decoupled pipelines lead to redundant memory traffic. We present ZipServ, a lossless compression framework co-designed for efficient LLM inference. ZipServ introduces Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a novel fixed-length format that enables constant-time, parallel decoding, together with a fused decompression-GEMM (ZipGEMM) kernel that decompresses weights on-the-fly directly into Tensor Core registers. This "load-compressed, compute-decompressed"…
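
To make the idea of a fixed-length, bitmap-based lossless format more concrete, here is a minimal Python sketch. It is not the paper's TCA-TBE layout, which the abstract does not specify. It assumes BF16 weights (1 sign bit, 8 exponent bits, 7 mantissa bits), assumes the exponent distribution is skewed enough that a handful of values dominate, and stores a 3-bit exponent code as three bit planes with an escape list for rare exponents. The names `encode` and `decode_at` and the dictionary layout are hypothetical.

```python
from collections import Counter

def encode(values):
    """Encode 16-bit BF16 bit patterns (ints in 0..65535) without loss.

    Illustrative layout only (an assumption, not the paper's format):
    - the 7 most frequent exponents get 3-bit codes 1..7, code 0 is an escape
    - each code bit is stored in its own bitmap (three bit planes)
    - sign and mantissa bits are kept verbatim, 8 bits per element
    """
    exps = [(v >> 7) & 0xFF for v in values]                   # 8 exponent bits
    rest = [((v >> 15) << 7) | (v & 0x7F) for v in values]     # sign + mantissa
    table = [e for e, _ in Counter(exps).most_common(7)]
    code_of = {e: i + 1 for i, e in enumerate(table)}
    planes = [0, 0, 0]   # Python ints used as arbitrarily long bitmaps
    escapes = []         # raw exponents for elements with code 0, in order
    for i, e in enumerate(exps):
        code = code_of.get(e, 0)
        if code == 0:
            escapes.append(e)
        for b in range(3):
            planes[b] |= ((code >> b) & 1) << i
    return {"table": table, "planes": planes, "escapes": escapes, "rest": rest}

def decode_at(enc, i):
    """Recover element i independently of every other element."""
    p0, p1, p2 = enc["planes"]
    code = ((p0 >> i) & 1) | (((p1 >> i) & 1) << 1) | (((p2 >> i) & 1) << 2)
    if code:
        exp = enc["table"][code - 1]
    else:
        # Count escapes before position i; on a GPU this would be a
        # popcount over a machine word rather than a big-integer scan.
        before = ~(p0 | p1 | p2) & ((1 << i) - 1)
        exp = enc["escapes"][bin(before).count("1")]
    r = enc["rest"][i]
    return ((r >> 7) << 15) | (exp << 7) | (r & 0x7F)

if __name__ == "__main__":
    weights = [0x3F80, 0xBF80, 0x3F00, 0x4000, 0x3F80]
    enc = encode(weights)
    assert [decode_at(enc, i) for i in range(len(weights))] == weights
```

Because every element's code sits at a known bit position, any element can be decoded without scanning a variable-length bitstream, which is the property the abstract contrasts with traditional entropy codecs. In this sketch an element with a common exponent costs 3 + 8 = 11 bits instead of 16; the paper's actual format, block sizes, and savings may differ.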

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Parallel Computing and Optimization Techniques