LEXI: Lossless Exponent Coding for Efficient Inter-Chiplet Communication in Hybrid LLMs
Miao Sun, Alish Kanani, Kaushik Shroff, Umit Ogras

TL;DR
LEXI is a lossless exponent compression scheme that significantly reduces inter-chiplet communication overheads in hybrid LLMs, leading to notable latency improvements without impacting accuracy.
Contribution
The paper introduces LEXI, a novel Huffman-based lossless exponent compression method tailored for efficient inter-chiplet communication in large language models.
Findings
Reduces inter-chiplet communication by 33-45%.
Decreases end-to-end inference latency by 30-35%.
Imposes only 0.09% area and energy overheads.
Abstract
Data movement overheads increase the inference latency of state-of-the-art large language models (LLMs). These models commonly use the bfloat16 (BF16) format for stable training. BF16 allocates eight bits to the exponent, but our profiling reveals that exponent streams exhibit fewer than 3 bits of Shannon entropy, indicating high inherent compressibility. To exploit this potential, we propose LEXI, a novel lossless exponent compression scheme based on Huffman coding. LEXI compresses activations and caches on the fly while storing compressed weights for just-in-time decompression near compute, without sacrificing system throughput or model accuracy. The codecs at the ingress and egress ports of network-on-chip routers sustain the maximum link bandwidth via multi-lane LUT decoders, incurring only 0.09 percent area and energy overheads in GF 22 nm technology. LEXI…
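The entropy argument in the abstract can be illustrated with a small software sketch. This is not the authors' hardware codec (which uses multi-lane LUT decoders in the routers); it is a minimal illustration, assuming synthetic Gaussian values as a stand-in for real tensors. The helper names `bf16_exponent`, `shannon_entropy`, and `huffman_code_lengths` are hypothetical. The sketch extracts the 8-bit BF16 exponent field, measures its Shannon entropy, and compares that with the average Huffman code length.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit biased exponent of x after truncating float32 to BF16."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16        # BF16 keeps the top 16 bits of float32 (truncation, no rounding)
    return (bits16 >> 7) & 0xFF  # drop the 7 mantissa bits, mask off the sign bit


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits per symbol of the empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts: Counter) -> dict:
    """Code length in bits for each symbol of a Huffman code built from counts."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, symbols in this subtree).
    heap = [(c, i, [sym]) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:      # every merge pushes these symbols one level deeper
            lengths[sym] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


# Synthetic stand-in data (an assumption, not the paper's profiled tensors).
random.seed(0)
values = [random.gauss(0.0, 0.02) for _ in range(10000)]

counts = Counter(bf16_exponent(v) for v in values)
entropy = shannon_entropy(counts)
lengths = huffman_code_lengths(counts)
avg_bits = sum(counts[s] * lengths[s] for s in counts) / len(values)

print(f"distinct exponents: {len(counts)}")
print(f"entropy: {entropy:.2f} bits, Huffman average: {avg_bits:.2f} bits, raw field: 8 bits")
```

Because Huffman coding is lossless and its average code length stays within one bit of the entropy, a low-entropy exponent stream compresses well below the fixed 8-bit field while the sign and mantissa bits pass through unchanged.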
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Interconnection Networks and Systems · Network Packet Processing and Optimization
