A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats
Jianyi Cheng, Cheng Zhang, Zhewen Yu, Christos-Savvas Bouganis, George A. Constantinides, Yiren Zhao

TL;DR
This paper introduces MASE, a compiler that explores mixed-precision Microscaling (MX) formats on dataflow hardware accelerators to enable efficient LLM inference with minimal accuracy loss.
Contribution
It presents a novel orchestration abstraction for exploring software and hardware optimizations with mixed-precision MX formats on dataflow accelerators for LLMs, achieving inference at an average precision of 4 bits with minimal accuracy degradation.
Findings
Achieves LLM inference at an average precision of 4 bits with minimal accuracy loss
Improves ΔAccuracy by 24% on average over 8-bit fixed-point designs, at an energy-efficiency overhead of only 3%
First to leverage fine-grain multi-precision MX formats in LLM hardware
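Because the design is mixed-precision, different tensors may use different MX element widths, so a headline figure such as "4-bit" is best read as an average over the model that also counts the per-block shared-scale overhead. The minimal sketch below assumes an 8-bit shared scale per block; the function names, layer sizes, and bit assignments are hypothetical illustrations, not values from the paper.

```python
def mx_bits_per_element(block_size, element_bits, scale_bits=8):
    # Each block stores block_size elements plus one shared scale,
    # so the scale cost is amortized over the whole block.
    return element_bits + scale_bits / block_size


def average_bits(layers):
    # layers: list of (num_elements, bits_per_element) pairs.
    # Average is weighted by how many elements each layer holds.
    total_elems = sum(n for n, _ in layers)
    total_bits = sum(n * b for n, b in layers)
    return total_bits / total_elems


# Hypothetical assignment: block size 16, mostly 4-bit elements, one 2-bit layer.
layers = [
    (4096, mx_bits_per_element(16, 4)),
    (4096, mx_bits_per_element(16, 2)),
    (8192, mx_bits_per_element(16, 4)),
]
print(average_bits(layers))  # prints 4.0
```

Here a 16-element block adds 8/16 = 0.5 bits per element, so 4-bit and 2-bit elements cost 4.5 and 2.5 bits respectively; weighting by layer size gives an average of exactly 4.0 bits.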
Abstract
Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. The quantization of recent large language models (LLMs) faces challenges to achieve competitive memory density compared to other models such as convolutional neural networks, since values in LLMs require larger dynamic ranges. Current hardware can expedite computation for LLMs using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can enhance accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats. MX formats are efficient data formats that achieve both large…
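The abstract's central idea, that MX formats pair integer-like element arithmetic with a floating-point-like dynamic range, can be illustrated with a minimal sketch. Assumptions: each block of values shares one power-of-two exponent (the scale), and each element is a signed integer of `mantissa_bits` bits; the function names, rounding mode, and exponent clamping are hypothetical. This is a simplified MX-style scheme, not the encoding defined in the OCP MX specification nor MASE's implementation.

```python
import math


def quantize_mx_block(block, mantissa_bits=4, scale_bits=8):
    """Hypothetical MX-style quantizer: signed-integer elements + one shared 2**exp scale."""
    # Largest magnitude one element can hold, e.g. 7 for 4-bit signed integers.
    qmax = 2 ** (mantissa_bits - 1) - 1
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 0, [0] * len(block)
    # Smallest shared exponent such that amax / 2**exp fits in [-qmax, qmax].
    exp = math.ceil(math.log2(amax / qmax))
    # Clamp to an assumed signed scale_bits-bit exponent range.
    lo, hi = -(2 ** (scale_bits - 1)), 2 ** (scale_bits - 1) - 1
    exp = max(lo, min(hi, exp))
    scale = 2.0 ** exp
    # Round to nearest (Python's round is round-half-to-even), then saturate.
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, q


def dequantize_mx_block(exp, q):
    # Every element in the block is multiplied by the same shared scale.
    return [v * 2.0 ** exp for v in q]


exp, q = quantize_mx_block([0.5, -1.4, 3.0, 0.1])
print(exp, q)                          # prints -1 [1, -3, 6, 0]
print(dequantize_mx_block(exp, q))     # prints [0.5, -1.5, 3.0, 0.0]
```

Here the block's largest magnitude (3.0) selects a shared exponent of -1 (scale 0.5), and the elements are stored as small integers. Because the scale adapts per block, a block of large values (for example `[100.0, -50.0]`, which gets exponent 4) and a block of small values can both use the full integer range, which is where the extra dynamic range comes from while the element arithmetic stays integer.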
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Advanced Neural Network Applications
