A Dataflow Compiler for Efficient LLM Inference using Custom Microscaling Formats

Jianyi Cheng; Cheng Zhang; Zhewen Yu; Christos-Savvas Bouganis; George A. Constantinides; Yiren Zhao

arXiv:2307.15517·cs.AR·April 22, 2024·1 cites

Open Access

TL;DR

This paper introduces MASE, a compiler that utilizes mixed-precision Microscaling (MX) formats to enable efficient LLM inference with minimal accuracy loss and significant energy efficiency improvements.

Contribution

It presents a novel orchestration abstraction for optimizing mixed-precision MX formats on hardware accelerators for LLMs, achieving 4-bit inference with minimal accuracy degradation.

Findings

01. Achieves 4-bit LLM inference with minimal accuracy loss

02. Improves energy efficiency by 24% over 8-bit fixed-point designs

03. First to leverage fine-grain multi-precision MX formats in LLM hardware

Abstract

Model quantization represents both parameters (weights) and intermediate values (activations) in a more compact format, thereby directly reducing both computational and memory cost in hardware. Quantizing recent large language models (LLMs) to a memory density competitive with other models, such as convolutional neural networks, is challenging because values in LLMs require larger dynamic ranges. Current hardware can expedite computation for LLMs using compact numerical formats such as low-bitwidth integers or floating-point numbers. Each has advantages: integer operations simplify circuit design, whereas floating-point calculations can enhance accuracy when a wider dynamic range is required. In this work, we seek an efficient data format that combines the best of both worlds: Microscaling (MX) formats. MX formats are efficient data formats that achieve both large…
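The abstract is cut off before it defines MX formats, so the following is only an illustrative sketch of the general block-scaling idea: a group of values shares one power-of-two scale, and each value is stored as a small signed integer. The function names, the 4-bit element width, and the choice of integer elements are assumptions made for this example; they are not taken from the paper or from MASE.

```python
import math


def mx_quantize_block(values, elem_bits=4):
    """Quantize one block to signed integers that share a power-of-two scale.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    qmax = 2 ** (elem_bits - 1) - 1  # e.g. 7 for 4-bit signed elements
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # Smallest exponent such that max_abs / 2**exp fits within qmax.
    shared_exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** shared_exp
    # Round to the nearest integer and clamp to the symmetric element range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return shared_exp, ints


def mx_dequantize_block(shared_exp, ints):
    """Reconstruct approximate real values from the shared exponent and integers."""
    scale = 2.0 ** shared_exp
    return [i * scale for i in ints]


block = [0.5, -1.0, 0.25, 3.5]
exp, q = mx_quantize_block(block, elem_bits=4)
approx = mx_dequantize_block(exp, q)
```

In this sketch the shared exponent supplies the dynamic range (the floating-point-like part), while the per-element arithmetic stays integer-only. Small values in a block dominated by one large value lose precision, which is one reason a finer-grained or mixed-precision choice of element width can matter.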

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Advanced Neural Network Applications