SmartQuant: CXL-based AI Model Store in Support of Runtime Configurable Weight Quantization

Rui Xie; Asad Ul Haq; Linsen Ma; Krystal Sun; Sanchari Sen; Swagath Venkataramani; Liu Liu; Tong Zhang

arXiv:2407.15866·cs.LG·August 20, 2024

Open Access

TL;DR

This paper proposes a CXL-based AI model store that enables runtime configurable weight quantization, improving inference efficiency, memory access speed, and energy efficiency for transformer models by leveraging hardware support and active CXL memory controllers.

Contribution

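It introduces a CXL-based design in which the memory controllers actively support runtime configurable weight quantization, addressing a question that prior work has largely left open: how variable weight quantization can proportionally improve model memory access speed and energy efficiency.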

Findings

01

Demonstrated improved inference efficiency on transformer models

02

Showed increased memory access speed and energy savings

03

Validated effectiveness through experimental results

Abstract

Recent studies have revealed that, during inference on generative AI models such as transformers, the importance of different weights exhibits substantial context-dependent variation. This points to a promising potential for adaptively configuring weight quantization to improve generative AI inference efficiency. Although configurable weight quantization can readily leverage the hardware support for variable-precision arithmetic in modern GPUs and AI accelerators, little prior research has studied how one could exploit variable weight quantization to proportionally improve AI model memory access speed and energy efficiency. Motivated by the rapidly maturing CXL ecosystem, this work develops a CXL-based design solution to fill this gap. The key is to allow CXL memory controllers to play an active role in supporting and exploiting runtime configurable weight quantization.…
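To make the idea of memory traffic scaling with precision concrete, here is a minimal sketch in Python. It assumes a bit-plane layout, where each bit position of the quantized weights is stored separately so that a lower-precision read fetches fewer planes. This layout and the helper names (`to_bit_planes`, `read_at_precision`, `bits_fetched`) are illustrative assumptions and are not taken from the abstract, which does not describe the storage format.

```python
def to_bit_planes(weights, bits=8):
    """Split unsigned `bits`-bit integers into bit planes, most significant first."""
    return [[(w >> b) & 1 for w in weights] for b in range(bits - 1, -1, -1)]


def read_at_precision(planes, k):
    """Rebuild weights from only the top `k` planes (a k-bit truncated view)."""
    total_bits = len(planes)
    values = [0] * len(planes[0])
    for plane in planes[:k]:
        values = [(v << 1) | bit for v, bit in zip(values, plane)]
    # Shift back so the truncated values sit on the original scale.
    return [v << (total_bits - k) for v in values]


def bits_fetched(planes, k):
    """Number of bits read from the store for a k-bit view."""
    return k * len(planes[0])


weights = [200, 37, 129, 255]          # hypothetical 8-bit quantized weights
planes = to_bit_planes(weights)
full = read_at_precision(planes, 8)    # all 8 planes: exact weights
half = read_at_precision(planes, 4)    # top 4 planes: coarser weights
print(full, bits_fetched(planes, 8))
print(half, bits_fetched(planes, 4))
```

Under this assumed layout, a 4-bit read touches half as many bits as an 8-bit read, which is the kind of proportional saving the abstract refers to. How the paper's CXL memory controller actually organizes and serves the weights may differ.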

Taxonomy

Topics: Scientific Computing and Data Management · Model-Driven Software Engineering Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings