Automatic mixed precision for optimizing gained time with constrained loss mean-squared-error based on model partition to sequential sub-graphs

Shmulik Markovich-Golan; Daniel Ohayon; Itay Niv; Yair Hanani

arXiv:2505.13060 · cs.LG · May 20, 2025

Open Access

TL;DR

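This paper introduces an automatic mixed-precision method for post-training quantization that maximizes inference time gain while keeping the loss mean squared error under a constraint. It combines a per-layer sensitivity metric with hardware-aware time-gain predictions over sequential sub-graphs of the model, and is validated on large language models.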

Contribution

It proposes a novel sensitivity metric derived from a first-order Taylor series expansion of the loss, and an Integer Programming (IP) formulation that selects the mixed-precision configuration while accounting for hardware constraints.
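To make the IP-based selection concrete, here is a minimal sketch of the kind of problem the title describes: pick one precision per sub-graph to maximize total time gain while the summed loss MSE stays under a budget. Everything in it is an assumption for illustration: the function name `best_mixed_precision`, the `bf16`/`fp8` labels, the numbers, and the exhaustive search, which stands in for a real IP solver and is only practical for tiny instances. It is not the authors' implementation.

```python
from itertools import product


def best_mixed_precision(options, mse_budget):
    """Pick one option per sub-graph, maximizing summed time gain.

    options[g] is a list of (label, time_gain, loss_mse) tuples for
    sub-graph g (hypothetical values). Both time gain and loss MSE are
    treated as additive across sub-graphs.
    Returns (labels, total_gain), or (None, 0.0) if nothing is feasible.
    """
    best_gain, best_choice = float("-inf"), None
    # Exhaustive search over every combination; a stand-in for an IP solver.
    for choice in product(*options):
        total_mse = sum(opt[2] for opt in choice)
        if total_mse > mse_budget:
            continue  # violates the loss-MSE constraint
        total_gain = sum(opt[1] for opt in choice)
        if total_gain > best_gain:
            best_gain, best_choice = total_gain, choice
    if best_choice is None:
        return None, 0.0
    return [opt[0] for opt in best_choice], best_gain


# Hypothetical calibration results for three sequential sub-graphs.
options = [
    [("bf16", 0.0, 0.0), ("fp8", 3.0, 0.5)],
    [("bf16", 0.0, 0.0), ("fp8", 2.0, 0.1)],
    [("bf16", 0.0, 0.0), ("fp8", 4.0, 0.75)],
]
config, gain = best_mixed_precision(options, mse_budget=1.0)
print(config, gain)
```

With these made-up numbers, quantizing all three sub-graphs would exceed the budget, so the search keeps the first sub-graph in high precision and lowers the other two.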

Findings

01. Effective sensitivity metric for layer-wise quantization

02. Hardware-aware time gain prediction model

03. Validated on large language models with real hardware

Abstract

Quantization is essential for Neural Network (NN) compression, reducing model size and computational demands by using lower bit-width data types, though aggressive reduction often hampers accuracy. Mixed Precision (MP) mitigates this tradeoff by varying the numerical precision across network layers. This study focuses on automatically selecting an optimal MP configuration within Post-Training Quantization (PTQ) for inference. The first key contribution is a novel sensitivity metric derived from a first-order Taylor series expansion of the loss function as a function of quantization errors in weights and activations. This metric, based on the Mean Square Error (MSE) of the loss, is efficiently calculated per layer using high-precision forward and backward passes over a small calibration dataset. The metric is additive across layers, with low calibration memory overhead as weight…
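As a rough illustration of a first-order sensitivity metric of this kind, the sketch below approximates the loss change from quantizing a layer as the sum of gradient times quantization error, and scores the layer by the sum of squared terms. That scoring assumes the per-weight errors are uncorrelated, which is an assumption of this sketch and not a statement of the paper's exact formula. The uniform quantizer, the helper names, and the numbers are likewise hypothetical.

```python
def quantize(x, step):
    """Uniform quantizer: snap x to the nearest multiple of step."""
    return round(x / step) * step


def layer_sensitivity(weights, grads, step):
    """First-order estimate of the loss MSE caused by quantizing one layer.

    Taylor expansion: delta_loss ~= sum_i grad_i * err_i. Assuming the
    errors are uncorrelated (an assumption of this sketch), the squared
    loss change is approximated by sum_i (grad_i * err_i) ** 2.
    """
    return sum(
        (g * (quantize(w, step) - w)) ** 2 for w, g in zip(weights, grads)
    )


# Hypothetical weights and high-precision gradients for two layers.
layers = {
    "layer_a": ([0.3, -0.7], [1.0, 2.0]),
    "layer_b": ([0.1, 0.9], [0.5, 0.5]),
}
per_layer = {
    name: layer_sensitivity(w, g, step=0.5) for name, (w, g) in layers.items()
}
# Additivity: the model-level estimate is the sum of per-layer scores.
total = sum(per_layer.values())
print(per_layer, total)
```

Because the score is a plain sum over layers, each layer can be calibrated independently and the results combined afterwards.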

Taxonomy

Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Advanced Data Compression Techniques

Methods: Sparse Evolutionary Training