Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization

Clemens JS Schaefer; Navid Lambert-Shirzad; Xiaofan Zhang; Chiachen Chou; Tom Jablin; Jian Li; Elfie Guo; Caitlin Stanton; Siddharth Joshi; Yu Emma Wang

arXiv:2306.04879·cs.LG·June 9, 2023·2 cites

TL;DR

This paper introduces a mixed-precision post-training quantization method that enhances layer sensitivity estimation by incorporating inter-layer dependencies, leading to significant latency reductions while preserving model accuracy.

Contribution

It proposes augmenting Hessian-based sensitivity analysis with inter-layer dependency information to improve quantization decisions in neural networks.

Findings

01 Achieves up to 33.28% latency reduction on BERT

02 Maintains model accuracy at 99.99% of the baseline

03 Improves accuracy-latency trade-offs across multiple models

Abstract

Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution that simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post-training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an…
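The general idea of mixed-precision assignment described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the sensitivity scores, layer names, sizes, bit-width choices, and the greedy budget rule below are all hypothetical stand-ins, and the paper's Hessian augmentation with inter-layer dependencies is not modeled. The sketch only shows (a) how symmetric uniform quantization error grows as precision drops and (b) one simple way to spend a bit budget on the most sensitive layers first.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed `bits`-bit levels, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def assign_bits(sensitivity, sizes, budget_bits, choices=(4, 8, 16)):
    """Greedy assignment (an illustrative assumption, not the paper's method).

    Every layer starts at the lowest precision; layers are then visited from
    most to least sensitive, and each takes the highest bit-width that still
    fits in the remaining budget (measured in total parameter bits).
    """
    low = min(choices)
    bits = {name: low for name in sensitivity}
    used = sum(sizes[name] * low for name in bits)
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        for b in sorted(choices, reverse=True):
            extra = sizes[name] * (b - bits[name])
            if used + extra <= budget_bits:
                used += extra
                bits[name] = b
                break
    return bits


# Hypothetical per-layer sensitivity scores and parameter counts.
sensitivity = {"embed": 0.9, "attn": 0.5, "ffn": 0.1}
sizes = {"embed": 100, "attn": 200, "ffn": 400}
print(assign_bits(sensitivity, sizes, budget_bits=6000))
# prints {'embed': 16, 'attn': 8, 'ffn': 4}
```

In this toy setup the most sensitive layer keeps 16 bits while the largest, least sensitive layer drops to 4 bits. A real system would replace the hand-written scores with measured sensitivities and the parameter-bit budget with a latency or memory model.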

Taxonomy

Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Pointwise Convolution · Batch Normalization · Dropout · Linear Layer · Depthwise Convolution · Attention Dropout · Linear Warmup With Linear Decay