Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization
Clemens JS Schaefer, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou, Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, Yu Emma Wang

TL;DR
This paper introduces a mixed-precision post-training quantization method that enhances layer sensitivity estimation by incorporating inter-layer dependencies, leading to significant latency reductions while preserving model accuracy.
Contribution
It proposes augmenting Hessian-based sensitivity analysis with inter-layer dependency information to improve quantization decisions in neural networks.
Findings
Achieves up to 33.28% latency reduction on BERT
Maintains 99.99% of baseline model accuracy
Improves accuracy-latency trade-offs across multiple models
Abstract
Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an…
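The general idea of giving each tensor a precision that matches its sensitivity can be illustrated with a small sketch. This is not the paper's algorithm: the per-layer `sensitivity` scores are hypothetical placeholders (standing in for something like a Hessian-based estimate), the function names are invented for illustration, and the greedy rule with an error `budget` is an assumption chosen for simplicity. The error at the higher precision is treated as negligible.

```python
# Illustrative sketch only, NOT the method proposed in the paper.
# `sensitivity` holds hypothetical per-layer scores (for example, a
# Hessian-based estimate); all names here are invented for illustration.

def quantize(weights, bits):
    """Symmetric uniform quantization of a list of floats to `bits` bits."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def quant_error(weights, bits):
    """Sum of squared errors introduced by quantizing to `bits` bits."""
    quantized = quantize(weights, bits)
    return sum((w - q) ** 2 for w, q in zip(weights, quantized))


def assign_precisions(layers, sensitivity, budget, high=16, low=8):
    """Greedily move the cheapest layers to `low` bits while the total
    sensitivity-weighted error stays within `budget`."""
    cost = {
        name: sensitivity[name] * quant_error(weights, low)
        for name, weights in layers.items()
    }
    bits = {name: high for name in layers}
    spent = 0.0
    for name in sorted(cost, key=cost.get):   # least costly layer first
        if spent + cost[name] > budget:
            break
        bits[name] = low
        spent += cost[name]
    return bits
```

For example, with `layers = {"a": [0.1, -0.5, 0.33], "b": [1.0, -0.25, 0.7]}`, `sensitivity = {"a": 1.0, "b": 1000.0}`, and `budget = 1e-4`, the sketch keeps the highly sensitive layer `"b"` at 16 bits and lowers `"a"` to 8 bits. Scoring each layer in isolation, as done here, is the kind of layer-wise estimate the abstract describes as typically insufficient on its own.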
Taxonomy
Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Pointwise Convolution · Batch Normalization · Dropout · Linear Layer · Depthwise Convolution · Attention Dropout · Linear Warmup With Linear Decay
