MF-QAT: Multi-Format Quantization-Aware Training for Elastic Inference
Zifei Xu, Sayeh Sharify, Hesham Mostafa

TL;DR
This paper introduces a multi-format quantization-aware training method that enables a single model to perform well across various numerical precisions and supports elastic inference with on-the-fly format conversion.
Contribution
It proposes a novel multi-format QAT approach and a practical conversion pipeline for elastic inference, allowing dynamic precision scaling during deployment.
Findings
Multi-format QAT can match single-format QAT at each target precision, and the resulting model also performs well on formats not seen during training.
The Slice-and-Scale conversion enables format changes without re-training.
The pipeline supports runtime format selection with minimal accuracy loss.
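To make the idea of training against several formats concrete, here is a minimal sketch of block-wise integer fake quantization at a selectable bit width. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fake_quant_mxint`, `sample_format`), the simplified shared-exponent rule, the candidate bit widths, and the choice to sample one format per training step are all hypothetical, since this summary does not say how formats are scheduled during training.

```python
import math
import random
from typing import List, Sequence


def fake_quant_mxint(block: Sequence[float], bits: int) -> List[float]:
    """Quantize and immediately dequantize one block with a shared power-of-two scale.

    Simplified MXINT-style scheme (an assumption, not the exact MX spec):
    the shared exponent is chosen so the largest magnitude in the block
    lands near the top of the signed `bits`-wide integer range.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    shared_exp = math.floor(math.log2(max_abs)) - (bits - 2)
    scale = 2.0 ** shared_exp
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    # Python's round() uses round-half-to-even; values are clamped to the int range.
    return [max(lo, min(hi, round(v / scale))) * scale for v in block]


def sample_format(rng: random.Random, formats: Sequence[int] = (8, 6, 4)) -> int:
    """Pick the bit width for one training step (hypothetical schedule)."""
    return rng.choice(list(formats))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [0.5, -0.25, 0.1, 0.0]
    for step in range(3):
        bits = sample_format(rng)
        print(step, bits, fake_quant_mxint(weights, bits))
```

In an actual QAT loop the fake-quantized weights would be used in the forward pass while gradients flow to the full-precision weights (for example via a straight-through estimator); that part is omitted here.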
Abstract
Quantization-aware training (QAT) is typically performed for a single target numeric format, while practical deployments often need to choose numerical precision at inference time based on hardware support or runtime constraints. We study multi-format QAT, where a single model is trained to be robust across multiple quantization formats. We find that multi-format QAT can match single-format QAT at each target precision, yielding one model that performs well overall across different formats, even formats that were not seen during training. To enable practical deployment, we propose the Slice-and-Scale conversion procedure for both MXINT and MXFP that converts a high-precision representation into lower-precision formats without re-training. Building on this, we introduce a pipeline that (i) trains a model with multi-format QAT, (ii) stores a single anchor format checkpoint (MXINT8/MXFP8),…
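The following is a minimal sketch of how a conversion from an MXINT8 anchor to a lower-precision MXINT format might work without re-training. It is one plausible reading of "slice and scale" and is not taken from the paper: it assumes each block stores signed integer mantissas plus a shared power-of-two exponent, drops the low-order mantissa bits with rounding (the "slice"), and raises the shared exponent by the number of dropped bits (the "scale"). The function names and the rounding and clamping choices are hypothetical, and the MXFP case is not covered.

```python
from typing import List, Tuple


def slice_and_scale_mxint(
    mantissas: List[int], shared_exp: int, src_bits: int = 8, dst_bits: int = 4
) -> Tuple[List[int], int]:
    """Convert one block from src_bits to dst_bits integer mantissas.

    Element values are interpreted as mantissa * 2**shared_exp.
    """
    shift = src_bits - dst_bits
    if shift < 0:
        raise ValueError("dst_bits must not exceed src_bits")
    if shift == 0:
        return list(mantissas), shared_exp
    lo, hi = -(1 << (dst_bits - 1)), (1 << (dst_bits - 1)) - 1
    half = 1 << (shift - 1)
    out = []
    for m in mantissas:
        q = (m + half) >> shift          # slice: drop low bits, rounding half up
        out.append(max(lo, min(hi, q)))  # clamp into the narrower signed range
    return out, shared_exp + shift       # scale: compensate in the shared exponent


def dequantize(mantissas: List[int], shared_exp: int) -> List[float]:
    """Recover real values from a block."""
    return [m * 2.0 ** shared_exp for m in mantissas]


if __name__ == "__main__":
    m8, e8 = [127, -128, 16, 7, -9, 0], -7
    m4, e4 = slice_and_scale_mxint(m8, e8)
    print(m4, e4)
    print(dequantize(m8, e8))
    print(dequantize(m4, e4))
```

Because the conversion only touches integer mantissas and the shared exponent, it could be applied at load time or on the fly from a single stored checkpoint, which is the property the pipeline relies on.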
