A Selective Quantization Tuner for ONNX Models
Nikolaos Louloudakis, Ajitha Rajan

TL;DR
SeQTO is a framework for selective quantization of ONNX models that balances accuracy and efficiency through profiling and multi-objective optimization, suitable for diverse hardware.
Contribution
We introduce SeQTO, a novel framework enabling optimized selective quantization and deployment of ONNX models across various hardware using Pareto optimization.
Findings
Achieves up to 54.14% lower accuracy loss
Maintains up to 98.18% size reduction
Effective across CPU and GPU devices
Abstract
Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is challenging. To this end, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates performance on metrics such as accuracy and size, applies…
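To make the multi-objective selection step concrete, the following is a minimal sketch of Pareto-front filtering over candidate models scored on two metrics, accuracy loss and size, both to be minimized. It is an illustration only, not SeQTO's implementation: the `Candidate` type, the `dominates` and `pareto_front` helpers, and all numbers are hypothetical and are not taken from the paper.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical selectively quantized model and its measured metrics."""
    name: str
    accuracy_loss: float  # percentage points lost vs. full precision; lower is better
    size_mb: float        # model size on disk; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if `a` is no worse than `b` on both metrics and strictly better on one."""
    no_worse = a.accuracy_loss <= b.accuracy_loss and a.size_mb <= b.size_mb
    strictly_better = a.accuracy_loss < b.accuracy_loss or a.size_mb < b.size_mb
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Made-up measurements for illustration; these are not results from the paper.
candidates = [
    Candidate("full_precision", accuracy_loss=0.0, size_mb=100.0),
    Candidate("fully_quantized", accuracy_loss=8.0, size_mb=25.0),
    Candidate("selective_a", accuracy_loss=2.0, size_mb=40.0),
    Candidate("selective_b", accuracy_loss=3.0, size_mb=60.0),  # worse than selective_a on both
]

front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data, `selective_b` is discarded because `selective_a` is both smaller and more accurate, while the other three remain as distinct trade-offs. A real tuner would likely add further objectives (for example latency per device) and pick one point from the front according to deployment constraints.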
Taxonomy
Topics: Advanced Control Systems Optimization
