A Selective Quantization Tuner for ONNX Models

Nikolaos Louloudakis; Ajitha Rajan

arXiv:2507.12196·cs.LG·February 3, 2026

Open Access

TL;DR

SeQTO is a framework for selective quantization of ONNX models that balances accuracy and efficiency through profiling and multi-objective optimization, suitable for diverse hardware.

Contribution

We introduce SeQTO, a novel framework enabling optimized selective quantization and deployment of ONNX models across various hardware using Pareto optimization.

Findings

01 Achieves up to 54.14% lower accuracy loss

02 Maintains up to 98.18% size reduction

03 Effective across CPU and GPU devices

Abstract

Quantization reduces the precision of deep neural networks to lower model size and computational demands, but often at the expense of accuracy. Fully quantized models can suffer significant accuracy degradation, and resource-constrained hardware accelerators may not support all quantized operations. A common workaround is selective quantization, where only some layers are quantized while others remain at full precision. However, determining the optimal balance between accuracy and efficiency is challenging. To this end, we propose SeQTO, a framework that enables selective quantization, deployment, and execution of ONNX models on diverse CPU and GPU devices, combined with profiling and multi-objective optimization. SeQTO generates selectively quantized models, deploys them across hardware accelerators, evaluates performance on metrics such as accuracy and size, applies…
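The multi-objective step can be illustrated with a minimal Pareto-front sketch. This is not SeQTO's implementation: the `Candidate` type, the `dominates` and `pareto_front` helpers, and all numeric values are hypothetical, chosen only to show how candidates scored on accuracy loss and size might be filtered down to the non-dominated ones.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A hypothetical selectively quantized model and its measured metrics."""
    name: str
    accuracy_loss: float  # percentage points lost vs. the original model; lower is better
    size_mb: float        # model size on disk; lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both metrics and strictly better on one."""
    no_worse = a.accuracy_loss <= b.accuracy_loss and a.size_mb <= b.size_mb
    strictly_better = a.accuracy_loss < b.accuracy_loss or a.size_mb < b.size_mb
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative (made-up) measurements, not results from the paper.
candidates = [
    Candidate("full_precision", accuracy_loss=0.0, size_mb=100.0),
    Candidate("fully_quantized", accuracy_loss=8.0, size_mb=25.0),
    Candidate("selective_a", accuracy_loss=2.0, size_mb=40.0),
    Candidate("selective_b", accuracy_loss=5.0, size_mb=45.0),  # worse than selective_a on both
]

front = pareto_front(candidates)
for c in front:
    print(f"{c.name}: loss={c.accuracy_loss}, size={c.size_mb} MB")
```

In this toy data, `selective_b` is dropped because `selective_a` is both smaller and more accurate; the remaining three candidates each represent a different trade-off, and choosing among them depends on the deployment constraints.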

Taxonomy

Topics: Advanced Control Systems Optimization