QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads

Muhammad Ahmad; Khurram Mazher; Saqib Akram; Ahmad Tameem; Saad Bin Nasir

arXiv:2505.07531 · cs.AI · September 15, 2025

TL;DR

QuantX is a hardware-aware quantization framework for large language models (LLMs) and vision-language models (VLMs). It quantizes weights down to 3 bits while keeping LLaVA-v1.6 within 6% of the unquantized model's performance, and it outperforms recently published quantization techniques.

Contribution

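QuantX provides a suite of quantization recipes designed around hardware-specific constraints so that dequantization stays efficient at inference time. The recipes keep performance close to the unquantized model, and one of them is integrated into the Llama.cpp framework.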

Findings

01

QuantX quantizes LLaVA-v1.6 to 3 bits while staying within 6% of the unquantized model's performance on multiple end-user tasks.

02

It outperforms recent state-of-the-art quantization techniques.

03

One QuantX technique is integrated into Llama.cpp, where its runtime is shown to be feasible compared with Llama.cpp's mainstream quantization techniques.

Abstract

We present QuantX: a tailored suite of recipes for LLM and VLM quantization. It is capable of quantizing down to 3-bit resolutions with minimal loss in performance. The quantization strategies in QuantX take into account hardware-specific constraints to achieve efficient dequantization during inference, ensuring a flexible trade-off between runtime speed, memory requirement, and model accuracy. Our results demonstrate that QuantX achieves performance within 6% of the unquantized model for LLaVA-v1.6 quantized down to 3 bits for multiple end-user tasks and outperforms recently published state-of-the-art quantization techniques. We further integrate one particular technique from QuantX into the popular Llama.cpp framework and show its feasibility in terms of runtime compared to the mainstream quantization techniques from Llama.cpp. Lastly, this manuscript provides insights into the LLM…
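
The abstract does not spell out the QuantX recipes, so the following is only a generic sketch of what 3-bit weight quantization and dequantization can look like, not the authors' method. It assumes plain uniform asymmetric quantization with one scale and one offset per group of weights. The function names and the group size of 32 are hypothetical choices made for illustration.

```python
import random

def quantize_group(values, bits=3):
    """Uniform asymmetric quantization of one group of floats (illustrative only)."""
    levels = (1 << bits) - 1                 # 7 for 3 bits, so codes lie in 0..7
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0        # fall back to 1.0 for a constant group
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float weights."""
    return [c * scale + lo for c in codes]

def quantize_tensor(weights, bits=3, group_size=32):
    """Quantize a flat list of weights group by group (group_size is an assumption)."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(128)]
packed = quantize_tensor(weights)
restored = [v for codes, scale, lo in packed
            for v in dequantize_group(codes, scale, lo)]

# Rounding to the nearest code bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
max_step = max(scale for _, scale, _ in packed)
print(f"max abs error: {max_err:.4f}, largest step: {max_step:.4f}")
```

In this sketch, dequantization is a single multiply-add per weight. Smaller groups reduce the reconstruction error but store more scales and offsets, which is one simple instance of the trade-off between memory and accuracy that the abstract mentions.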

Taxonomy

Topics: Advanced Neural Network Applications · Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques