# End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost

**Authors:** Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

arXiv: 2509.00031 · 2025-09-30

## TL;DR

ZeroQAT is a memory-efficient quantization-aware training (QAT) method for large language models. It replaces backpropagation with zeroth-order, forward-only gradient estimation, which makes on-device fine-tuning feasible at extremely low bit-widths (e.g., 2-4 bits).

## Contribution

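The paper proposes ZeroQAT, a zeroth-order QAT framework that supports both weight and activation quantization. By estimating gradients from forward passes only, it eliminates backpropagation and reduces the memory and computational cost of LLM quantization and fine-tuning. The paper also introduces a lightweight variant for quantized fine-tuning that freezes and pre-quantizes most parameters to cut memory usage further.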
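To make "forward-only gradient estimation" concrete, here is a minimal sketch of a two-point zeroth-order update applied to a loss that is evaluated on quantized weights. It is an illustration under stated assumptions, not the authors' implementation: the symmetric per-tensor quantizer, the Gaussian perturbation, and the names `fake_quantize`, `zo_projected_grad`, and `zo_sgd_step` are all hypothetical, and the summary does not specify which estimator or quantizer ZeroQAT actually uses.

```python
import random

def fake_quantize(weights, bits=4):
    """Quantize-dequantize on a symmetric per-tensor uniform grid (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return [0.0 for _ in weights]
    scale = max_abs / qmax
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def zo_projected_grad(loss_fn, weights, direction, eps=1e-3):
    """Two-point estimate of the slope of loss_fn along `direction`.

    Needs two forward evaluations and no backpropagation.
    """
    plus = [w + eps * z for w, z in zip(weights, direction)]
    minus = [w - eps * z for w, z in zip(weights, direction)]
    return (loss_fn(plus) - loss_fn(minus)) / (2.0 * eps)

def zo_sgd_step(loss_fn, weights, rng, lr=1e-2, eps=1e-3):
    """Sample a Gaussian direction, estimate the slope, and step against it."""
    direction = [rng.gauss(0.0, 1.0) for _ in weights]
    slope = zo_projected_grad(loss_fn, weights, direction, eps)
    return [w - lr * slope * z for w, z in zip(weights, direction)]

# Toy usage: the forward pass only ever sees 4-bit weights.
rng = random.Random(0)
inputs = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
targets = [sum(row) for row in inputs]  # generated by all-ones weights
latent = [rng.gauss(0.0, 1.0) for _ in range(4)]  # full-precision latent weights

def quantized_loss(weights):
    """Mean squared error of a linear model whose weights are fake-quantized."""
    q = fake_quantize(weights, bits=4)
    errors = [sum(a * b for a, b in zip(row, q)) - t
              for row, t in zip(inputs, targets)]
    return sum(e * e for e in errors) / len(errors)

updated = zo_sgd_step(quantized_loss, latent, rng)
```

The update needs only the scalar `slope` and the random direction, so no activations or gradient tensors have to be stored for a backward pass. That is the general reason zeroth-order methods can train at roughly inference-level memory. The cost is a noisier gradient estimate than backpropagation would give.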

## Key findings

- ZeroQAT consistently outperforms representative PTQ and QAT baselines in accuracy.
- It enables fine-tuning of a 13B model at 2-4 bits on a single 8GB GPU, and of a 6.7B model on a OnePlus 12 smartphone.
- It requires significantly less memory than backpropagation-based quantization-aware training.

## Abstract

Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a lightweight variant of ZeroQAT for quantized fine-tuning, which freezes and pre-quantizes most parameters to further cut memory usage. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory. For example, ZeroQAT enables fine-tuning of a 13B model at extremely low bit-widths (e.g., 2-4 bits) on a single 8GB GPU, and even allows fine-tuning a 6.7B model on a OnePlus 12 smartphone, demonstrating its practicality for end-to-end QAT on resource-limited edge devices.
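As a rough sanity check on the 8GB figure, the storage needed for the weights alone can be computed from the parameter count and bit-width. This is a back-of-the-envelope sketch rather than the paper's memory accounting: the helper `weight_memory_gb` is hypothetical, it uses decimal gigabytes, and it ignores quantization scales, activations, and any unfrozen full-precision parameters.

```python
def weight_memory_gb(num_params, bits):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9

for bits in (16, 4, 2):
    print(f"13B params at {bits}-bit: {weight_memory_gb(13e9, bits):.2f} GB")
# prints:
# 13B params at 16-bit: 26.00 GB
# 13B params at 4-bit: 6.50 GB
# 13B params at 2-bit: 3.25 GB
```

Under these assumptions a 13B model at 4 bits or fewer fits in 8GB only if training adds little memory beyond the weights themselves. That is consistent with the abstract's emphasis on avoiding backpropagation.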

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00031/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00031/full.md

## References

56 references — full list in the complete paper: https://tomesphere.com/paper/2509.00031/full.md

---
Source: https://tomesphere.com/paper/2509.00031