End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost
Qitao Tan, Xiaoying Song, Jin Lu, Guoming Li, Jun Liu, Lingzi Hong, Caiwen Ding, Jundong Li, Xiaoming Zhai, Shaoyi Huang, Wei Niu, Geng Yuan

TL;DR
ZeroQAT introduces a memory-efficient, zeroth-order optimization-based quantization-aware training method for large language models, enabling effective on-device fine-tuning at extremely low bit-widths with minimal resource requirements.
Contribution
It proposes ZeroQAT, a novel zeroth-order QAT framework that eliminates backpropagation, reducing memory and computational costs for LLM quantization and fine-tuning.
Findings
ZeroQAT outperforms traditional PTQ and QAT methods in accuracy.
Enables fine-tuning of large models on resource-limited devices.
Significantly reduces memory usage during quantization-aware training.
Abstract
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a…
| Method | Category | | Low-bit performance: Pre-train | Low-bit performance: Fine-tune | |
| SmoothQuant | Range PTQ | WA | ✗ | ✗ | High |
| GPTQ | Approx PTQ | WO | ✗ | ✗ | High |
| OmniQuant | Approx PTQ | WA | ✓ | ✗ | Moderate |
| LLM-QAT | Full QAT | WA | ✓ | ✓ | Low |
| QLoRA | PEFT QAT | WO | ✓ | ✓ | Moderate |
| EfficientQAT | PEFT QAT | WO | ✓ | ✓ | High |
| ZeroQAT | Full/PEFT QAT | WA | ✓ | ✓ | High |
| Method | Category | Quantized Pre-training PPL ↓ (W6A6) | Quantized Pre-training PPL ↓ (W2A16) | Quantized Pre-training PPL ↓ (W4A4) | Quantized Fine-tuning Acc ↑ (W6A6) | Quantized Fine-tuning Acc ↑ (W2A16g128) | Quantized Fine-tuning Acc ↑ (W4A4) |
| ZO (FP16) | - | 5.47 | 5.47 | 5.47 | 66.0 | 66.0 | 66.0 |
| Zero-shot | - | - | - | - | 41.3 | 41.3 | 41.3 |
| SmoothQuant | Range-based PTQ | 6.20 | 100.23‡ | 83.12 | 57.2 | 27.7‡ | 32.9‡ |
| OmniQuant | Approx-based PTQ | 5.87 | 37.37 | 14.26 | 63.9 | 40.6‡ | 38.8‡ |
| EfficientQAT | QAT | 5.60 | 33.40 | 76.32‡ | 66.4 | 45.4 | 28.6‡ |
| ZeroQAT | QAT | 5.76 | 29.61 | 12.95 | 65.3 | 54.1 | 55.7 |
| Llama / PPL | Method | Llama1-7B WIKI | Llama1-7B C4 | Llama1-13B WIKI | Llama1-13B C4 | Llama2-7B WIKI | Llama2-7B C4 | Llama2-13B WIKI | Llama2-13B C4 |
| FP16 | - | 5.68 | 7.08 | 5.09 | 6.61 | 5.47 | 6.97 | 4.88 | 6.46 |
| W2A16 | RTN | 1.1e5 | 1.3e5 | 6.8e4 | 5.6e4 | 3.8e4 | 4.8e4 | 5.6e4 | 7.2e4 |
| W2A16 | GPTQ | 5.6e4 | 689.13 | 5.5e3 | 6.97 | 7.7e3 | NAN | 2.1e3 | 323.12 |
| W2A16 | OmniQuant | 15.47 | 24.89 | 13.21 | 18.31 | 37.37 | 90.64 | 17.21 | 26.76 |
| W2A16 | ZeroQAT | 12.85 | 17.47 | 10.29 | 15.37 | 29.61 | 55.34 | 15.97 | 24.68 |
| W6A6 | SmoothQuant | 6.03 | 7.47 | 5.42 | 6.97 | 6.20 | 7.76 | 5.18 | 6.76 |
| W6A6 | OmniQuant | 5.96 | 7.43 | 5.28 | 6.84 | 5.87 | 7.48 | 5.14 | 6.74 |
| W6A6 | ZeroQAT | 5.85 | 7.47 | 5.96 | 7.01 | 5.76 | 8.81 | 5.10 | 6.70 |
| W4A4 | SmoothQuant | 25.25 | 32.32 | 40.05 | 47.18 | 83.12 | 77.27 | 35.88 | 43.19 |
| W4A4 | OmniQuant | 11.26 | 14.51 | 10.87 | 13.78 | 14.26 | 18.02 | 12.30 | 14.55 |
| W4A4 | ZeroQAT | 11.10 | 14.78 | 10.04 | 12.65 | 12.95 | 16.73 | 10.41 | 12.43 |
| Llama / Acc | #Bits | Method | PIQA | ARC-e | ARC-c | HellaSwag | Winogrande | Avg. |
| Llama-1-7B | FP16 | - | 77.47 | 72.38 | 41.46 | 73.00 | 67.07 | 65.26 |
| Llama-1-7B | W2A16 | RTN | 47.33 | 28.17 | 25.17 | 25.10 | 47.50 | 34.67 |
| Llama-1-7B | W2A16 | GPTQ | 57.38 | 36.62 | 25.00 | 42.50 | 49.38 | 40.35 |
| Llama-1-7B | W2A16 | EfficientQAT | 62.25 | 48.12 | 27.75 | 47.50 | 53.37 | 47.65 |
| Llama-1-7B | W2A16 | ZeroQAT | 68.25 | 53.87 | 27.62 | 51.62 | 57.38 | 51.75 |
| Llama-1-7B | W4A4 | SmoothQuant | 49.80 | 30.40 | 25.80 | 27.40 | 48.00 | 38.41 |
| Llama-1-7B | W4A4 | LLM-QAT | 51.50 | 32.57 | 28.63 | 31.10 | 51.90 | 41.39 |
| Llama-1-7B | W4A4 | LLM-QAT+SQ | 55.93 | 35.90 | 30.60 | 44.80 | 50.60 | 46.72 |
| Llama-1-7B | W4A4 | OS+ | 62.70 | 39.20 | 32.64 | 47.89 | 52.96 | 49.60 |
| Llama-1-7B | W4A4 | OmniQuant | 67.38 | 53.87 | 30.63 | 53.12 | 55.25 | 52.15 |
| Llama-1-7B | W4A4 | ZeroQAT | 66.98 | 54.12 | 32.19 | 57.85 | 54.37 | 53.11 |
| OPT / Acc | Method | OPT-2.7B SST-2 | OPT-2.7B CB | OPT-2.7B SQuAD | OPT-2.7B DROP | OPT-6.7B SST-2 | OPT-6.7B CB | OPT-6.7B SQuAD | OPT-6.7B DROP | OPT-13B SST-2 | OPT-13B CB | OPT-13B SQuAD | OPT-13B DROP |
| Zero-shot | - | 56.3 | 50.0 | 29.8 | 10.0 | 64.2 | 50.0 | 37.9 | 13.1 | 58.8 | 46.4 | 46.2 | 14.6 |
| FP16 (ZO) | - | 90.0 | 69.6 | 68.7 | 22.9 | 90.2 | 71.4 | 76.0 | 26.4 | 91.4 | 67.9 | 84.7 | 30.9 |
| W2A16g128 | RTN | 44.4 | 44.6 | 0.0 | 0.0 | 59.2 | 50.0 | 0.0 | 0.0 | 53.5 | 50.0 | 0.0 | 0.0 |
| W2A16g128 | QLoRA | 61.2 | 51.8 | 0.0 | 8.2 | 64.8 | 58.9 | 0.0 | 0.0 | 63.8 | 69.6 | 0.0 | 0.0 |
| W2A16g128 | OmniQuant | 72.8 | 55.4 | 16.5 | 4.4 | 61.6 | 55.3 | 27.7 | 12.6 | 62.6 | 29.8 | 38.8 | 16.4 |
| W2A16g128 | EfficientQAT | 76.6 | 57.1 | 29.0 | 12.6 | 75.6 | 58.9 | 32.4 | 14.6 | 81.2 | 62.5 | 46.7 | 16.9 |
| W2A16g128 | ZeroQAT | 85.2 | 62.5 | 36.9 | 16.6 | 84.8 | 67.8 | 46.7 | 18.9 | 85.6 | 64.2 | 59.6 | 22.9 |
| W4A4 | SmoothQuant | 56.0 | 55.4 | 7.6 | 5.4 | 58.8 | 50.0 | 12.8 | 6.2 | 57.5 | 52.4 | 13.4 | 7.1 |
| W4A4 | OmniQuant | 59.2 | 60.7 | 22.1 | 6.7 | 61.2 | 48.2 | 24.7 | 11.7 | 59.2 | 50.0 | 28.8 | 13.5 |
| W4A4 | ZeroQAT | 87.8 | 66.1 | 47.8 | 13.3 | 87.9 | 64.3 | 51.1 | 19.3 | 88.2 | 62.1 | 62.4 | 24.3 |
| Method | #Bits | 7B | 13B |
| - | FP | 67.0 | 69.3 |
| QLoRA w/GPTQ | W2A16 | 31.8 | 32.4 |
| QA-LoRA | W2A16 | 34.6 | 37.3 |
| IR-QLoRA | W2A16 | 34.4 | 36.3 |
| PEQA | W2A16 | 35.2 | 34.8 |
| EfficientQAT | W2A16 | 49.1 | 52.1 |
| ZeroQAT | W2A16 | 53.9 | 55.7 |
| SmoothQuant | W4A4 | 37.4 | 41.6 |
| OmniQuant | W4A4 | 52.3 | 54.2 |
| ZeroQAT | W4A4 | 54.8 | 57.4 |
| Method | OPT-1.3B Memory | OPT-1.3B Time | OPT-2.7B Memory | OPT-2.7B Time | OPT-6.7B Memory | OPT-6.7B Time | OPT-13B Memory | OPT-13B Time |
| Quantized Pre-training (avg sequence length = 2048) | | | | | | | | |
| LLM-QAT (bsz=1) | 28.8GB | 1.00s | 58.6GB | 1.64s | 166GB | 5.0s | 337GB | 15.5s |
| OmniQuant (bsz=1) | 6.1GB | 0.92s | 7.4GB | 1.49s | 12.3GB | 2.65s | 16.8GB | 4.77s |
| ZeroQAT (bsz=1) | 3.1GB | 0.58s | 6.1GB | 0.98s | 14.2GB | 1.77s | 26.6GB | 3.12s |
| OmniQuant (bsz=4) | 14.7GB | 2.55s | 16.2GB | 4.03s | 22.5GB | 6.35s | 28.5GB | 11.81s |
| ZeroQAT (bsz=4) | 3.1GB | 1.72s | 6.1GB | 2.76s | 14.2GB | 4.48s | 26.6GB | 7.74s |
| Quantized Fine-tuning (max sequence length = 384) | | | | | | | | |
| EfficientQAT (bsz=1) | 2.1GB | 0.13s | 3.1GB | 0.21s | 4.4GB | 0.36s | 7.3GB | 0.67s |
| ZeroQAT (bsz=1) | 0.8GB | 0.04s | 1.5GB | 0.07s | 3.7GB | 0.18s | 6.8GB | 0.32s |
| EfficientQAT (bsz=16) | 5.9GB | 0.69s | 8.1GB | 1.10s | 11.9GB | 1.70s | 17.2GB | 3.26s |
| ZeroQAT (bsz=16) | 0.8GB | 0.31s | 1.5GB | 0.53s | 3.7GB | 0.94s | 6.8GB | 1.73s |
| Stage | Metrics | OPT-1.3B FP16 | OPT-1.3B ZeroQAT | OPT-2.7B FP16 | OPT-2.7B ZeroQAT | OPT-6.7B FP16 | OPT-6.7B ZeroQAT |
| Fine-tuning | Latency | 11.2s | 7.8s | 19.6s | 12.3s | / | 29.1s |
| Fine-tuning | Weight memory | 2.6GB | 0.9GB | 5.4GB | 1.8GB | 13.4GB | 4.6GB |
| Fine-tuning | Running memory | 3.5GB | 1.2GB | 8.1GB | 2.6GB | OOM | 6.4GB |
| Inference | Token / s | 10.9 | 15.4 | 7.58 | 11.0 | 3.13 | 4.76 |
| Inference | Speed up | 1.0 | 1.41 | 1.0 | 1.45 | 1.0 | 1.52 |
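| Experiment | Hyperparameters | Values |
| Quantized Pre-training | Batch size | 4 |
| Quantized Pre-training | Iteration | 10K |
| Quantized Pre-training | Learning rate | {5e-7, 1e-8} |
| Quantized Pre-training | Lr for smoothing | 5e-6 |
| Quantized Pre-training | Lr for clipping | 1e-5 |
| Quantized Pre-training | Lr schedule | Linear Decay |
| Quantized Pre-training | Perturbation scale in ZO | {1e-3, 5e-4, 1e-4} |
| Quantized Fine-tuning | Batch size | {32, 16} |
| Quantized Fine-tuning | Iteration | 8K |
| Quantized Fine-tuning | Learning rate | {1e-6, 5e-7} |
| Quantized Fine-tuning | Lr schedule | Constant |
| Quantized Fine-tuning | Perturbation scale in ZO | 1e-3 |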
| Learnable Components (PPL) | Llama-7B W4A4 | Llama-7B W2A16 | Llama2-13B W4A4 | Llama2-13B W2A16 |
| Smoothing + Clipping | 12.95 | 29.32 | 10.41 | 16.04 |
| W/O Smoothing | 1.4e3 | 29.61 | 5.2e3 | 15.97 |
| W/O Clipping | 16.64 | 9.4e3 | 18.7 | 2.8e3 |
| W/O Smoothing & Clipping | 2.1e3 | 1.2e4 | 1.7e4 | 4.6e3 |
| Method | Fine-tuning Memory | PTQ memory | SST-2 | CB | SQuAD | DROP |
| FP ZO | 14.2 GB | - | 90.2 | 71.4 | 76.0 | 26.4 |
| OmniQuant (ZO) | 14.2 GB | 4.4 GB | 61.2 | 48.2 | 24.7 | 11.7 |
| OmniQuant (FO) | 98.6 GB | 4.4 GB | 58.7 | 55.3 | 31.8 | 13.5 |
| ZeroQAT | 3.7 GB | - | 87.9 | 64.3 | 51.1 | 19.3 |
| Method (PPL) | Llama2-7B W2A16 | Llama2-7B W4A4 | Llama2-13B W2A16 | Llama2-13B W4A4 |
| ZeroQAT (LW) | 41.05 | 19.34 | 21.97 | 15.45 |
| ZeroQAT | 29.61 | 12.95 | 15.95 | 10.41 |
| Attn_Q | Attn_V | Attn_K | Attn_O | Dense | W2A16g128 Acc. | W2A16g128 Memory | W4A4 Acc. | W4A4 Memory |
| ✓ | ✓ | ✓ | ✓ | ✓ | 55.0 | 100% | 56.8 | 91.7 |
| ✓ | ✓ | ✓ | ✓ | ✗ | 54.1 | 42% | 54.5 | 50% |
| ✓ | ✓ | ✓ | ✗ | ✗ | 54.3 | 34% | 55.4 | 44% |
| ✓ | ✓ | ✗ | ✗ | ✗ | 54.5 | 27% | 55.6 | 38% |
| ✓ | ✗ | ✗ | ✗ | ✗ | 44.3 | 20% | 46.9 | 32% |
| Epochs | LLama1-7B | LLama2-7B | OPT-6.7B |
| 0 | 14.33 | 15.67 | 15.49 |
| 1 | 11.68 | 13.87 | 12.53 |
| 2∗ | 11.10 | 12.95 | 11.48 |
| 10 | 10.86 | 12.38 | 11.12 |
| 20 | 10.20 | 12.08 | 10.95 |
| OPT / PPL | Method | OPT-2.7B WIKI | OPT-2.7B PT | OPT-2.7B C4 | OPT-6.7B WIKI | OPT-6.7B PT | OPT-6.7B C4 | OPT-13B WIKI | OPT-13B PT | OPT-13B C4 |
| FP16 | - | 12.47 | 15.13 | 13.16 | 10.86 | 13.09 | 11.74 | 10.13 | 12.34 | 11.20 |
| W6A6 | SmoothQuant | 12.64 | 15.91 | 13.34 | 11.34 | 13.82 | 12.14 | 10.56 | 12.76 | 11.40 |
| W6A6 | RPTQ | 13.19 | 16.37 | 14.04 | 11.19 | 13.98 | 12.08 | 11.19 | 13.98 | 12.08 |
| W6A6 | RPTQ* | 12.71 | 15.53 | 13.33 | 10.96 | 13.24 | 11.86 | 10.96 | 13.24 | 11.86 |
| W6A6 | OmniQuant | 12.62 | 15.32 | 13.29 | 10.96 | 13.20 | 11.81 | 10.21 | 12.47 | 11.17 |
| W6A6 | ZeroQAT | 12.62 | 15.37 | 13.77 | 10.14 | 13.41 | 11.44 | 9.60 | 12.59 | 11.47 |
| W4A4 | SmoothQuant | 131.47 | 107.10 | 120.57 | 1.8e4 | 1.4e4 | 1.5e4 | 7.4e3 | 6.5e3 | 5.6e3 |
| W4A4 | RPTQ | 11.45 | 14.71 | 13.12 | 12.00 | 15.17 | 12.85 | 12.74 | 15.76 | 14.71 |
| W4A4 | RPTQ* | 11.45 | 14.71 | 13.12 | 17.83 | 25.10 | 19.91 | 16.45 | 23.01 | 16.80 |
| W4A4 | OmniQuant | 15.65 | 23.69 | 16.51 | 12.24 | 15.54 | 13.56 | 11.65 | 15.89 | 13.46 |
| W4A4 | ZeroQAT | 14.42 | 21.71 | 15.14 | 11.48 | 14.84 | 13.10 | 10.65 | 15.04 | 12.62 |
| Llama / Acc | #Bits | Method | PIQA | ARC-e | ARC-c | HellaSwag | Winogrande | Avg. |
| Llama-1-7B | FP16 | - | 77.47 | 72.38 | 41.46 | 73.00 | 67.07 | 65.26 |
| Llama-1-7B | W2A16 | RTN | 47.33 | 28.17 | 25.17 | 25.10 | 47.50 | 34.67 |
| Llama-1-7B | W2A16 | GPTQ | 57.38 | 36.62 | 25.00 | 42.50 | 49.38 | 40.35 |
| Llama-1-7B | W2A16 | EfficientQAT | 62.25 | 48.12 | 27.75 | 47.50 | 53.37 | 47.65 |
| Llama-1-7B | W2A16 | ZeroQAT | 68.25 | 53.87 | 27.62 | 51.62 | 57.38 | 51.75 |
| Llama-1-7B | W4A4 | SmoothQuant | 49.80 | 30.40 | 25.80 | 27.40 | 48.00 | 38.41 |
| Llama-1-7B | W4A4 | LLM-QAT | 51.50 | 32.57 | 28.63 | 31.10 | 51.90 | 41.39 |
| Llama-1-7B | W4A4 | LLM-QAT+SQ | 55.93 | 35.90 | 30.60 | 44.80 | 50.60 | 46.72 |
| Llama-1-7B | W4A4 | OS+ | 62.70 | 39.20 | 32.64 | 47.89 | 52.96 | 49.60 |
| Llama-1-7B | W4A4 | OmniQuant | 67.38 | 53.87 | 30.63 | 53.12 | 55.25 | 52.15 |
| Llama-1-7B | W4A4 | ZeroQAT | 66.98 | 54.12 | 32.19 | 57.85 | 54.37 | 53.11 |
| Llama-1-13B | FP16 | - | 79.10 | 74.83 | 42.04 | 75.62 | 70.31 | 66.33 |
| Llama-1-13B | W2A16 | RTN | 54.75 | 26.25 | 27.50 | 29.75 | 47.00 | 37.05 |
| Llama-1-13B | W2A16 | GPTQ | 59.25 | 33.00 | 25.17 | 44.25 | 53.25 | 42.98 |
| Llama-1-13B | W2A16 | EfficientQAT | 68.15 | 53.08 | 29.51 | 49.26 | 54.35 | 50.87 |
| Llama-1-13B | W2A16 | ZeroQAT | 72.41 | 57.24 | 32.12 | 53.70 | 57.54 | 54.60 |
| Llama-1-13B | W4A4 | SmoothQuant | 61.04 | 38.00 | 26.27 | 41.20 | 50.64 | 43.43 |
| Llama-1-13B | W4A4 | OS+ | 66.73 | 41.43 | 29.33 | 48.67 | 52.80 | 47.79 |
| Llama-1-13B | W4A4 | OmniQuant | 69.69 | 56.22 | 33.10 | 58.96 | 55.80 | 54.75 |
| Llama-1-13B | W4A4 | ZeroQAT | 71.86 | 58.27 | 32.68 | 57.16 | 56.35 | 55.26 |
| Llama-7B (FP: 38.41%) | GPTQ | EfficientQAT | SmoothQuant | OmniQuant | ZeroQAT |
| W2A16 | 23.71% | 24.74% | - | 25.65% | 26.57% |
| W4A4 | - | - | 24.55% | 26.93% | 27.61% |
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- The idea is interesting and very suitable for accurate memory-efficient compression. - Motivation by multiple studies on PTQ and QAT methods. - Including the extra efficient variation for fine-tuning.
1. The runtime comparison is not convincing to me (please see questions for more details). 2. The experiments are missing some relevant baselines (again, see questions for more details). 3. More ablation on the zero'th order gradients are needed.
- Novel use of forward-only finite-difference for QAT, removing the need for backpropagation. - Significantly reduces memory and computation cost, enabling on-device training. - Integrates adaptive outlier smoothing for improved low-bit act. quant. stability. - Simple, practical framework with clear implementation feasibility.
- The comparisons are mostly against outdated PTQ and QAT methods. For a fair and convincing evaluation, the paper should include more recent baselines such as ParetoQ, UPQ, and BitNet (for ternary/binary quantization). It would also be interesting to see whether the proposed method can achieve comparable or superior performance under the same settings as BitNet. - The paper omits recent PTQ methods like BoA (https://arxiv.org/abs/2406.13474) and FlatQuant (https://arxiv.org/abs/2410.09426), bo
The paper is well-written and easy to follow.
1. Limited contributions - As the authors pointed out in line 202, recent works have already combined the zeroth-order optimization with QAT under weight-only quantization. It seems that the authors extend such works to the weight-activation quantization scenario, which seems to be a marginal contribution. If the authors believe that their contribution is meaningful, then they need to clarify 1) the difficulty of extending existing methods to the weight-activation quantization and 2) how they o
1. Clear idea and solid motivation: combining ZO with QAT to avoid STE bias/backprop memory while retaining end-to-end optimization. 2. Practical design: learnable smoothing and adaptive asymmetric clipping; lightweight Q/V-only variant for memory-limited fine-tuning; on-device evidence. 3. Writing is clear.
1. Scale and feasibility: As an efficiency-oriented QAT approach, the paper’s efficiency results suggest that running ZeroQAT on larger models (e.g., ≥30B, potentially even 70B) on a single A100 may be feasible. However, such larger-scale experiments are missing. Please add results or a careful quantitative feasibility analysis. 2. Model recency and difficulty: The benchmarks rely on older model families (Llama-1/2, OPT). It would strengthen the case to include modern, harder-to-quantize models
End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost
Qitao Tan1, Xiaoying Song2, Jin Lu1, Guoming Li1, Jun Liu3, Lingzi Hong2, Caiwen Ding4, Jundong Li5, Xiaoming Zhai1, Shaoyi Huang6, Wei Niu1, Geng Yuan1
1 University of Georgia, 2 University of North Texas, 3 Northeastern University, 4 University of Minnesota, 5 University of Virginia, 6 Stevens Institute of Technology
Abstract
Quantization is an effective technique to reduce the deployment cost of large language models (LLMs), and post-training quantization (PTQ) has been widely studied due to its efficiency. However, existing PTQ methods are limited by their inability to fine-tune model parameters and often suffer significant accuracy loss in low-bit scenarios. Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs, limiting its practicality for LLM deployment. To address these challenges, we propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. ZeroQAT leverages forward-only gradient estimation to eliminate backpropagation, substantially reducing computational and memory overhead while retaining the benefits of end-to-end optimization. We further introduce a lightweight variant of ZeroQAT for quantized fine-tuning, which freezes and pre-quantizes most parameters to further cut memory usage. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory. For example, ZeroQAT enables fine-tuning of a 13B model at extremely low bit-widths (e.g., 2-4 bits) on a single 8GB GPU, and even allows fine-tuning a 6.7B model on a OnePlus 12 smartphone, demonstrating its practicality for end-to-end QAT on resource-limited edge devices.
1 Introduction
Large language models (LLMs) have emerged as essential tools for advancing natural language understanding and generation, driving progress in both research and industrial applications (Yang et al., 2019; Liu et al., 2019; Talmor et al., 2018; Chowdhery et al., 2023; Zheng et al., 2020). Despite their transformative potential, training and deploying these models incur extremely high computational and memory costs. Such requirements not only constrain accessibility and scalability but also limit practicality in resource-constrained environments, including mobile and edge devices, embedded systems, and even enterprise servers with strict hardware or budget limitations (Zeng et al., 2024; Chen et al., 2024a; Tan et al., 2025).
To address these challenges, model compression has been widely studied, with quantization being one of the most effective and indispensable techniques for deployment. Quantization methods are generally divided into post-training quantization (PTQ) and quantization-aware training (QAT). PTQ is simple and widely adopted as it avoids retraining, while QAT usually achieves higher accuracy when resources permit. However, for LLMs the memory demand of QAT is prohibitive (Team et al., 2025). For example, fine-tuning Llama-7B may require hundreds of gigabytes of GPU memory, and larger models often need multi-node clusters, which severely limits practicality. As a result, PTQ dominates in practice, not because it is superior but because it is feasible.
In low-bit scenarios, the capability to adapt to distribution shifts and mitigate performance degradation becomes the key factor that determines whether a quantization method can preserve model quality. This adaptation capability reflects how well the method can handle the distortions introduced by quantization, with stronger adaptation generally leading to more reliable performance. Range-based PTQ (Jacob et al., 2018; Nagel et al., 2019; Xiao et al., 2023), which derives parameters from activation or weight ranges, offers limited adaptation and often loses accuracy. More advanced PTQ methods, such as approximation-based approaches (Nagel et al., 2020; Li et al., 2021; Frantar et al., 2022; Shao et al., 2023), better align with full-precision outputs but are still not end-to-end optimization schemes. As a result, they often introduce two characteristic issues, cumulative errors and objective inconsistency, which hinder accuracy, especially in low-bit settings. These issues are amplified in fine-tuned models, which are highly task-specific and sensitive to quantization perturbations (Dong et al., 2021). Consequently, PTQ often delivers unsatisfactory accuracy in deployment.
QAT provides a principled solution by modeling quantization effects during training, allowing the model to mitigate quantization errors. While QAT shows strong robustness in low-bit regimes (below 8 bits), its prohibitive memory footprint from backpropagation limits applicability to large-scale models. Recent advances in zeroth-order (ZO) optimization, which estimates gradients using only forward passes (e.g., finite differences), significantly reduce memory usage by avoiding storage of activations and optimizer states, offering a promising path for memory-efficient fine-tuning. This naturally raises the question: Can ZO be combined with QAT to achieve high-quality low-bit quantization of LLMs, with memory efficiency comparable to inference?
In this work, we propose ZeroQAT, the first end-to-end QAT framework supporting on-device quantization of both weights and activations at low bit-widths. As shown in Table 1, ZeroQAT reduces the resource burden of conventional QAT while mitigating the accuracy loss commonly seen in PTQ. Unlike prior methods that require massive computing resources (Liu et al., 2023), ZeroQAT updates model parameters using gradients estimated purely from forward passes, reducing memory usage to inference level and making QAT feasible even on edge devices. It further integrates learnable weight clipping and activation transformations, optimized jointly with model parameters via ZO. Moreover, a lightweight variant is devised for further memory reduction. Experiments on both quantized pre-training and fine-tuning show that ZeroQAT consistently outperforms representative PTQ and QAT baselines. For instance, it improves accuracy by 5.1% on average over five zero-shot tasks and 9.1% on four downstream tasks under 2-bit weight-only quantization. More importantly, ZeroQAT overcomes the memory barrier of QAT, enabling training of a 13B LLM on a single 8GB low-end GPU and even fine-tuning a 6.7B model on a OnePlus 12 smartphone. This capability makes end-to-end on-device QAT practical on resource-constrained edge devices.
In summary, our major contributions are as follows:
- We conduct a preliminary study of PTQ and QAT in low-bit pre-training and fine-tuning, revealing their weaknesses and causes of performance degradation.
- We propose ZeroQAT, a novel end-to-end zeroth-order QAT framework that achieves high-quality low-bit quantization with inference-level cost.
- We conduct extensive evaluation across LLM architectures, datasets, and quantization settings, showing consistent accuracy and memory improvements over prior PTQ and QAT baselines.
- We validate on mobile devices, where ZeroQAT can fine-tune OPT-6.7B on a OnePlus 12 smartphone while full-precision ZO fine-tuning is infeasible, demonstrating its practicality for real-world deployment.
2 Background and Related Works
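Quantization. In this work, we mainly study the widely used uniform quantization (Jacob et al., 2018) for its better efficiency. The quantization process can be formulated by:
$$\bar{X} = \mathrm{clamp}\left(\left\lfloor \frac{X}{\Delta} \right\rceil + z,\ 0,\ 2^{N} - 1\right),$$
where $X$ is the floating-point tensor, $\bar{X}$ is the quantized counterpart, $\lfloor \cdot \rceil$ is the rounding operation, $N$ is the target bit number, and $\Delta$ and $z$ denote the step size and zero-point offset value, respectively. For symmetric quantization, the grid is centered on zero and $\Delta$ is determined by the largest absolute value of $X$. For asymmetric quantization, $\Delta$ and $z$ are determined by the minimum and maximum of $X$. In this paper, we focus on the asymmetric quantization scheme for its better accuracy.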
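To make the asymmetric scheme concrete, the following is a minimal Python sketch of uniform fake-quantization over a flat list of values. The function name `quantize_asymmetric`, the per-tensor min/max calibration, and the clamp to the integer range from 0 to 2^N - 1 are illustrative assumptions, not details confirmed by the paper.

```python
def quantize_asymmetric(values, n_bits):
    """Toy asymmetric uniform quantizer (assumed per-tensor min/max calibration)."""
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1                      # largest integer code
    # Step size; fall back to 1.0 for a constant tensor to avoid division by zero.
    delta = (hi - lo) / levels if hi > lo else 1.0
    zero_point = -round(lo / delta)               # integer offset that maps `lo` to code 0
    # Round to the grid, shift by the zero-point, and clamp to the valid code range.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    codes = [min(max(round(v / delta) + zero_point, 0), levels) for v in values]
    # Map integer codes back to floats to see the values a quantized forward pass uses.
    dequantized = [(c - zero_point) * delta for c in codes]
    return codes, dequantized, delta, zero_point


codes, dequantized, delta, zero_point = quantize_asymmetric([-1.0, -0.4, 0.0, 0.7, 2.0], n_bits=2)
```

With 2 bits there are only four grid points, so several distinct inputs collapse onto the same code; this is the distortion that the rest of the paper tries to compensate for.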
Layer-wise calibration. The layer-wise calibration strategy is the most widely adopted approach in approximation-based PTQ, because it is relatively efficient in terms of memory, computation, and data usage. The key idea is to minimize quantization error via reconstruction objectives. For example, the widely used layer-wise reconstruction loss minimizes the squared error relative to the full-precision layer output (Li et al., 2021; Shao et al., 2023). Formally, when both weights and activations are quantized, this can be stated as
$$\min \left\| X_l W_l - \bar{X}_l \bar{W}_l \right\|_2^2,$$
where $\bar{W}_l$ and $\bar{X}_l$ are the quantized versions of the weights and activations, and $l$ indicates the $l$-th layer.
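The layer-wise objective can be illustrated with a small Python sketch that compares a full-precision layer output against the output computed from quantized activations and weights. The helper names and the fixed-step rounding quantizer are hypothetical simplifications chosen for brevity, not the calibration procedure of any cited method.

```python
def quantize(row, step):
    """Round each entry to the nearest multiple of `step` (hypothetical toy quantizer)."""
    return [round(v / step) * step for v in row]


def matmul(a, b):
    """Plain-Python matrix product of a (T x C_in) and b (C_in x C_out)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def reconstruction_loss(x, w, step):
    """Squared error between the full-precision and the quantized layer output."""
    x_q = [quantize(r, step) for r in x]
    w_q = [quantize(r, step) for r in w]
    y, y_q = matmul(x, w), matmul(x_q, w_q)
    return sum((a - b) ** 2 for ra, rb in zip(y, y_q) for a, b in zip(ra, rb))


# Values already on the grid reconstruct exactly; off-grid values leave a residual.
loss_on_grid = reconstruction_loss([[0.5, 1.0]], [[1.0], [0.5]], step=0.5)
loss_off_grid = reconstruction_loss([[0.3, 0.8]], [[1.0], [1.0]], step=1.0)
```

Because this loss is computed per layer, minimizing it says nothing directly about the final task loss, which is the gap the paper later describes as objective inconsistency.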
We present our related works section in Appendix B.
3 Challenge of Existing Quantization Methods
3.1 Challenge of Existing Post Training Quantization Methods
Range-based PTQ. These methods rescale or clip weight and activation ranges to reduce quantization error. They are computationally efficient and perform reasonably well at moderate bit-widths. For example, SmoothQuant (Xiao et al., 2023) achieves a perplexity of 6.20 in W6A6 (i.e., quantization using 6-bit weights and 6-bit activations), close to the full-precision 5.47 (Table 2). However, their limited adaptation to distributional and semantic characteristics leads to severe degradation at low bit-widths. For example, under W4A4, SmoothQuant's perplexity deteriorates to 83.12 versus 5.47 in full precision.
Approximation-based PTQ. These methods narrow the gap between quantized and full-precision outputs via techniques such as learned rounding or reconstruction, adapting to data distributions and model behavior. However, two issues remain and are exacerbated in low-bit quantization settings.
Here, we take a representative approximation-based PTQ method, OmniQuant (Shao et al., 2023), as an example to illustrate the two issues. 1) Cumulative error propagation. To measure error propagation, we compute the relative loss reduction across layers, $(\mathcal{L}_{\text{before}} - \mathcal{L}_{\text{after}}) / \mathcal{L}_{\text{before}}$, where $\mathcal{L}_{\text{before}}$ and $\mathcal{L}_{\text{after}}$ denote the reconstruction loss before and after optimization. As shown in Figure 1, OmniQuant improves shallow layers but the benefits diminish in deeper ones, since each layer is optimized on activations already perturbed by prior quantization noise, making it increasingly difficult to suppress the reconstruction error. This cumulative error propagation constrains overall quantization quality. 2) Objective inconsistency. OmniQuant uses the layer-wise reconstruction loss (see Eq. 1) as its training objective, assuming that lower reconstruction loss is aligned with lower perplexity and better downstream accuracy. However, as shown in Figure 2, this alignment does not always hold: in several training stages (highlighted in red), reconstruction loss decreases while perplexity fluctuates. This indicates that local layer-level improvements do not reliably translate into global task-level gains, making reconstruction loss a suboptimal proxy for end-to-end performance, especially under low-bit quantization.
Failure on fine-tuned model. When PTQ is applied to fine-tuned LLMs, it often fails to preserve task accuracy under low-bit settings. As shown in Table 2, SmoothQuant maintains moderate accuracy at W6A6 (57.2% vs. 66.0% in FP16) but drops to 32.9% at W4A4. Similarly, OmniQuant achieves 63.9% at W6A6, close to FP16, yet falls to 38.8% at W4A4 despite optimization-based techniques. These results indicate that while PTQ remains viable at moderate bit-widths, its effectiveness collapses under aggressive compression, in some cases nearly destroying task performance.
3.2 Challenge of Existing Quantization-aware Training Methods
Compared with PTQ, QAT offers stronger adaptation by compensating for quantization errors during training. However, its computational and memory costs are prohibitive for LLMs (Liu et al., 2023). To reduce this overhead, later works combine QAT with parameter-efficient methods such as LoRA (Dettmers et al., 2023; Xu et al., 2023; Li et al., 2023) or update only quantizer parameters (Chen et al., 2024b), achieving competitive results in weight-only quantization. Yet their effectiveness drops in low-bit joint weight-activation settings. As shown in Table 2, EfficientQAT maintains reasonable perplexity at W6A6 (5.60) and W2A16 (33.40), but degrades sharply at W4A4 (76.32), highlighting the difficulty of modeling dynamic activations.
Overall, although QAT methods can surpass PTQ in some settings, they have not consistently delivered strong results for both weight and activation quantization at aggressive bit-widths under realistic resource constraints. Recent efforts that combine zeroth-order (ZO) optimization with quantization primarily target weight-only scenarios (Zhou et al., 2025; Shang et al., 2025), thus leaving the challenges of low-bit activation quantization unresolved. Motivated by this gap, we develop a ZO-based QAT framework that, to the best of our knowledge, is the first to maintain superior accuracy in both low-bit weight and activation settings.
4 ZeroQAT
In this section, we present ZeroQAT, which enables adaptive fine-tuning of both model and quantization parameters with low memory requirements. We employ zeroth-order stochastic gradient descent to estimate gradients solely from quantized model inference, and introduce adaptive smoothing and weight quantization strategies to improve low-bit performance. Unlike prior works that rely on hand-crafted or layer-wise local objectives, ZeroQAT jointly optimizes model and quantization parameters in an end-to-end manner, yielding superior accuracy. In addition, we propose a lightweight variant to further cut memory cost during quantized fine-tuning.
4.1 Quantization-aware Zeroth-order Optimization
Unlike conventional first-order optimization that computes gradients via backpropagation, zeroth-order (ZO) optimization estimates them using only function queries through finite differences (Chen et al., 2023; Liu et al., 2018; Ye et al., 2018). This avoids storing activations, backward gradients, and optimizer states, greatly reducing memory costs in LLM fine-tuning. For each random direction, ZO requires only two forward passes to approximate the gradient, given a mini-batch $\mathcal{B}$:
$$\hat{\nabla} \mathcal{L}(Q(\theta); \mathcal{B}) = \frac{1}{q} \sum_{i=1}^{q} \frac{\mathcal{L}(Q(\theta + \epsilon u_i); \mathcal{B}) - \mathcal{L}(Q(\theta - \epsilon u_i); \mathcal{B})}{2\epsilon}\, u_i,$$
where $Q$ is the quantizer, $Q(\theta)$ is the quantized parameters, $u_i$ is a random perturbation, $q$ is the number of directions, and $\epsilon$ is a small scalar.
Following QAT practice, we maintain full-precision weights while using their quantized counterparts in forward passes. Unlike FO-QAT, ZeroQAT does not require the straight-through estimator (STE) (Bengio et al., 2013), since gradients are estimated directly via zeroth-order finite differences, bypassing the non-differentiability of the quantizer. Given a learning rate $\eta$ and a mini-batch $\mathcal{B}_t$ at iteration $t$, the update rule becomes:
$$\theta_{t+1} = \theta_t - \eta\, \hat{\nabla} \mathcal{L}(Q(\theta_t); \mathcal{B}_t),$$
where $\hat{\nabla} \mathcal{L}$ is the forward-only gradient estimate defined above.
In ZeroQAT, the ZO estimator remains unbiased with respect to the gradient of a smoothed quantized objective, which ensures standard convergence guarantees. In contrast, QAT methods based on the STE rely on a hand-crafted surrogate gradient that introduces inherent bias. This bias becomes particularly severe in low-bit regimes, where the true smoothed gradients are already small but STE still produces large surrogate updates, leading to unstable or suboptimal convergence. A formal analysis and quantitative bounds on this bias are provided in Appendix G.
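The forward-only update described in this subsection can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the round-to-nearest quantizer `uniform_quantize`, the toy quadratic loss, and all hyperparameter values are assumptions chosen only to make the example self-contained. The sketch keeps full-precision weights, perturbs them along Gaussian directions, evaluates the loss only on quantized weights, and averages the two-point finite differences.

```python
import numpy as np

def uniform_quantize(w, step=0.05):
    """Round-to-nearest uniform quantizer (illustrative stand-in for the quantizer Q)."""
    return np.round(w / step) * step

def zo_gradient(loss_fn, theta, eps, q, rng):
    """Two-point zeroth-order gradient estimate averaged over q random directions."""
    grad = np.zeros_like(theta)
    for _ in range(q):
        u = rng.standard_normal(theta.shape)
        # Two forward passes through the quantized model; no backpropagation.
        loss_plus = loss_fn(uniform_quantize(theta + eps * u))
        loss_minus = loss_fn(uniform_quantize(theta - eps * u))
        grad += (loss_plus - loss_minus) / (2.0 * eps) * u
    return grad / q

def zo_sgd_step(loss_fn, theta, lr, eps, q, rng):
    """One SGD update of the full-precision weights using the ZO estimate."""
    return theta - lr * zo_gradient(loss_fn, theta, eps, q, rng)

# Toy problem (assumption): fit quantized weights to a fixed target vector.
rng = np.random.default_rng(0)
target = np.array([0.5, -0.25, 1.0, 0.0])

def loss_fn(w):
    return float(np.sum((w - target) ** 2))

theta = np.zeros(4)
initial_loss = loss_fn(uniform_quantize(theta))
for _ in range(300):
    theta = zo_sgd_step(loss_fn, theta, lr=0.05, eps=0.1, q=8, rng=rng)
final_loss = loss_fn(uniform_quantize(theta))
print(initial_loss, final_loss)
```

Note that the perturbation scale is chosen larger than the quantization step here, so the two forward passes usually land in different quantization cells and the finite difference carries signal.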
4.2 Adaptive Outlier Smoothing and Weight Quantizer
Adaptive outlier smoothing. Extreme activation outliers in specific channels expand the dynamic range and degrade quantization precision for normal activation values. Previous methods (Xiao et al., 2023; Wei et al., 2022; Shao et al., 2023) therefore migrate the difficulty of activation quantization to weight quantization through a mathematically equivalent smoothing transformation, since weights are generally more uniform and thus easier to quantize. However, relying on either hand-crafted smoothing parameters or layer-wise calibrated smoothing often results in suboptimal performance, due to the lack of end-to-end joint optimization.
In contrast, our QAT framework enables end-to-end joint optimization of smoothing parameters along with model parameters, thereby improving consistency and reducing quantization error. Inspired by previous works such as SmoothQuant (Xiao et al., 2023) and Outlier Suppression+ (Wei et al., 2022), which statically manipulate activation distributions via channel-wise scaling and shifting, we adapt these techniques into a jointly optimized framework to dynamically mitigate activation outliers during training, providing an effective solution for the outlier issue. Specifically, we represent the computation of a linear layer as:
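$$Y = XW + B = \underbrace{[(X - \delta) \oslash s]}_{\tilde{X}} \cdot \underbrace{[s \odot W]}_{\tilde{W}} + \underbrace{[B + \delta W]}_{\tilde{B}}$$
where $X \in \mathbb{R}^{T \times C_{in}}$, $T$ is the sequence length, $W \in \mathbb{R}^{C_{in} \times C_{out}}$ is the weight matrix and $B \in \mathbb{R}^{1 \times C_{out}}$ is the bias. Here, $s$ and $\delta$ are learnable channel-wise scaling and shifting parameters, jointly optimized during training; $\tilde{X}$, $\tilde{W}$ and $\tilde{B}$ represent the smoothed activation, weight and bias, respectively; and $\oslash$ and $\odot$ are element-wise division and multiplication.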
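As a sanity check on the claim that smoothing is mathematically equivalent, the sketch below assumes the channel-wise scale-and-shift form used by Outlier Suppression+ and OmniQuant-style smoothing: smoothed activation (X - delta) / s, smoothed weight s * W (applied per input channel), and smoothed bias B + delta W. The values of `s` and `delta` are hand-picked statistics for illustration, not learned parameters, and the tensor sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
T, c_in, c_out = 4, 8, 3                      # toy sizes (assumption)
X = rng.standard_normal((T, c_in))
X[:, 2] = X[:, 2] * 50.0 + 10.0               # inject one outlier channel
W = rng.standard_normal((c_in, c_out))
B = rng.standard_normal((1, c_out))

# Illustrative (not learned) channel-wise shift and scale.
delta = X.mean(axis=0, keepdims=True)                 # shape (1, c_in)
s = np.abs(X - delta).max(axis=0, keepdims=True)      # shape (1, c_in)

X_s = (X - delta) / s        # smoothed activation
W_s = s.T * W                # each input channel of W scaled by s
B_s = B + delta @ W          # shift folded into the bias

Y_ref = X @ W + B
Y_smooth = X_s @ W_s + B_s
print(np.abs(Y_ref - Y_smooth).max())   # difference is at floating-point noise level
print(np.abs(X_s).max())                # smoothed activations are bounded by 1
```

With this particular choice of `s`, every activation channel is rescaled into [-1, 1], while the outlier magnitude moves into the corresponding rows of the weight matrix.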
Adaptive weight quantizer. As demonstrated by previous work, some weights play a significant role in model performance, so naive uniform quantization can cause significant degradation. Similar to previous QAT methods that adopt learnable step size and zero-point parameters (Esser et al., 2019; Bhalgat et al., 2020), we quantize weights with a learnable step size and offset. However, due to the activation-weight smoothing introduced in our framework, the weight distributions in some channels become skewed, resembling the activation distributions and deviating from the typically assumed uniformity. Therefore, we jointly learn clipping thresholds to adaptively determine the optimal clipping range for weights.
Specifically, considering asymmetric quantization, the quantization of weights is formulated as
[TABLE]
where and are learnable step size and zero-point, respectively, initialized based on the default asymmetric quantization scheme. and are learnable clipping coefficients (with ), and denotes the maximum positive quantization level. Intuitively, for weights with near-uniform distributions after smoothing, and converge to similar values, resulting in a tight clipping range that preserves precision. In contrast, for biased weight distributions, and adapt to asymmetrically clip the dynamic range, thereby mitigating the impact of outliers.
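To make the role of the clipping coefficients concrete, the following sketch shows one common way to combine asymmetric quantization with clipping. It is an assumption in the style of OmniQuant's learnable weight clipping, not necessarily the exact formulation used here. The function name `fake_quantize` and the coefficients `gamma` and `beta` (scaling the maximum and minimum of the weights) are hypothetical, and the coefficients are fixed by hand rather than learned.

```python
import numpy as np

def fake_quantize(w, n_bits, gamma=1.0, beta=1.0):
    """Asymmetric quantize-dequantize with clipping coefficients (hypothetical form).

    gamma and beta shrink the upper and lower ends of the range before the
    step size and zero-point are derived from it.
    """
    q_max = 2 ** n_bits - 1                 # maximum quantization level
    w_max = gamma * w.max()
    w_min = beta * w.min()
    step = (w_max - w_min) / q_max          # default asymmetric initialization
    zero_point = np.round(-w_min / step)
    codes = np.clip(np.round(w / step) + zero_point, 0, q_max)
    return codes, (codes - zero_point) * step

# A skewed row of weights: a uniform bulk in [-1, 1] plus a single large outlier.
bulk = np.linspace(-1.0, 1.0, 101)
w = np.concatenate([bulk, [8.0]])

codes_full, deq_full = fake_quantize(w, n_bits=4)               # no clipping
codes_clip, deq_clip = fake_quantize(w, n_bits=4, gamma=0.25)   # clip the upper range

err_full = np.abs(deq_full[:-1] - bulk).mean()   # error on the bulk only
err_clip = np.abs(deq_clip[:-1] - bulk).mean()
print(err_full, err_clip)
```

Clipping the upper end sacrifices the single outlier but gives the bulk of the weights a finer grid, which is the trade-off the learnable coefficients are meant to balance.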
4.3 Lightweight ZeroQAT for memory reduction in Quantized Fine-tuning
We further propose a lightweight variant of ZeroQAT designed specifically for quantized fine-tuning, to substantially reduce the fine-tuning memory footprint. It is worth noting that this strategy is effective only in fine-tuning; applying it to quantized pre-training leads to noticeable performance degradation (see Appendix D.3).
Unlike backpropagation-based methods, where memory is dominated by weights, activations, and optimizer states, ZeroQAT’s cost mainly comes from the parameters actively updated during fine-tuning. Pre-quantizing the entire model could further reduce memory, but this fails in practice: small ZO perturbations are rounded away while large ones destabilize training, making naive full-model pre-quantization unsuitable.
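The rounding problem with full-model pre-quantization can be reproduced in a few lines. The sketch below is illustrative only: the quantization step, the perturbation scales, and the weight distribution are assumptions. It stores weights only on the quantization grid, applies a ZO-style perturbation, re-quantizes, and measures the fraction of weights whose integer code actually changes.

```python
import numpy as np

def to_codes(w, step):
    """Integer codes of a round-to-nearest uniform quantizer."""
    return np.round(w / step).astype(np.int64)

rng = np.random.default_rng(0)
step = 0.1                                             # quantization step (assumption)
codes = to_codes(rng.standard_normal(10_000), step)
w_q = codes * step                                     # weights exist only on the grid
u = rng.standard_normal(w_q.shape)                     # one ZO perturbation direction

def fraction_changed(eps):
    """Fraction of weights whose code differs after perturbing and re-quantizing."""
    return float(np.mean(to_codes(w_q + eps * u, step) != codes))

small = fraction_changed(1e-3)   # perturbation far below half a step
large = fraction_changed(1e-1)   # perturbation comparable to the step
print(small, large)
```

A perturbation far below half a step leaves every code unchanged, so both forward passes return the same loss and the finite difference is exactly zero. A perturbation comparable to the step flips a large share of the codes at once, which corresponds to the destabilizing regime.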
To overcome this, we introduce a lightweight variant. Most parameters are frozen and pre-quantized, while only the query (Q) and value (V) matrices of attention layers are kept in full precision, as illustrated in Figure 3. Thus, memory use comes from the full-precision Q and V plus quantized frozen weights. This design substantially reduces the fine-tuning footprint while retaining sufficient trainable capacity for adaptation. This enables fine-tuning large models such as OPT-13B under low-bit settings with memory as low as 6.8 GB (in Table 7), far lighter than existing QAT baselines.
5 Experiment
We present a comprehensive evaluation of ZeroQAT, reporting results on both quantized pre-training and quantized fine-tuning (Sections 5.1 and 5.2), followed by ablation studies to assess the contributions of different design choices (Section 5.3). We then provide an efficiency analysis covering memory and speed (Section 5.4). Hyperparameter settings are detailed in Appendix C.1. GPU-end experiments are conducted on an NVIDIA A100, and device-end experiments are conducted on a OnePlus 12 smartphone with a Snapdragon 8 Gen 3 SoC and 16GB RAM. All results are averaged over three runs.
5.1 ZeroQAT for Quantized Pre-training
Training and evaluation. For the parameters of smoothing and weight clipping, we leverage reconstruction loss for a lightweight initialization, and then jointly train with the model via ZO. For LLama-series weight-only quantization, we retain only weight clipping. Pre-training uses mixed segments from WikiText2 and C4, with perplexity measured on three pretraining context datasets. We further evaluate zero-shot accuracy on five datasets under GPTQ settings with lm-eval-harness. More details including baselines are provided in Appendix C.2.
Perplexity Results. We first examine the intrinsic language modeling performance of the quantized model. The perplexity results of LLama-series and OPT-series models are presented in Table 3 and Table E.1, respectively. Under the relatively easy W6A6 setting, the baselines and our method achieve similar, almost lossless performance compared with full precision, with an absolute perplexity gap smaller than one. More importantly, under the harder W2A16(g128) and W4A4 settings, ZeroQAT consistently outperforms the baseline methods, yielding lower perplexity across both model families and datasets, because fine-tuning the whole model gives it better adaptation capability. This highlights the effectiveness of ZeroQAT in preserving model quality under aggressive quantization.
Zero-shot Accuracy Results. Table 4 reports the zero-shot accuracy of LLama-7B on five downstream datasets. As expected, the FP16 setting achieves the highest average accuracy, serving as the upper bound. Under both the W2A16 and W4A4 configurations, ZeroQAT consistently outperforms other quantization approaches, yielding higher average accuracy across both model scales (LLama-13B results are reported in Table E.2); for instance, it improves accuracy by 5.1% in 2-bit weight-only quantization. This result demonstrates that ZeroQAT maintains strong task generalization even when quantization is pushed to low-bit precision.
5.2 ZeroQAT for Quantized Fine-tuning
Training and Evaluation. Following prior work, we fine-tune models on a small subset of Alpaca and evaluate across multiple benchmarks, including commonsense reasoning, classification, and question answering tasks. We adopt a few-shot fine-tuning protocol with fixed quantization parameters and report averaged results over three runs. Full experimental details and baselines are provided in Appendix C.3.
Results. We evaluate quantized fine-tuning on OPT models (2.7B, 6.7B, and 13B) across two classification tasks (SST-2, CB) and two QA generation tasks (SQuAD, DROP). For PTQ methods such as SmoothQuant and OmniQuant, we first fine-tune the models in full precision using ZO to ensure comparable starting points, and then apply the corresponding quantization method. In contrast, QAT methods, including ZeroQAT, directly produce quantized models during fine-tuning without the need for a separate PTQ stage.
The results are summarized in Table 5. Fine-tuning adapts model parameters to narrow task-specific optima (Dong et al., 2021), which increases their sensitivity to quantization noise. Consequently, less adaptive PTQ methods suffer from severe degradation in low-bit settings. By comparison, ZeroQAT consistently delivers higher accuracy across all tasks and model scales, in some cases approaching FP16 performance. For example, under the W4A4 setting, ZeroQAT achieves about 88% accuracy on SST-2 across the three OPT models, whereas baseline methods remain around 60%. We also fine-tuned LLama-1 models on Alpaca, with results shown in Table 6. ZeroQAT again outperforms prior methods across different bit-widths and model sizes. For instance, when quantizing LLama-7B and LLama-13B weights to 2 bits, ZeroQAT achieves absolute accuracy improvements of 4.8% and 3.6% over the best baseline EfficientQAT, illustrating the effectiveness of our approach.
5.3 Ablation Study
In this section, we conduct an ablation study to examine the effectiveness of the strategies adopted in our method. More experiments are shown in Appendix D.
Effect of initialization for Smoothing Parameters. We initialize the smoothing parameters by minimizing reconstruction loss before applying ZO. To examine the impact of initialization quality, we conduct an ablation study by varying the number of initialization epochs, as reported in Table D.6. The results show that initialization has a clear effect on performance. With 0 epochs of initialization, performance drops noticeably across different models, while additional epochs (e.g., 20) can further improve accuracy. However, considering both performance gains and computational cost, we adopt two epochs as the default initialization setting.
5.4 Efficiency of ZeroQAT
ZeroQAT produces a quantized and fine-tuned model in a single lightweight end-to-end pipeline. To demonstrate its practicality across deployment scenarios, we evaluate its efficiency on both a GPU server and a mobile device.
Server-side Efficiency. Table 7 compares memory requirements and wall-clock time per update across QAT and PTQ methods. For quantized pre-training, ZeroQAT reduces memory usage by 89-92% relative to the costly LLM-QAT, while also accelerating training. Compared to the PTQ method OmniQuant, ZeroQAT offers clear advantages: for instance, it halves memory use (OPT-1.3B: 6.1GB to 3.1GB) and achieves about 1.5× faster updates (OPT-2.7B: 1.49s to 0.98s). For quantized fine-tuning, ZeroQAT’s memory-efficient design requires storing only weights, making usage independent of batch size. Against EfficientQAT, it consistently saves memory and improves throughput, especially on smaller models such as OPT-1.3B, reducing memory by 86% (5.9GB to 0.8GB) and wall-clock time by 55% (0.69s to 0.31s) with the same batch size.
On-device Efficiency. Table 8 compares the FP16 baseline with ZeroQAT under W4A4 for OPT-1.3B, 2.7B, and 6.7B models. The results were collected on a OnePlus 12 smartphone with a Snapdragon 8 Gen 3 SoC and 16GB RAM. ZeroQAT reduces fine-tuning latency by 30% and 37% for OPT-1.3B and OPT-2.7B, respectively, while cutting running memory from 3.5GB to 1.2GB and from 8.1GB to 2.6GB. For OPT-6.7B, FP16 fine-tuning is infeasible (OOM), whereas ZeroQAT runs within 6.4GB memory with 29.1s latency. During inference, ZeroQAT further achieves 1.41-1.52× higher token throughput, demonstrating its practicality on resource-constrained devices.
6 Conclusion
In this paper, we proposed ZeroQAT, a zeroth-order-based quantization-aware training framework supporting both weight and activation quantization under extremely low bit-widths. We further introduced adaptive smoothing and an adaptive weight quantizer to reduce errors, and a lightweight variant that freezes and quantizes part of the model to lower fine-tuning memory cost. Experiments on quantized pre-training, fine-tuning, and on-device deployment show that ZeroQAT consistently outperforms PTQ and QAT baselines in both accuracy and efficiency, and even enables fine-tuning large LLMs on a OnePlus 12 smartphone under strict memory constraints.
Appendix A Claim of LLM Usage
In this work, large language models (LLMs) were used solely as a general-purpose writing assistant. Their role was limited to correcting grammar, fixing typographical errors, and polishing the language for clarity and readability.
Appendix B Related work
B.1 Model Quantization
Quantization techniques aim to properly map the original continuous real values to a discrete low-bit format (e.g., INT8 or INT4), leading to significant memory saving and inference acceleration while maintaining the performance (Zhou et al., 2016). Quantization techniques can be generally divided into two categories: Post-training quantization (PTQ) and quantization-aware training (QAT). The QAT method generally yields better results due to better adaptation capability, but the high retraining cost (in both memory and computation) has discouraged many researchers. Therefore, most of the LLM quantization works focus on PTQ methods, which can be mainly divided into range-based PTQ (Jacob et al., 2018; Nagel et al., 2019; Xiao et al., 2023) and approximation-based PTQ methods (Nagel et al., 2020; Li et al., 2021; Frantar et al., 2022; Shao et al., 2023). The range-based PTQ typically relies on static analysis, where the range (e.g., minimum and maximum values) of weights or activations is collected to determine quantization parameters. The approximation-based PTQ methods, with more adaptation, explicitly frame quantization as an error minimization problem, optimizing quantized parameters to closely approximate the full-precision model outputs.
B.2 Zeroth-order Optimization
Zeroth-order optimization (ZO), which estimates gradients using only function evaluations, has emerged as an attractive alternative to classical first-order (FO) methods. Compared to FO approaches, ZO eliminates the need for backpropagation, thereby simplifying implementation and significantly reducing memory consumption. This makes it appealing in scenarios such as adversarial attack and defense (Chen et al., 2017; Ye et al., 2018; Verma et al., 2023), machine learning explainability (Dhurandhar et al., 2018; 2019), reinforcement learning (Vemula et al., 2019), and on-chip training (Gu et al., 2021). Despite these successes, ZO optimization has been primarily applied to relatively small-scale problems, since its convergence is generally slower and suffers from high variance due to random search. These challenges are exacerbated in large-scale settings such as LLM fine-tuning, where dimensionality and resource constraints amplify the difficulty. To achieve further acceleration and compression, some works combine ZO with quantization (Zhou et al., 2025; Shang et al., 2025), while our method is the first to overcome the accuracy degradation in both low-bit weight and activation quantization scenarios.
Appendix C Experimental Settings
Quantization settings. To comprehensively evaluate our method, we consider both weight-only and weight-activation quantization, as they represent distinct deployment scenarios. For weight-activation quantization, we adopt per-channel weight quantization and per-token activation quantization, following prior work (Dettmers et al., 2022; Shao et al., 2023). For weight-only quantization, we apply a group-wise strategy, where the weight matrix is partitioned into groups of a fixed size, and each group is assigned its own scale and zero point. For example, W2A16g128 refers to 2-bit weight-only quantization with a group size of 128. When the group-size suffix is omitted (e.g., W2A16), the default group size is set to the number of channels, corresponding to per-channel quantization.
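The group-wise scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `groupwise_fake_quantize` is hypothetical, groups are taken along the input dimension of each output row, and min/max asymmetric quantization is used without any learned parameters. Setting the group size to the full row length recovers per-channel quantization in this layout.

```python
import numpy as np

def groupwise_fake_quantize(w, n_bits, group_size):
    """Quantize-dequantize w, giving each group its own scale and zero point."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0, "row length must be divisible by group size"
    q_max = 2 ** n_bits - 1
    groups = w.reshape(out_features, in_features // group_size, group_size)
    g_min = groups.min(axis=-1, keepdims=True)
    g_max = groups.max(axis=-1, keepdims=True)
    scale = np.maximum(g_max - g_min, 1e-8) / q_max     # one scale per group
    zero_point = np.round(-g_min / scale)               # one zero point per group
    codes = np.clip(np.round(groups / scale) + zero_point, 0, q_max)
    return ((codes - zero_point) * scale).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 512))
w[:, 0] *= 20.0                                          # one outlier column

w_g128 = groupwise_fake_quantize(w, n_bits=2, group_size=128)   # W2A16g128-style
w_row = groupwise_fake_quantize(w, n_bits=2, group_size=512)    # one group per row

err_g128 = np.abs(w_g128 - w).mean()
err_row = np.abs(w_row - w).mean()
print(err_g128, err_row)
```

Because an outlier only widens the range of the group that contains it, smaller groups confine its effect, at the cost of storing more scales and zero points.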
C.1 Hyperparameter Setting
We use the hyperparameters in Table C.1 for experiments on quantized pre-training and quantized fine-tuning. Specifically, pre-training prefers a smaller learning rate and smaller perturbation for stable convergence, while fine-tuning tolerates more aggressive optimization. Moreover, larger models prefer a smaller learning rate and smaller perturbation, while smaller models tend to prefer the opposite.
C.2 Settings of Quantized Pre-training
Training and evaluation. Zeroth-order optimization has been shown to benefit from strong initialization (Malladi et al., 2023). To provide a stable starting point, we adopt a lightweight initialization strategy based on channel-wise scaling and shifting. Specifically, we pre-train quantized models with OmniQuant (Shao et al., 2023) for a few epochs (2 epochs in the W4A4 setting and 4 epochs in the W2A16 setting), which corresponds to roughly 10% of the full OmniQuant training cost. This initialization enables ZO to more effectively refine the quantization scales and shift factors. For LLama-series weight-only quantization, however, we remove the smoothing scalar and only maintain weight clipping, as smoothing provides only limited improvement. For quantized pre-training, we randomly select token segments of length 2048 and then calculate perplexity over WikiText2 (Merity et al., 2016), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020). To avoid overfitting on one specific dataset, half of the segments are sampled from WikiText2 and half from C4, while the total number of samples is kept the same as in previous work (Shao et al., 2023; Dettmers et al., 2022) and set to 128. We further assess zero-shot accuracy on a range of tasks including PIQA (Bisk et al., 2020), ARC (Clark et al., 2018), HellaSwag (Zellers et al., 2019), and Winogrande (Sakaguchi et al., 2021). We adhere to the GPTQ (Frantar et al., 2022) settings for language generation experiments, and leverage the lm-eval-harness (Gao et al., 2024) tool for the evaluation of all zero-shot tasks.
Baselines. We mainly compare with post-training quantization methods. For weight-only quantization, we compare with vanilla round-to-nearest (RTN) and GPTQ (Frantar et al., 2022). For weight-activation quantization, we compare our method with SmoothQuant (Xiao et al., 2023), RPTQ (Yuan et al., 2023), Outlier Suppression+ (OS+) (Wei et al., 2022), OmniQuant (Shao et al., 2023), and one QAT method, LLM-QAT (Liu et al., 2023). We keep the quantization setting of SmoothQuant and Outlier Suppression+ with per-channel weight quantization and per-token activation quantization for fair comparisons.
C.3 Settings of Quantized fine-tuning
Training and evaluation. Following existing works (Chen et al., 2024b), we fine-tune models on a small subset of Alpaca dataset (Taori et al., 2023), and report the average accuracy on datasets including PIQA, ARC, HellaSwag and Winogrande. Moreover, we fine-tune and evaluate on two classification datasets, SST-2 (Socher et al., 2013) and CB (De Marneffe et al., 2019), and two question answering datasets, SQuAD (Rajpurkar et al., 2016) and DROP (Dua et al., 2019). For these tasks, we randomly sample 1,000 examples for training, 500 for validation, and 1,000 for testing, following the common few-shot fine-tuning protocol (Malladi et al., 2023). Performance is measured using accuracy for classification tasks and F1 scores for question answering tasks. The initialization of quantization parameters is identical to that used in quantized pre-training, and these parameters are frozen during fine-tuning. This design allows us to directly perform quantized fine-tuning without an additional quantized pre-training stage. For all fine-tuning experiments, we run our experiments three times with different seeds and report the averaged results.
Baselines. Besides the baseline methods used in quantized pre-training (in Section 5.1), we additionally compare our method with several leading QAT methods, including QLoRA (Dettmers et al., 2023), QA-LoRA (Xu et al., 2023), PEQA (Kim et al., 2023), IR-QLoRA (Qin et al., 2024), and EfficientQAT (Chen et al., 2024b).
Appendix D More ablation study on ZeroQAT
In this section, we conduct a comprehensive ablation study on ZeroQAT to illustrate the effectiveness of the components and strategies we use. Specifically, the results include:
- Effect of learnable outlier smoothing and weight clipping (Table D.1).
- Effect of using a first-order fine-tuned checkpoint as PTQ’s starting point (Table D.2).
- Effect of using lightweight ZeroQAT for quantized pre-training (Table D.3).
- Effect of the layer selection in lightweight ZeroQAT (Table D.4).
- Effect of quantization parameter initialization and number of training samples (Table D.6 and Table D.6).
D.1 Effect of learnable outlier smoothing and weight clipping
In ZeroQAT, we introduce a learnable smoothing scalar and a weight clipping threshold to effectively relieve the outlier issue in low-bit quantization. We conduct experiments to ablate the effectiveness of these two learnable components. As shown in Table D.1, both components positively influence performance, but learnable smoothing proves essential for weight-activation quantization. Disabling it for W4A4 results in a marked increase in perplexity, mainly due to challenges with activation quantization outliers. For weight-only quantization, smoothing offers only a slight improvement because fewer outliers occur (Shao et al., 2023); therefore, smoothing is not used for weight-only quantization.
D.2 Effect of using first-order fine-tuned model for PTQ
When comparing our method with PTQ methods, the starting point is the full-precision model fine-tuned using ZO. We therefore investigate whether PTQ methods perform better when starting from a model fine-tuned by first-order (FO) optimization. As shown in Table D.2, using a first-order fine-tuned model as the starting point dramatically increases the fine-tuning memory cost to around 100 GB, yet does not enhance the performance of PTQ, which yields much lower accuracy than FP ZO and ZeroQAT.
D.3 Effect of using lightweight ZeroQAT for quantized pre-training
In ZeroQAT fine-tuning, we devise a lightweight variant that keeps the query and value matrices in full precision while freezing and quantizing the remaining parameters. This design substantially reduces memory cost without sacrificing downstream task accuracy. However, when we apply the same strategy in quantized pre-training, we observe a clear performance drop, as shown in Table D.3. For example, on WikiText2, lightweight ZeroQAT yields perplexity of 41.05 and 21.97 for LLama2-7B and Llama2-13B under W2A16, compared to 29.61 and 15.95 without lightweight strategy.
This degradation can be attributed to the different optimization dynamics in pre-training versus fine-tuning. Pre-training requires updating a much larger parameter space to capture broad linguistic patterns. Freezing most of the model limits the ability to adapt quantization parameters and compensate for quantization noise, leading to accumulated errors and higher perplexity. In contrast, fine-tuning operates on narrower task-specific distributions, where updating Q and V alone is sufficient to preserve performance. These results highlight that while selective fine-tuning is effective for downstream adaptation, full-parameter optimization remains crucial in the pre-training stage under quantization.
D.4 Effect of fine-tuning layer selection.
We propose a lightweight variant of ZeroQAT that fine-tunes only the query (Q) and value (V) matrices in the attention layers, while freezing and quantizing the remaining parts of the model to reduce memory overhead. To evaluate the effectiveness of this strategy, we compare it with different layer selection strategies, and the results are reported in Table D.4. The results show that this selective fine-tuning approach achieves a favorable trade-off between performance and memory efficiency: it maintains accuracy comparable to full-parameter fine-tuning, while reducing memory usage to 27%-38% of the full-parameter baseline, depending on the model size. This demonstrates that restricting updates to Q and V matrices provides substantial efficiency gains without significant loss of performance.
D.5 Effect of training sample size
Conventional first-order QAT methods are generally data-inefficient, as they rely on large training datasets to provide stable and accurate gradients. To examine whether ZeroQAT exhibits similar behavior, we vary the number of training samples and report the results in Table D.6. Compared to the default setting of 128 samples, changing the sample size has only a minor effect on performance, with most perplexity variations remaining within 0.5. This indicates that, unlike conventional methods, ZeroQAT does not heavily rely on large-scale data for convergence. Instead, since its gradients are estimated through noisy zeroth-order approximations, ZeroQAT benefits more from additional optimization iterations rather than larger datasets.
Appendix E More Quantized Pre-training Results
To illustrate the generalizability of our method, we conduct quantized pre-training on OPT family models, and the results are shown in Table E.1. For W6A6 quantization, similar to other baselines, ZeroQAT achieves almost lossless results on three datasets. For the more challenging W4A4 setting, ZeroQAT consistently outperforms other baselines owing to its better adaptation.
We also conduct experiments on LLama with 13B parameters; results on five zero-shot datasets are shown in Table E.2.
Appendix F Evaluation on MMLU
To demonstrate the generalizability of ZeroQAT in more realistic and challenging scenarios, we fine-tune on the Alpaca dataset (Taori et al., 2023) and then evaluate on MMLU. We conduct experiments based on Llama1-7B; the results are shown in Table F.1.
Appendix G Theoretical Analysis
Proposition 1 (Unbiasedness and explicit second-moment bound for the two-point ZO estimator).
Let be the per-coordinate uniform quantizer of step size (rounding with optional clipping/zero-point), and let be -Lipschitz in its argument with respect to : for all and all mini-batches . For define the Gaussian-smoothed (forward-only) objective
[TABLE]
and the two-point ZO estimator with i.i.d. directions :
[TABLE]
Assume for all and . Then:
(i) Unbiasedness. The estimator targets the gradient of the smoothed objective:
[TABLE]
(ii) Mean-squared error bound. Writing the expectation over all randomness ,
[TABLE]
In particular, ignoring the quantizer offset term (formally ), the estimator’s MSE scales as under standard Gaussian directions.
Proof.
(i) Unbiasedness. Let and write . Then with . Differentiating under the integral with respect to the mean of the Gaussian and using ,
[TABLE]
By antithetic symmetry of ,
[TABLE]
hence .
(ii) Second-moment/MSE bound. Let
[TABLE]
Using independence of the i.i.d. samples, By -Lipschitzness of and triangle inequality for ,
[TABLE]
where we used the standard quantization geometry and . Applying and Gaussian moment identities , yields
[TABLE]
Dividing by completes the proof. ∎
What STE assumes and why it is biased.
The straight-through estimator (STE) replaces the ill-defined Jacobian of the piecewise-constant quantizer by a hand-crafted surrogate (e.g., or a clipped indicator). The chain rule then yields the surrogate update
[TABLE]
Because is flat almost everywhere, the true chain rule gives a.e., so asserting implicitly enforces gradient invariance to the discrete parameterization: regardless of whether small perturbations of actually change . This mismatch makes a biased estimator of any well-defined target (e.g., from Gaussian smoothing, or Clarke’s generalized gradient of ), and the bias can remain large away from quantization thresholds where the true smoothed gradient vanishes in magnitude.
Proposition 2 (Worst-case STE bias in expectation, 1-D).
Assume and a uniform -bit quantizer of step . Let be a -Lipschitz linear loss in its (quantized) argument. For , let be the distance to the nearest quantization threshold and set . Consider the common STE choice . Then, for every and ,
[TABLE]
In particular, for any , if
[TABLE]
then ; i.e., the STE exhibits an bias in expectation away from thresholds.
Proof via two lemmas.
We first record two standard ingredients.
Lemma 1 (1-D Gaussian tail identities).
If and , then
[TABLE]
and Mills’ bound holds for , where and is the standard normal cdf. ∎
Lemma 2.
Let and . If is -Lipschitz in its argument, then
[TABLE]
Proof. From the Gaussian-smoothing representation (two-point form),
[TABLE]
By -Lipschitzness and symmetry,
[TABLE]
If , both perturbations stay in the same quantization cell and the difference vanishes; otherwise, the quantization geometry yields . Hence
[TABLE]
Expanding and applying Lemma 1 (with Mills’ bound) gives the claim. ∎
We now prove the proposition. For the stated STE with and the linear loss , one has , hence
[TABLE]
Therefore,
[TABLE]
Rearranging yields the thresholded lower bound.
Remark 1.
This formalizes the violation of gradient invariance and explains the expected bias away from thresholds. Multidimensional extensions follow by coordinate-wise threshold distances and union/tail bounds.
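The 1-D picture behind the STE bias can be checked numerically. The sketch below is an illustration under stated assumptions rather than a reproduction of the bounds above: it assumes an unclipped round-to-nearest quantizer with thresholds at half-integer multiples of the step, a linear loss with slope `G`, Gaussian smoothing with standard deviation `eps`, and the identity STE, whose surrogate gradient is simply `G`. Under these assumptions each threshold contributes one jump of height equal to the step, so the smoothed gradient is a sum of Gaussian densities centered at the thresholds.

```python
import math

def smoothed_grad(theta, eps, step, G=1.0, n_thresholds=50):
    """Exact d/dtheta of E_u[G * Q(theta + eps * u)] with u ~ N(0, 1).

    Q is an unclipped round-to-nearest quantizer; the infinite sum over
    thresholds is truncated to those nearest the origin.
    """
    total = 0.0
    for k in range(-n_thresholds, n_thresholds + 1):
        t = (k + 0.5) * step                     # k-th rounding threshold
        z = (t - theta) / eps
        total += step * math.exp(-0.5 * z * z) / (eps * math.sqrt(2.0 * math.pi))
    return G * total

ste_grad = 1.0                                   # identity STE with G = 1

center_small = smoothed_grad(theta=0.0, eps=0.1, step=1.0)   # cell center, small smoothing
on_threshold = smoothed_grad(theta=0.5, eps=0.1, step=1.0)   # exactly on a threshold
center_large = smoothed_grad(theta=0.0, eps=1.0, step=1.0)   # smoothing comparable to step

print(center_small, on_threshold, center_large)
print(ste_grad - center_small)                   # STE bias at the cell center
```

Far from thresholds and with small smoothing, the smoothed gradient is close to zero while the STE still reports `G`, so the bias is close to `G`. The two only agree once the smoothing scale is comparable to the quantization step.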
References
- Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- Bhalgat et al. (2020) Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. LSQ+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 696–697, 2020.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.
- Chen et al. (2023) Aochuan Chen, Yimeng Zhang, Jinghan Jia, James Diffenderfer, Jiancheng Liu, Konstantinos Parasyris, Yihua Zhang, Zheng Zhang, Bhavya Kailkhura, and Sijia Liu. DeepZero: Scaling up zeroth-order optimization for deep model training. arXiv preprint arXiv:2310.02025, 2023.
- Chen et al. (2024a) Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, and Zhiru Zhang. Understanding the potential of FPGA-based spatial acceleration for large language model inference. ACM Transactions on Reconfigurable Technology and Systems, 18(1):1–29, 2024a.
- Chen et al. (2024b) Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. EfficientQAT: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024b.
- Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26, 2017.
- Chowdhery et al. (2023) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
