Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective
Tianyao Shi, Yi Ding

TL;DR
This paper systematically characterizes the performance, energy, and quality tradeoffs of 11 post-training LLM quantization methods across four model sizes (7B-70B) and two GPU architectures (A100, H100), revealing task- and method-dependent tradeoffs and practical deployment challenges.
Contribution
It introduces qMeter, an automated framework for comprehensive LLM quantization evaluation, and provides the first detailed analysis of quantization impacts at the application, workload, parallelism, and hardware levels.
Findings
Quantization effects vary significantly by task and method.
Workload characteristics strongly influence quantization performance.
Deployment challenges include capacity planning and energy-efficient scheduling.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs under realistic serving conditions is still missing. In this work, we first develop qMeter, a fully automated online characterization framework, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across four model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with…
