Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant

Jemin Lee; Sihyeong Park; Jinse Kwon; Jihun Oh; Yongin Kwon

arXiv:2409.11055 · cs.CL · June 5, 2025 · 2 cites

TL;DR

This study comprehensively evaluates quantization methods on instruction-tuned large language models from 1B to 405B parameters, revealing insights into their robustness, limitations, and the impact of quantization on various tasks.

Contribution

It provides a detailed analysis of four quantization methods across diverse datasets and model sizes, highlighting their effects on performance and task-specific robustness.

Findings

01. FP8 is the most robust quantization method across tasks.

02. Smaller models suffer severe accuracy drops at 4-bit quantization.

03. Quantization amplifies inherent model weaknesses rather than just task difficulty.

Abstract

Quantization has gained attention as a promising solution for the cost-effective deployment of large and small language models. However, most prior work has been limited to perplexity or basic knowledge tasks and lacks a comprehensive evaluation of recent models like Llama-3.3. In this paper, we conduct a comprehensive evaluation of instruction-tuned models spanning 1B to 405B parameters, applying four quantization methods across 13 datasets. Our findings reveal that (1) quantized models generally surpass smaller FP16 baselines, yet they often struggle with instruction-following and hallucination detection; (2) FP8 consistently emerges as the most robust option across tasks, and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale models maintain stable performance; (4) notably, hard…
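To give a rough intuition for why bit width matters, the following is a minimal sketch of plain symmetric round-to-nearest integer quantization applied to synthetic weights. It is not the paper's method and is not AWQ, GPTQ, or FP8: the function names `quantize_rtn` and `mean_abs_error`, the Gaussian weight distribution, and the per-tensor scaling are all assumptions made for illustration, and the 8-bit case here is an integer grid rather than the FP8 floating-point format.

```python
import random

def quantize_rtn(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization, then dequantization.

    Illustrative only: a hypothetical helper, not any library's API.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = round(w / scale)              # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the signed integer grid
        out.append(q * scale)             # map back to floating point
    return out

def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Synthetic weights (an assumption; real model weights are distributed differently).
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err4 = mean_abs_error(weights, quantize_rtn(weights, 4))
err8 = mean_abs_error(weights, quantize_rtn(weights, 8))
print(f"4-bit mean abs error: {err4:.6f}")
print(f"8-bit mean abs error: {err8:.6f}")
```

With only 15 representable levels at 4 bits versus 255 at 8 bits, the reconstruction error is noticeably larger in the 4-bit case. How that weight-level error translates into task accuracy depends on the model and the quantization method, which is what the paper measures.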

Code & Models

Repositories

https://gitlab.com/ones-ai/eval-quant-llms (official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need · LLaMA