Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct

Uygar Kurt

arXiv:2601.14277·cs.LG·January 22, 2026


TL;DR

This paper presents a unified empirical evaluation of llama.cpp quantization formats (3-8 bit K-quant and legacy) on Llama-3.1-8B-Instruct, comparing downstream task performance, perplexity, CPU throughput, and model size to help users choose a format for their resource constraints.

Contribution

It offers a unified, systematic comparison of quantization formats for llama.cpp, covering performance, efficiency, and practical deployment insights.

Findings

01. Different quantization schemes vary significantly in downstream task performance.

02. Some formats offer better trade-offs between model size, speed, and accuracy.

03. Guidelines are provided for choosing quantization based on specific use cases.

Abstract

Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware, which is especially relevant for users running models locally. Quantization in llama.cpp enables large language models to run on commodity hardware, but the available formats are often evaluated inconsistently, making it hard to choose among them. We present a unified empirical study of llama.cpp quantization on a single modern model, Llama-3.1-8B-Instruct (FP16, GGUF), covering 3-8 bit K-quant and legacy formats. We evaluate downstream task performance across standard reasoning, knowledge, instruction-following, and truthfulness benchmarks, and also measure perplexity and CPU throughput (prefill/decoding) alongside model size, compression, and…
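To make the idea of "reducing the precision used to store model weights" concrete, here is a minimal sketch of generic block-wise symmetric quantization. It is an illustration only, not llama.cpp's actual storage format or the paper's method: the function names, the block size of 32, the 16-bit scale, and the synthetic Gaussian weights are all assumptions chosen for the example.

```python
import random

def quantize_block(values, bits=4):
    """Symmetric round-to-nearest quantization of one block with a shared scale.

    Hypothetical helper for illustration; real formats pack bits and may
    store extra per-block metadata.
    """
    qmax = 2 ** (bits - 1) - 1              # largest integer level, e.g. 7 for 4 bits
    amax = max(abs(v) for v in values)      # largest magnitude in the block
    scale = amax / qmax if amax > 0 else 1.0
    # Round each weight to the nearest level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, q

def dequantize_block(scale, q):
    """Map integer levels back to approximate floating-point weights."""
    return [scale * x for x in q]

def bits_per_weight(bits, block_size, scale_bits=16):
    """Effective storage cost: payload bits plus the amortized per-block scale."""
    return bits + scale_bits / block_size

random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # synthetic weights (assumption)

scale, q = quantize_block(block, bits=4)
restored = dequantize_block(scale, q)

# Worst-case reconstruction error in this block; bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(block, restored))

bpw = bits_per_weight(4, 32)        # 4 payload bits + 16/32 bits of scale overhead
compression = 16 / bpw              # relative to a 16-bit baseline

print(f"bits per weight: {bpw}")
print(f"compression vs 16-bit: {compression:.2f}x")
print(f"max abs error: {max_err:.6f} (scale = {scale:.6f})")
```

Under these assumptions the per-block scale adds overhead on top of the nominal bit width, which is why effective bits per weight and the resulting compression ratio are worth reporting alongside accuracy and throughput.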


Code & Models

Models

🤗 uygarkurt/Llama-3.1-8B-Instruct-GGUF (model · 30 dl)


Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Natural Language Processing Techniques