Which Quantization Should I Use? A Unified Evaluation of llama.cpp Quantization on Llama-3.1-8B-Instruct
Uygar Kurt

TL;DR
This paper empirically evaluates llama.cpp quantization schemes on the Llama-3.1-8B-Instruct model to help users choose a format based on task performance and resource constraints.
Contribution
It offers a unified, systematic comparison of quantization formats for llama.cpp, covering performance, efficiency, and practical deployment insights.
Findings
Different quantization schemes vary significantly in downstream task performance.
Some formats offer better trade-offs between model size, speed, and accuracy.
Guidelines are provided for choosing quantization based on specific use cases.
Abstract
Quantization is a practical technique for making large language models easier to deploy by reducing the precision used to store and operate on model weights. This can lower memory use and improve runtime feasibility on constrained hardware, which is especially relevant for users running models locally. Quantization in llama.cpp enables large language models to run on commodity hardware, but available formats are often evaluated inconsistently, making it hard to choose among schemes. We present a unified empirical study of llama.cpp quantization on a single modern model, Llama-3.1-8B-Instruct (FP16, GGUF), covering 3- to 8-bit K-quant and legacy formats. We evaluate downstream task performance across standard reasoning, knowledge, instruction-following, and truthfulness benchmarks, and also measure perplexity and CPU throughput (prefill/decoding) alongside model size, compression, and…
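The efficiency metrics named in the abstract (perplexity, model size, compression, throughput) can be written as short formulas. The Python sketch below shows one common way to compute them; it is not taken from the paper, the function names are hypothetical, and the numbers in the usage section are invented for illustration rather than measured results.

```python
import math


def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_weight(file_size_bytes, n_params):
    """Average storage cost per parameter, in bits."""
    return file_size_bytes * 8 / n_params


def compression_ratio(baseline_bytes, quantized_bytes):
    """How many times smaller the quantized file is than the baseline."""
    return baseline_bytes / quantized_bytes


def tokens_per_second(n_tokens, elapsed_seconds):
    """Throughput for a prefill or decoding run."""
    return n_tokens / elapsed_seconds


if __name__ == "__main__":
    # Illustrative values only (assumptions, not results from the paper).
    n_params = 8_000_000_000          # roughly 8B parameters
    fp16_bytes = 2 * n_params         # 16 bits per weight
    quant_bytes = 4_500_000_000       # hypothetical quantized file size

    print("bits/weight:", bits_per_weight(quant_bytes, n_params))
    print("compression:", compression_ratio(fp16_bytes, quant_bytes))
    print("decode tok/s:", tokens_per_second(128, 10.0))
    print("perplexity:", perplexity([math.log(0.25)] * 4))
```

Reporting bits per weight alongside the raw file size makes formats comparable across models of different parameter counts, which is one reason it is a convenient companion to the compression ratio.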
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Natural Language Processing Techniques
