Resource-Efficient Language Models: Quantization for Fast and Accessible Inference

Tollef Emil Jørgensen

arXiv:2505.08620 · cs.AI · May 14, 2025

TL;DR

This paper reviews post-training quantization techniques that improve the inference efficiency of large language models, making them faster and more accessible while balancing accuracy and resource use.

Contribution

It gives a high-level overview of post-training quantization (PTQ) methods, covering quantization schemes, granularities, and their trade-offs, and connects the theory to practical application.

Findings

01

Summarizes various PTQ schemes and their effectiveness.

02

Highlights trade-offs between model size, speed, and accuracy.

03

Offers insights into practical deployment of quantized LLMs.

Abstract

Large language models have significantly advanced natural language processing, yet their heavy resource demands pose severe challenges regarding hardware accessibility and energy consumption. This paper presents a focused and high-level review of post-training quantization (PTQ) techniques designed to optimize the inference efficiency of LLMs by the end-user, including details on various quantization schemes, granularities, and trade-offs. The aim is to provide a balanced overview between the theory and applications of post-training quantization.
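To make "granularity" and "trade-offs" concrete, here is a minimal sketch of symmetric round-to-nearest int8 quantization applied at two granularities: one scale for the whole tensor, and one scale per row. It is an illustration under stated assumptions, not a method taken from the paper: the toy weight matrix and the helper names (`quantize_symmetric`, `dequantize`, `mean_abs_error`) are hypothetical and use only the Python standard library.

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric round-to-nearest quantization using a single shared scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [qi * scale for qi in q]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical toy weights: the second row has a much larger range than the first.
weights = [
    [0.01, -0.02, 0.03, -0.015],
    [1.5, -2.0, 0.75, 1.25],
]
flat = [w for row in weights for w in row]

# Per-tensor granularity: one scale shared by every weight.
q_flat, s_flat = quantize_symmetric(flat)
err_per_tensor = mean_abs_error(flat, dequantize(q_flat, s_flat))

# Per-row granularity: each row gets its own scale.
recon = []
for row in weights:
    q_row, s_row = quantize_symmetric(row)
    recon.extend(dequantize(q_row, s_row))
err_per_row = mean_abs_error(flat, recon)

print(f"per-tensor mean abs error: {err_per_tensor:.6f}")
print(f"per-row mean abs error:    {err_per_row:.6f}")
```

In this toy case the per-row variant reconstructs the small-magnitude row more accurately, because its scale is no longer dictated by the large values in the other row. The cost is one extra stored scale per row, which is the kind of size-versus-accuracy trade-off the abstract refers to.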
