Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

Leo Donisch; Sigurd Schacht; Carsten Lanquillon

arXiv:2408.03130·cs.CL·August 7, 2024

Open Access

TL;DR

This paper reviews various optimization techniques like quantization, pruning, and distillation to enhance the efficiency of large language models, discussing their challenges and practical applications.

Contribution

It provides a comprehensive taxonomy and in-depth analysis of model compression and optimization methods for large language models.

Findings

01. Quantization reduces model size with minimal accuracy loss.
02. Pruning effectively decreases model complexity.
03. Knowledge distillation improves model efficiency without significant performance degradation.

Abstract

Large language models are ubiquitous in natural language processing because they can adapt to new tasks without retraining. However, their sheer scale and complexity present unique challenges and opportunities, prompting researchers and practitioners to explore novel model training, optimization, and deployment methods. This literature review focuses on various techniques for reducing resource requirements and compressing large language models, including quantization, pruning, knowledge distillation, and architectural optimizations. The primary objective is to explore each method in-depth and highlight its unique challenges and practical applications. The discussed methods are categorized into a taxonomy that presents an overview of the optimization landscape and helps navigate it to understand the research trajectory better.

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques