A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models

Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha

arXiv:2405.13019 · cs.CL · May 27, 2024

TL;DR

This survey reviews recent methods to accelerate text generation in large language models, focusing on techniques like speculative decoding, early exiting, and non-autoregressive approaches to reduce inference latency.

Contribution

It categorizes and analyzes key acceleration techniques in autoregressive LLMs, providing insights and guidance for future research in efficient text generation.

Findings

01. Speculative decoding significantly reduces inference time.

02. Early exiting mechanisms improve efficiency with minimal accuracy loss.

03. Non-autoregressive methods offer promising speedups for large models.
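
To make the first finding concrete, here is a minimal sketch of the greedy variant of speculative decoding. It is an illustration under stated assumptions, not the survey's implementation: `toy_target`, `toy_draft`, and the `NextToken` type are hypothetical stand-ins for a large model and a cheaper draft model, and the verification loop calls the target sequentially where a real system would score all drafted positions in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks the proposal position by position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if expected == tok:
                accepted.append(tok)       # draft guessed correctly
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every drafted token matched, so the target contributes one extra token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    # The last round may overshoot; trim to the requested length.
    return out[: len(prompt) + max_new]


# Toy models (assumptions for illustration only).
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 3.
    return toy_target(seq) if len(seq) % 3 else 0
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target alone; any speedup comes from how many drafted tokens are accepted per verification round.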

Abstract

Despite the crucial importance of accelerating text generation in large language models (LLMs) for efficiently producing content, the sequential nature of this process often leads to high inference latency, posing challenges for real-time applications. Various techniques have been proposed and developed to address these challenges and improve efficiency. This paper presents a comprehensive survey of accelerated generation techniques in autoregressive language models, aiming to understand the state-of-the-art methods and their applications. We categorize these techniques into several key areas: speculative decoding, early exiting mechanisms, and non-autoregressive methods. We discuss each category's underlying principles, advantages, limitations, and recent advancements. Through this survey, we aim to offer insights into the current landscape of techniques in LLMs and provide guidance…

Code & Models

Repositories

arenaa/accelerated-generation-techniques (official)

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis

Methods: Early exiting using confidence measures
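
As a rough illustration of the method named above, the sketch below exits a layer stack as soon as the maximum softmax probability from a shared exit head crosses a threshold. All names here (`early_exit_predict`, `layers`, `exit_head`, `threshold`) are hypothetical, and maximum softmax probability is only one possible confidence measure; the paper may discuss others.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(hidden: Vector,
                       layers: Sequence[Callable[[Vector], Vector]],
                       exit_head: Callable[[Vector], Vector],
                       threshold: float = 0.9) -> Tuple[int, int]:
    """Return (predicted_token, layers_used), stopping once confidence >= threshold."""
    if not layers:
        raise ValueError("at least one layer is required")
    probs: List[float] = []
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        probs = softmax(exit_head(hidden))
        # Confidence measure (assumption): the maximum class probability.
        if max(probs) >= threshold:
            break  # confident enough, skip the remaining layers
    return probs.index(max(probs)), depth


# Toy setup (illustrative only): each layer doubles the hidden state,
# and the exit head reads the hidden state directly as logits.
toy_layers = [lambda h: [2.0 * x for x in h]] * 4
toy_head = lambda h: list(h)
```

With the toy setup, a lower threshold exits after fewer layers, while a threshold of 1.0 forces the full stack; that trade-off between compute and confidence is what the threshold controls.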