From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Sean Welleck; Amanda Bertsch; Matthew Finlayson; Hailey Schoelkopf; Alex Xie; Graham Neubig; Ilia Kulikov; Zaid Harchaoui

arXiv:2406.16838·cs.CL·November 21, 2024·2 cites

TL;DR

This paper surveys inference-time algorithms for large language models, focusing on token-level, meta-generation, and efficiency methods, highlighting their roles in improving generation quality and speed during inference.

Contribution

It unifies diverse inference algorithms under a common formalism, bridging NLP, LLMs, and ML systems to advance understanding of inference-time improvements.

Findings

01

Token-level algorithms improve sampling quality.

02

Meta-generation incorporates external knowledge.

03

Efficiency methods reduce token costs.

Abstract

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external…
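To make the abstract's description of token-level generation concrete, here is a minimal sketch of sampling one token at a time from a model's logits. It is an illustration under stated assumptions, not the survey's own code: `next_token_logits` stands in for a real language model and is hypothetical, as are the helper names `softmax`, `sample_token`, `greedy_token`, `generate`, and the toy model `toy_logits`.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into a next-token distribution, with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Draw a single token id from the temperature-scaled distribution."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_token(logits):
    """Pick the highest-scoring token id (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def generate(next_token_logits, prompt, max_new_tokens, eos_id,
             temperature=1.0, rng=random):
    """Sample tokens one at a time until EOS or the length budget is reached.

    `next_token_logits` is a hypothetical callable mapping the token ids
    generated so far to a list of logits over the vocabulary.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = sample_token(next_token_logits(tokens), temperature, rng)
        tokens.append(token)
        if token == eos_id:
            break
    return tokens


def toy_logits(tokens, vocab_size=5):
    """Toy stand-in for a model: favors the id that follows the last token."""
    favored = (tokens[-1] + 1) % vocab_size
    return [3.0 if i == favored else 0.0 for i in range(vocab_size)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate(toy_logits, prompt=[0], max_new_tokens=8, eos_id=4, rng=rng))
```

Lowering `temperature` concentrates probability on the top-scoring token, approaching `greedy_token` in the limit; raising it flattens the distribution. A meta-generation algorithm in the abstract's sense would treat a routine like `generate` as a building block and operate on the partial or full sequences it returns.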

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings