Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

Rohin Manvi; Anikait Singh; Stefano Ermon

arXiv:2410.02725 · cs.CL · October 4, 2024

TL;DR

This paper introduces a self-evaluation method for large language models that predicts mid-generation whether additional sampling will improve responses, reducing computation while maintaining or enhancing performance.

Contribution

The paper presents a generative reward model formulation that lets an LLM decide, mid-generation, whether further sampling is likely to help, eliminating the need for an external reward model.

Findings

01. Increases Llama 3.1 8B's win rate against GPT-4 from 21% to 34%.

02. Improves GSM8K math accuracy from 84% to 91%.

03. Achieves 74% of full sampling benefits with only 1.2 samples on average.

Abstract

Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves…
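The decision loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate` and `p_restart_better` are hypothetical callables standing in for the LLM's sampling call and its self-evaluation (the predicted probability that restarting would yield a better response), and `stop_threshold` and `keep_fraction` are assumed tuning knobs that do not come from the paper.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not from the paper):
#   generate(prompt) -> one sampled response
#   p_restart_better(prompt, text) -> predicted probability in [0, 1] that
#   restarting generation would produce a better response than `text`
Generate = Callable[[str], str]
SelfEval = Callable[[str, str], float]


def adaptive_best_of_n(
    generate: Generate,
    p_restart_better: SelfEval,
    prompt: str,
    max_samples: int = 16,
    stop_threshold: float = 0.2,  # assumed knob, not a value from the paper
) -> Tuple[str, int]:
    """Sample only while the self-evaluation says resampling may still help.

    Returns the response with the lowest predicted restart benefit and the
    number of samples actually drawn.
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    best_response, best_p = "", float("inf")
    used = 0
    for used in range(1, max_samples + 1):
        response = generate(prompt)
        p = p_restart_better(prompt, response)
        if p < best_p:
            best_response, best_p = response, p
        # Stop early once restarting is predicted to be unlikely to help.
        if best_p <= stop_threshold:
            break
    return best_response, used


def prune_partials(
    partials: List[str],
    p_restart_better: SelfEval,
    prompt: str,
    keep_fraction: float = 0.5,  # assumed knob, not a value from the paper
) -> List[str]:
    """Keep the partial generations least likely to be beaten by a restart."""
    ranked = sorted(partials, key=lambda text: p_restart_better(prompt, text))
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```

The same score serves all three uses named in the abstract: deciding whether to draw another sample, pruning unpromising partial generations, and picking the final response. Ranking by the lowest predicted restart benefit is one plausible selection rule under these assumptions.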

Taxonomy

Topics: Reservoir Engineering and Simulation Methods · Distributed and Parallel Computing Systems

Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings