Scaling LLM Inference with Optimized Sample Compute Allocation

Kexun Zhang; Shang Zhou; Danqing Wang; William Yang Wang; Lei Li

arXiv:2410.22480·cs.CL·October 31, 2024

TL;DR

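This paper introduces OSCA, an algorithm that learns how to split a sample compute budget across inference configurations (model, temperature, language, etc.) of large language models. The learned mix beats the accuracy of the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks.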

Contribution

OSCA formulates sample compute allocation as a learning problem and shows that a learned mix of inference configurations reaches higher accuracy with far less compute than the best single configuration.

Findings

01

Achieves better accuracy than the best single configuration with 128x less compute on code generation.

02

Achieves better accuracy than the best single configuration with 25x less compute on 4 reasoning tasks.

03

Remains effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less compute.

Abstract

Sampling is a basic operation in many inference-time algorithms of large language models (LLMs). To scale up inference efficiently with limited compute, it is crucial to find an optimal allocation for sample compute budgets: Which sampling configurations (model, temperature, language, etc.) do we use? How many samples do we generate in each configuration? We formulate these choices as a learning problem and propose OSCA, an algorithm that Optimizes Sample Compute Allocation by finding an optimal mix of different inference configurations. Our experiments show that with our learned mixed allocation, we can achieve accuracy better than the best single configuration with 128x less compute on code generation and 25x less compute on 4 reasoning tasks. OSCA is also shown to be effective in agentic workflows beyond single-turn tasks, achieving better accuracy on SWE-Bench with 3x less…
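The allocation question in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes we already have a hypothetical table `p[i][j]` estimating the chance that one sample from configuration `i` solves training problem `j`, treats samples as independent, and uses a simple greedy search as a stand-in for whatever optimizer OSCA actually uses. The function names `expected_solve_rate` and `greedy_allocate` are made up for illustration.

```python
from typing import List


def expected_solve_rate(p: List[List[float]], alloc: List[int]) -> float:
    """Mean over problems of P(at least one sample is correct).

    Assumes samples are independent, so a problem stays unsolved with
    probability prod_i (1 - p[i][j]) ** alloc[i].
    """
    num_problems = len(p[0])
    total = 0.0
    for j in range(num_problems):
        fail = 1.0
        for i, n in enumerate(alloc):
            fail *= (1.0 - p[i][j]) ** n
        total += 1.0 - fail
    return total / num_problems


def greedy_allocate(p: List[List[float]], budget: int) -> List[int]:
    """Spend the budget one sample at a time on the configuration that
    raises the estimated solve rate the most (a heuristic, not guaranteed
    to find the optimal mix)."""
    alloc = [0] * len(p)
    for _ in range(budget):
        best_i, best_score = 0, -1.0
        for i in range(len(p)):
            alloc[i] += 1
            score = expected_solve_rate(p, alloc)
            alloc[i] -= 1
            if score > best_score:
                best_i, best_score = i, score
        alloc[best_i] += 1
    return alloc


# Hypothetical per-sample success estimates: rows are configurations,
# columns are training problems. These numbers are invented.
p = [
    [0.9, 0.0],  # configuration A: strong on problem 0 only
    [0.0, 0.9],  # configuration B: strong on problem 1 only
    [0.4, 0.4],  # configuration C: mediocre on both
]

mixed = greedy_allocate(p, budget=2)
mixed_rate = expected_solve_rate(p, mixed)
best_single_rate = max(
    expected_solve_rate(p, [2 if k == i else 0 for k in range(len(p))])
    for i in range(len(p))
)
print(mixed, mixed_rate, best_single_rate)
```

In this toy setting the mixed allocation spends one sample each on configurations A and B and reaches a higher estimated solve rate than putting both samples into any single configuration, which is the intuition behind mixing configurations under a fixed budget.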

Code & Models

Repositories

leililab/osca
Official

Taxonomy

Topics: Natural Language Processing Techniques · Neural Networks and Applications