Efficient Beam Search for Large Language Models Using Trie-Based Decoding

Brian J Chan; MaoXun Huang; Jui-Hung Cheng; Chao-Ting Chen; Hen-Hsen Huang

arXiv:2502.00085·cs.CL·September 23, 2025

TL;DR

This paper introduces a trie-based parallel decoding method for beam search in large language models. It reduces memory usage by 4-8x and speeds up decoding by up to 2.4x without compromising generation quality, making it suitable for memory-constrained settings.

Contribution

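The authors propose a trie (prefix-tree)-based decoding approach in which beams with a common prefix share a single KV cache. This improves memory efficiency and decoding speed across three attention architectures: Multi-Head Attention, Grouped Query Attention, and Sliding Window Attention.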

Findings

1. Achieves 4-8x memory savings during decoding
2. Provides up to 2.4x faster decoding
3. Maintains comparable quality in summarization and code generation

Abstract

This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures, Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4--8$\times$) and up to 2.4$\times$ faster decoding, without compromising generation quality. These results highlight our method's suitability for memory-constrained environments and large-scale deployments.
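
The prefix-sharing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `TrieNode`, `build_trie`, and `count_nodes` are hypothetical, the beams are made-up token ids, and each trie node merely stands in for one cached key/value entry. The sketch compares how many per-token entries a batch-based layout (one full copy per beam) would hold against a trie in which common prefixes are stored once.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One generated token; in a real decoder this node would own one KV-cache entry."""
    token: int
    children: dict = field(default_factory=dict)


def build_trie(beams):
    """Insert every beam (a list of token ids) into a prefix tree."""
    root = TrieNode(token=-1)  # sentinel root, holds no token
    for beam in beams:
        node = root
        for tok in beam:
            # Reuse the existing child when the prefix is already present.
            node = node.children.setdefault(tok, TrieNode(tok))
    return root


def count_nodes(node):
    """Count token nodes below `node`, i.e. entries a shared cache would store."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Hypothetical beams (illustrative token ids, not data from the paper).
beams = [
    [5, 7, 9, 2],
    [5, 7, 9, 4],
    [5, 7, 3, 8],
    [5, 6, 1, 1],
]

per_beam_entries = sum(len(b) for b in beams)    # batch-based: every beam keeps its own copy
shared_entries = count_nodes(build_trie(beams))  # trie-based: shared prefixes stored once
print(per_beam_entries, shared_entries)  # prints "16 10"
```

The ratio in this toy case depends entirely on the chosen beams and says nothing about the 4--8$\times$ savings reported above; it only shows why beams that diverge late leave most of their cache shareable.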

Taxonomy

Topics: Natural Language Processing Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate