Efficient Beam Search for Large Language Models Using Trie-Based Decoding
Brian J Chan, MaoXun Huang, Jui-Hung Cheng, Chao-Ting Chen, Hen-Hsen Huang

TL;DR
This paper introduces a trie-based parallel decoding method for large language models that significantly reduces memory usage and speeds up decoding without sacrificing quality, making it suitable for resource-limited settings.
Contribution
The authors propose a novel trie-based decoding approach that shares KV caches across beams, improving memory efficiency and decoding speed for various attention architectures.
Findings
Achieves 4-8x memory savings during decoding
Provides up to 2.4x faster decoding speeds
Maintains comparable quality in summarization and code generation
Abstract
This work presents a novel trie (prefix-tree)-based parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluated our method across three attention architectures, Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4-8x) and up to 2.4x faster decoding, without compromising generation quality. These results highlight our method's suitability for memory-constrained environments and large-scale deployments.
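The prefix-sharing idea in the abstract can be illustrated with a toy count of cache entries. The sketch below is not the authors' implementation: `TrieNode`, `insert_beam`, `count_nodes`, and `cache_entries` are hypothetical names, the token ids are made up, and each trie node merely stands in for one per-token KV-cache entry. It compares how many entries a batch layout (one full copy per beam) would hold against a trie layout (one entry per distinct prefix token).

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One decoded token; a real decoder would attach one KV-cache entry here."""
    token: int
    children: dict = field(default_factory=dict)


def insert_beam(root: TrieNode, tokens: list[int]) -> None:
    """Add a beam's token sequence, reusing existing nodes for any shared prefix."""
    node = root
    for tok in tokens:
        if tok not in node.children:
            node.children[tok] = TrieNode(tok)
        node = node.children[tok]


def count_nodes(root: TrieNode) -> int:
    """Count token nodes below the root (cache entries in the trie layout)."""
    return sum(1 + count_nodes(child) for child in root.children.values())


def cache_entries(beams: list[list[int]]) -> tuple[int, int]:
    """Return (batch_entries, trie_entries) for a set of beams."""
    root = TrieNode(token=-1)  # sentinel root, holds no token of its own
    for beam in beams:
        insert_beam(root, beam)
    batch_entries = sum(len(beam) for beam in beams)  # every beam stores its full sequence
    return batch_entries, count_nodes(root)


# Hypothetical beams (made-up token ids) with overlapping prefixes.
beams = [
    [5, 9, 2, 7, 1],
    [5, 9, 2, 7, 3],
    [5, 9, 2, 4, 8],
    [5, 9, 6, 0, 2],
]
batch, trie = cache_entries(beams)
print(batch, trie)  # prints "20 11"
```

The counts here are purely illustrative and are not the paper's measurements; the actual savings depend on how much prefix the beams share, including the prompt, and on the attention architecture.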
Taxonomy
Topics: Natural Language Processing Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate
