Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

Zhendong Zhang

arXiv:2502.05947·cs.CV·February 11, 2025

TL;DR

This paper replaces the fixed tree attention structure used in multiple heads decoding for LLMs (specifically MEDUSA) with a dynamic tree built by a simple, low-complexity candidate generation strategy. Preliminary experiments show improved decoding efficiency while generation quality is maintained.

Contribution

It proposes dynamic tree attention for multiple heads decoding in the context of MEDUSA, together with a simple, low-complexity strategy for generating candidates and constructing the tree structure.

Findings

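01 Improved decoding efficiency of multiple heads decoding for LLMs in preliminary experiments

02 Generation quality maintained

03 Room for further improvement of multiple heads decoding through better candidate generation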

Abstract

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed structure. In this paper, we replace the fixed tree attention with dynamic tree attention in multiple heads decoding, specifically in the context of MEDUSA. We propose a simple, low-complexity strategy to generate candidates and construct the dynamic tree structure. Preliminary experiments show that the proposed method improves the decoding efficiency of multiple heads decoding for LLMs while maintaining generation quality. This result demonstrates the potential for improving candidate generation in multiple heads decoding.
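The abstract does not spell out the candidate generation strategy, so the following is only an illustrative sketch of how a dynamic tree could be built at each decoding step. It assumes (this is an assumption, not a claim about the paper) that each head's prediction is treated as independent, so a candidate's score is the product of per-head probabilities, and that all probabilities are strictly positive. The names `top_n_candidates`, `build_tree`, and `tree_attention_mask` are hypothetical, and the root token that every node would also attend to is omitted for brevity.

```python
import heapq
import math


def top_n_candidates(head_probs, n):
    """Return the n highest-scoring token sequences and their scores.

    head_probs[i] maps token id -> probability for the i-th future position.
    Assumption: heads are independent, so a sequence's score is the product
    of its per-head probabilities (all assumed > 0).
    """
    # Rank each head's tokens by descending probability once.
    ranked = [sorted(p.items(), key=lambda kv: -kv[1]) for p in head_probs]

    def score(idx):
        # Log-probability of the sequence selected by per-head ranks idx.
        return sum(math.log(ranked[h][j][1]) for h, j in enumerate(idx))

    start = (0,) * len(ranked)
    heap = [(-score(start), start)]
    seen = {start}
    out = []
    # Best-first search over rank tuples: the best unseen tuple is always
    # a one-step successor of a tuple that has already been popped.
    while heap and len(out) < n:
        neg, idx = heapq.heappop(heap)
        tokens = tuple(ranked[h][j][0] for h, j in enumerate(idx))
        out.append((tokens, math.exp(-neg)))
        for h in range(len(idx)):
            if idx[h] + 1 < len(ranked[h]):
                nxt = idx[:h] + (idx[h] + 1,) + idx[h + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))
    return out


def build_tree(candidates):
    """Merge candidate sequences into a prefix tree.

    Returns (nodes, parents): nodes[k] is the token id of node k and
    parents[k] is its parent's index, or -1 for a first-position node.
    """
    nodes, parents, index = [], [], {}
    for seq, _ in candidates:
        for depth in range(1, len(seq) + 1):
            prefix = seq[:depth]
            if prefix not in index:
                index[prefix] = len(nodes)
                nodes.append(prefix[-1])
                parents.append(index[prefix[:-1]] if depth > 1 else -1)
    return nodes, parents


def tree_attention_mask(parents):
    """mask[i][j] is True when node i may attend to node j (itself or an ancestor)."""
    size = len(parents)
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# Toy example with two heads and made-up probabilities.
head_probs = [{10: 0.6, 11: 0.4}, {20: 0.7, 21: 0.3}]
candidates = top_n_candidates(head_probs, n=3)
nodes, parents = build_tree(candidates)
mask = tree_attention_mask(parents)
```

Because the candidates are recomputed from the current head outputs, the resulting tree shape and mask change from step to step, in contrast to a fixed tree whose mask can be precomputed once.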

Taxonomy

Topics: Algorithms and Data Compression · Vehicle License Plate Recognition · Neural Networks and Applications