KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning

Kaiqi Zhang; Jing Zhao; Rui Chen

arXiv:2408.08146 · cs.CL · August 16, 2024

TL;DR

KOALA combines a multi-layer draft head with adversarial training to improve speculative decoding in LLMs, raising the draft head's token-prediction accuracy and reducing inference latency relative to conventional single-layer draft heads.

Contribution

The paper proposes a multi-layer draft head architecture trained with adversarial learning alongside conventional supervised training, improving speculative decoding performance in large language models.

Findings

01  Latency speedup ratio improved by 0.24x-0.41x

02  Draft head accuracy significantly increased

03  Outperforms original draft heads by 10.57%-14.09% in speed

Abstract

Large Language Models (LLMs) exhibit high inference latency due to their autoregressive decoding nature. While the draft head in speculative decoding mitigates this issue, its full potential remains unexplored. In this paper, we introduce KOALA (K-layer Optimized Adversarial Learning Architecture), an orthogonal approach to the draft head. By transforming the conventional single-layer draft head into a multi-layer architecture and incorporating adversarial learning into the traditional supervised training, KOALA significantly improves the accuracy of the draft head in predicting subsequent tokens, thus more closely mirroring the functionality of LLMs. Although this improvement comes at the cost of slightly increased drafting overhead, KOALA substantially unlocks the draft head's potential, greatly enhancing speculative decoding. We conducted comprehensive evaluations of KOALA, including…
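
The trade-off the abstract describes (higher draft accuracy at the cost of slightly more drafting overhead) can be illustrated with the standard speedup estimate for chain-style speculative decoding. This is a minimal sketch, not KOALA's own evaluation: it assumes each drafted token is accepted independently with probability `alpha`, and the function names and all numeric values below are hypothetical, chosen only for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens drafted before each verification pass.
    """
    if alpha >= 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time of one draft pass divided by time of one target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Hypothetical settings (not taken from the paper):
# a cheaper, less accurate head versus a costlier, more accurate one.
single_layer = speedup(alpha=0.70, gamma=4, cost_ratio=0.05)
multi_layer = speedup(alpha=0.80, gamma=4, cost_ratio=0.08)

print(f"single-layer head: {single_layer:.2f}x")
print(f"multi-layer head:  {multi_layer:.2f}x")
```

With these made-up inputs the more accurate head comes out ahead (about 2.55x versus 2.31x) even though each draft pass costs more, which is the kind of net gain the abstract claims. Draft heads in practice often verify a tree of candidates rather than a single chain, so this model is only a rough guide.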

Taxonomy

Topics: Natural Language Processing Techniques · Digital Rights Management and Security