Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE

Haiduo Huang; Fuwei Yang; Zhenhua Liu; Yixing Xu; Jinze Li; Yang Liu; Xuanwu Yin; Dong Li; Pengju Ren; Emad Barsoum

arXiv:2502.06282 · cs.CL · March 12, 2025

TL;DR

Jakiro improves speculative decoding for large language models by using Mixture of Experts (MoE) to generate diverse candidate predictions and by adopting a hybrid inference strategy (autoregressive decoding for initial tokens, parallel decoding for later ones). It reports state-of-the-art speedups and prediction accuracy.

Contribution

Introduces Jakiro, an MoE-based approach that decouples correlations among same-step candidates in speculative decoding, along with a hybrid inference strategy and a contrastive feature mechanism.

Findings

01. Significant accuracy improvements over baseline methods

02. Higher inference speedups achieved in experiments

03. Establishment of a new state of the art in speculative decoding

Abstract

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the…
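The draft-then-verify loop that the abstract opens with can be sketched in a few lines. This is a minimal illustration of plain greedy speculative decoding with a single linear chain of draft tokens, not Jakiro's MoE draft head or its tree-based sampling. The function names (`draft_tokens`, `verify_greedy`, `speculative_decode`) and the toy models are hypothetical and introduced only for this example; a real system would score all drafted positions in one parallel forward pass of the target model rather than calling it token by token.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: a "model" maps a token context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def draft_tokens(draft_next: NextTokenFn, prefix: List[Token], k: int) -> List[Token]:
    """Autoregressively propose k tokens with the (cheap) draft model."""
    ctx = list(prefix)
    proposed: List[Token] = []
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    return proposed


def verify_greedy(target_next: NextTokenFn, prefix: List[Token],
                  proposed: List[Token]) -> List[Token]:
    """Keep the longest proposed prefix that matches the target's greedy choices.

    At the first mismatch the target's own token is emitted instead; if every
    proposal matches, one extra target token is appended. Either way at least
    one new token is produced per call.
    """
    ctx = list(prefix)
    accepted: List[Token] = []
    for tok in proposed:
        expected = target_next(ctx)  # done in parallel in a real implementation
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))
    return accepted


def speculative_decode(target_next: NextTokenFn, draft_next: NextTokenFn,
                       prompt: List[Token], max_new_tokens: int, k: int = 4) -> List[Token]:
    """Repeat draft-then-verify until max_new_tokens have been generated."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        proposed = draft_tokens(draft_next, out, k)
        out.extend(verify_greedy(target_next, out, proposed))
    return out[:len(prompt) + max_new_tokens]


# Toy models (assumptions for illustration): the target counts upward mod 10,
# and the draft agrees with it except right after token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1], max_new_tokens=8))
```

Under greedy verification the output matches what the target model would produce on its own; the speedup comes from how many drafted tokens are accepted per verification step. The abstract's argument is that this acceptance rate depends on the quality and diversity of the draft candidates.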

Code & Models

Repositories

haiduo/Jakiro (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Algorithms and Data Compression