Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum

TL;DR
Jakiro improves speculative decoding for large language models by using a Mixture of Experts (MoE) to generate diverse draft predictions and a hybrid inference strategy that combines autoregressive and parallel decoding. The authors report state-of-the-art speed and accuracy.
Contribution
Introduces Jakiro, an MoE-based approach that decouples correlations among the draft candidates in speculative decoding, together with a hybrid inference strategy and a contrastive feature mechanism.
Findings
Significant accuracy improvements over baseline methods
Higher inference speedups achieved in experiments
Establishment of new state-of-the-art in speculative decoding
Abstract
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the…
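The draft-then-verify loop described at the start of the abstract can be sketched as follows. This is a minimal illustration of generic greedy speculative decoding only, not Jakiro's MoE heads, tree-based sampling, or hybrid inference strategy. The names `speculative_step`, `generate`, `greedy`, `toy_target`, and `toy_draft` are hypothetical, and the two "models" are toy functions standing in for real networks.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round with greedy acceptance."""
    # Draft phase: the small model proposes k tokens autoregressively.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # Verify phase: compare each proposal with the target's own choice.
    # A real system scores all k positions in one batched forward pass;
    # this toy version calls the target once per position for clarity.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out += speculative_step(out, draft_next, target_next, k)
    return out[:len(prefix) + max_new]


def greedy(prefix: List[Token], target_next: NextToken,
           max_new: int) -> List[Token]:
    """Plain token-by-token decoding with the target model, for comparison."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the large target model.
    return (sum(ctx) + len(ctx)) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the draft model: agrees with the target
    # except when the context length is 2 modulo 3.
    guess = toy_target(ctx)
    return (guess + 1) % 10 if len(ctx) % 3 == 2 else guess
```

Under greedy acceptance, `generate([1, 2], toy_draft, toy_target, 4, 12)` yields the same sequence as `greedy([1, 2], toy_target, 12)`: the draft model only changes how many target calls can be batched per round, not the output. Each round produces between 1 and `k + 1` tokens, so a draft that agrees with the target more often yields more tokens per verification pass.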
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Algorithms and Data Compression
