Yggdrasil: Bridging Dynamic Speculation and Static Runtime for Latency-Optimal Tree-Based LLM Decoding
Yue Guan, Changming Yu, Shihan Fang, Weiming Hu, Zaifeng Pan, Zheng Wang, Zihan Liu, Yangjie Zhou, Yufei Ding, Minyi Guo, Jingwen Leng

TL;DR
Yggdrasil is a system that enhances speculative decoding for large language models by aligning dynamic and static components, achieving near-optimal latency and significant speedups across hardware platforms.
Contribution
It introduces a co-designed approach that combines context-aware tree drafting, latency-aware draft selection, and stage-based scheduling to optimize speculative decoding performance.
Findings
Achieves up to 3.98x speedup over baselines
Supports unmodified LLMs across hardware setups
Enables latency-optimal speculative decoding
Abstract
Speculative decoding improves LLM inference by generating and verifying multiple tokens in parallel, but existing systems suffer from suboptimal performance due to a mismatch between dynamic speculation and static runtime assumptions. We present Yggdrasil, a co-designed system that enables latency-optimal speculative decoding through context-aware tree drafting and compiler-friendly execution. Yggdrasil introduces an equal-growth tree structure for static graph compatibility, a latency-aware optimization objective for draft selection, and stage-based scheduling to reduce overhead. Yggdrasil supports unmodified LLMs and achieves up to 3.98x speedup over state-of-the-art baselines across multiple hardware setups.
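The idea of a latency-aware objective for draft selection can be illustrated with a toy model. The sketch below is not Yggdrasil's implementation: the acceptance model, the latency constants, and the function names (expected_accepted_tokens, step_latency_ms, pick_tree_shape) are all assumptions made for illustration. It reads "equal-growth" as a tree that adds the same number of nodes at every draft step, and it picks the width and depth that maximize expected accepted tokens per millisecond rather than accepted tokens alone.

```python
from itertools import product


def expected_accepted_tokens(accept_prob: float, width: int, depth: int) -> float:
    """Expected tokens committed per decoding step under a toy model.

    Assumption: a tree level is accepted if any of its `width` candidates
    matches, each independently with probability `accept_prob`, and a level
    only counts if every earlier level was accepted.
    """
    p_level = 1.0 - (1.0 - accept_prob) ** width
    expected = 1.0  # the verifier always contributes one token
    survive = 1.0
    for _ in range(depth):
        survive *= p_level
        expected += survive
    return expected


def step_latency_ms(
    width: int,
    depth: int,
    draft_ms: float = 1.0,               # hypothetical cost of one draft step
    verify_base_ms: float = 20.0,        # hypothetical fixed verification cost
    verify_per_token_ms: float = 0.05,   # hypothetical cost per verified token
) -> float:
    """Latency of one step: `depth` draft calls plus one verification pass
    over the width * depth drafted tokens."""
    return depth * draft_ms + verify_base_ms + verify_per_token_ms * width * depth


def pick_tree_shape(accept_prob: float, widths, depths):
    """Return the (width, depth) pair with the most expected tokens per ms."""
    return max(
        product(widths, depths),
        key=lambda wd: expected_accepted_tokens(accept_prob, *wd) / step_latency_ms(*wd),
    )


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.8):
        print(p, pick_tree_shape(p, widths=[1, 2, 4], depths=[1, 2, 4]))
```

Under this model a larger tree raises the expected number of accepted tokens but also raises the step latency, so the best shape shifts with the draft quality. Because the candidate shapes come from a small fixed set, each one could in principle be compiled ahead of time as a static graph, which is the kind of compatibility the abstract refers to.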
Taxonomy
Topics: Network Packet Processing and Optimization · Graph Theory and Algorithms · Parallel Computing and Optimization Techniques
