Automatic Task Detection and Heterogeneous LLM Speculative Decoding
Danying Ge, Jianhua Gao, Qizhi Jiang, Yifei Feng, Weixing Ji

TL;DR
This paper introduces an adaptive speculative decoding approach that automatically categorizes downstream tasks and assigns them to heterogeneous draft models, improving draft accuracy by 6% to 50% and speeding up large language model inference by 1.10x to 2.64x.
Contribution
It presents a novel task-aware speculative decoding algorithm with automatic task partitioning and dynamic prompt routing, enhancing efficiency and consistency across diverse tasks.
Findings
Improves draft accuracy by 6% to 50%.
Achieves 1.10x to 2.64x speedup in LLM inference.
Effectively handles heterogeneous downstream tasks.
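To see why higher draft accuracy tends to translate into faster inference, the sketch below uses the standard simplified analysis of speculative decoding, not a formula taken from this paper. It assumes each drafted token is accepted independently with probability `alpha` and that the draft model proposes `gamma` tokens per round; both names and the sample values are illustrative assumptions. Under that assumption, the expected number of tokens produced per target-model verification pass is (1 - alpha^(gamma+1)) / (1 - alpha).

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    Assumes i.i.d. per-token acceptance with probability `alpha` and a draft
    length of `gamma` (a simplification; real acceptance varies by position).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one token from the target.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Illustrative acceptance rates only; these are not results from the paper.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 3))
```

This counts tokens per target pass only; actual wall-clock speedup also depends on how expensive the draft model is relative to the target model.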
Abstract
Speculative decoding, which combines a draft model with a target model, has emerged as an effective approach to accelerate large language model (LLM) inference. However, existing methods often face a trade-off between the acceptance rate and decoding speed in downstream tasks due to the limited capacity of the draft model, making it difficult to ensure efficiency across diverse tasks. To address this problem, we propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assigning method, which automatically categorizes downstream tasks into different sub-tasks and assigns them to a set of heterogeneous draft models. Each draft model is aligned with the target model using task-specific data, thereby enhancing the consistency of inference results. In addition, our proposed method incorporates an online lightweight…
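The routing-plus-drafting idea described in the abstract can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the keyword-based `route` function replaces whatever router the paper actually uses (the summary is truncated before describing it), the "models" are toy functions from a token context to the next token, and verification is done token by token rather than in one batched target pass. The point is only to show that a greedy draft-then-verify loop returns the same tokens as greedy decoding with the target model alone, whichever draft model is selected.

```python
from typing import Callable, Dict, List

Model = Callable[[List[str]], str]  # maps a token context to the next token


def route(prompt: str, keywords: Dict[str, List[str]], default: str) -> str:
    """Pick a task label by keyword match (hypothetical stand-in for a learned router)."""
    lowered = prompt.lower()
    for task, words in keywords.items():
        if any(word in lowered for word in words):
            return task
    return default


def greedy_decode(model: Model, context: List[str], max_new: int) -> List[str]:
    """Plain greedy decoding, used as the reference output."""
    out = list(context)
    for _ in range(max_new):
        out.append(model(out))
    return out[len(context):]


def speculative_decode(target: Model, draft: Model, context: List[str],
                       gamma: int, max_new: int) -> List[str]:
    """Greedy draft-then-verify loop (simplified: no bonus token, no batching)."""
    if gamma < 1:
        raise ValueError("gamma must be at least 1")
    out = list(context)
    produced = 0
    while produced < max_new:
        # The draft model proposes gamma tokens autoregressively.
        proposal: List[str] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # The target keeps the longest matching prefix and substitutes its own
        # token at the first mismatch, so each round emits at least one token.
        accepted: List[str] = []
        for token in proposal:
            expected = target(out + accepted)
            accepted.append(expected)
            if token != expected:
                break
        accepted = accepted[: max_new - produced]
        out.extend(accepted)
        produced += len(accepted)
    return out[len(context):]


# Toy target model and two toy task-specific draft models (all hypothetical).
target: Model = lambda ctx: "abc"[len(ctx) % 3]
drafts: Dict[str, Model] = {
    "code": lambda ctx: "abc"[len(ctx) % 3],  # agrees with the target
    "math": lambda ctx: "a",                  # agrees only some of the time
}
keywords = {"code": ["python", "function"], "math": ["solve", "integral"]}

task = route("Write a Python function", keywords, default="math")
tokens = speculative_decode(target, drafts[task], ["<s>"], gamma=4, max_new=6)
```

A better-aligned draft model does not change the output; it only raises the number of drafted tokens accepted per round, which is where the speedup comes from.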
Taxonomy
Methods: Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
