Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference
Libo Zhang, Zhaoning Zhang, Baizhou Xu, Rui Li, Zhiliang Tian, Songzhu Mei, Dongsheng Li

TL;DR
Dovetail is a novel heterogeneous speculative decoding method that accelerates large language model inference on consumer-grade devices by leveraging CPU-GPU cooperation and optimizing data transfer, achieving up to 10.1x speedup.
Contribution
The paper introduces Dovetail, a lossless inference acceleration technique that utilizes heterogeneous hardware and speculative decoding with novel optimizations for improved efficiency.
Findings
Achieves 1.79x to 10.1x speedup on 13B models.
Reduces communication overhead through data transfer granularity optimization.
Maintains output quality and stability during acceleration.
Abstract
With the continuous advancement in the performance of large language models (LLMs), their demand for computational resources and memory has significantly increased, which poses major challenges for efficient inference on consumer-grade devices and legacy servers. These devices typically feature relatively weaker GPUs and stronger CPUs. Although techniques such as parameter offloading and partial offloading can alleviate GPU memory pressure to some extent, their effectiveness is limited due to communication latency and suboptimal hardware resource utilization. To address this issue, we propose Dovetail, a lossless inference acceleration method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail deploys a draft model on the GPU to perform preliminary predictions, while a target model running on the CPU validates…
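The draft-then-validate idea described in the abstract can be illustrated with a minimal sketch of generic greedy speculative decoding. This is not Dovetail's implementation: the names `speculative_decode`, `greedy_decode`, `draft_next`, `target_next`, and the draft length `k` are hypothetical, the models are stand-in functions, and device placement (draft on GPU, target on CPU) and data-transfer handling are omitted. It only shows why the output is lossless: every emitted token is one the target model itself would have chosen.

```python
from typing import Callable, List

Token = int
# Assumption: a "model" is a function returning the greedy next token for a context.
NextToken = Callable[[List[Token]], Token]


def greedy_decode(next_token: NextToken, prompt: List[Token], n: int) -> List[Token]:
    """Plain autoregressive decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(
    draft_next: NextToken,
    target_next: NextToken,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> List[Token]:
    """Greedy speculative decoding: the draft proposes k tokens, the target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # Draft phase: the small model proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # Verify phase: a real target model would score all positions in one
        # forward pass; this sketch calls it once per position for clarity.
        accepted: List[Token] = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if expected == draft[i]:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # All k drafts accepted: the target contributes one extra token.
            accepted.append(target_next(tokens + draft))

        tokens.extend(accepted)
    return tokens[:goal]
```

Each iteration emits at least one token chosen by the target, so the loop always makes progress, and the result matches what `greedy_decode` would produce with the target alone. Any speedup in practice depends on how often the draft's proposals are accepted and on verifying them in a single batched pass.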
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ALIGN
