Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices
Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesterer, João Paulo Cardoso de Lima, Jeronimo Castrillon

TL;DR
This paper presents a compiler-assisted approach to speculative sampling that accelerates large language model inference on heterogeneous edge devices. An analytical cost model guides how the model is partitioned across the available compute units, and the authors report speedups of up to 1.68× on translation tasks.
Contribution
It introduces an analytical cost model that guides heterogeneous hardware partitioning for speculative decoding, addressing integration and resource exploitation challenges at the edge.
Findings
Achieves up to 1.68× speedup on edge devices for translation tasks.
Validates the cost model with real hardware, closely matching predictions.
Demonstrates effective integration of speculative sampling into compiler workflows.
Abstract
LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when…
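To make the trade-off behind speculative decoding concrete, the sketch below evaluates the widely used textbook estimate of its expected speedup. It is not the paper's cost model, which the truncated abstract does not specify. The names `alpha` (per-token acceptance probability, assumed independent across tokens), `gamma` (number of drafted tokens per step), and `cost_ratio` (draft-model latency divided by target-model latency) are illustrative assumptions, as is the simplification that the draft and target models run one after the other.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with
    probability alpha (a simplifying assumption, not from the paper).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft is accepted, plus one token from the target model.
        return float(gamma + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured
    in units of a single target pass (sequential execution assumed).
    """
    if cost_ratio < 0.0:
        raise ValueError("cost_ratio must be non-negative")
    step_cost = gamma * cost_ratio + 1.0
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Hypothetical values, chosen only to exercise the formula.
    for gamma in (1, 2, 4, 8):
        print(gamma, round(expected_speedup(0.8, gamma, 0.1), 3))
```

Under these assumptions the estimate drops below 1 whenever the draft model is too slow or its tokens are rejected too often. Predicting which side of that threshold a given configuration falls on is the kind of question a cost model is meant to answer before anything is deployed.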
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed systems and fault tolerance
