Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

Alejandro Ruiz y Mesa; Guilherme Korol; Moritz Riesterer; João Paulo Cardoso de Lima; Jeronimo Castrillon

arXiv:2602.08060·cs.LG·February 11, 2026

Open Access

TL;DR

This paper presents a compiler-assisted approach to speculative sampling that accelerates large language model inference on heterogeneous edge devices. An analytical cost model guides how the workload is partitioned across the hardware, yielding up to 1.68× speedup on translation tasks.

Contribution

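The paper introduces an analytical cost model that guides heterogeneous hardware partitioning for speculative decoding. It targets two challenges at the edge: integrating speculative decoding into a compiler-based workflow, and exploiting the heterogeneous compute resources of modern SoCs.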

Findings

01. Achieves up to 1.68× speedup on edge devices for translation tasks.

02. Validates the cost model on real hardware, with measurements closely matching its predictions.

03. Demonstrates effective integration of speculative sampling into compiler workflows.

Abstract

LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when…
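To make the idea of an analytical cost model for speculative decoding concrete, here is a minimal sketch. It is not the paper's model, which this page does not spell out. It uses the expected-acceptance formula commonly used in the speculative decoding literature and assumes that each drafted token is accepted independently with probability `alpha`, that one verification pass of the target model costs the same as one ordinary target decoding step, that draft and target run back to back without overlap, and that moving data between devices is free. The function names, the device labels, and all latency values are hypothetical placeholders, not measurements from the paper.

```python
from itertools import product


def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced per speculative step.

    alpha: assumed per-token acceptance probability (i.i.d. assumption).
    gamma: number of tokens the draft model proposes per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus one token from the target pass.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def predicted_speedup(alpha: float, gamma: int,
                      t_draft: float, t_target: float) -> float:
    """Predicted speedup over plain token-by-token decoding with the target.

    One speculative step costs gamma draft passes plus one target pass.
    """
    step_latency = gamma * t_draft + t_target
    return expected_tokens_per_step(alpha, gamma) * t_target / step_latency


def best_configuration(draft_ms: dict, target_ms: dict,
                       alpha: float, gammas) -> tuple:
    """Exhaustively search device placements and draft lengths.

    draft_ms / target_ms map a device label to a per-token latency in ms.
    The baseline is the target model alone on its fastest device.
    Returns (speedup, draft_device, target_device, gamma).
    """
    baseline = min(target_ms.values())
    best = None
    for (d_dev, t_d), (t_dev, t_t), gamma in product(
            draft_ms.items(), target_ms.items(), gammas):
        tokens = expected_tokens_per_step(alpha, gamma)
        speedup = baseline * tokens / (gamma * t_d + t_t)
        if best is None or speedup > best[0]:
            best = (speedup, d_dev, t_dev, gamma)
    return best


if __name__ == "__main__":
    # Illustrative placeholder latencies (ms per token), not real measurements.
    draft_ms = {"cpu": 4.0, "gpu": 2.0}
    target_ms = {"cpu": 40.0, "gpu": 20.0}
    print(best_configuration(draft_ms, target_ms, alpha=0.8, gammas=range(1, 9)))
```

Under these assumptions, a prediction below 1.0 indicates that speculation would be slower than running the target model alone, which is the kind of go/no-go decision a cost model can inform before anything is deployed. A more realistic model would also account for inter-device transfer costs and any overlap between draft and target execution.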

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Distributed systems and fault tolerance