Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Jungyeon Koh; Hyun Jong Yang

arXiv:2511.01695·cs.LG·December 1, 2025

Open Access

TL;DR

This paper introduces a resource-aware parallel speculative decoding framework for efficient on-device LLM inference in mobile edge computing, optimizing user association and resource allocation with deep reinforcement learning.

Contribution

It presents the first unified framework that jointly optimizes user association and resource allocation for parallel speculative decoding in MEC systems.

Findings

01 Reduces end-to-end latency by up to 28.0%.

02 Reduces end-to-end latency by 23.7% on average.

03 Maintains inference accuracy while improving efficiency.

Abstract

The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end…
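
To make the draft/target split described in the abstract concrete, here is a minimal sketch of one common form of speculative decoding: plain, serial, greedy draft-then-verify. It is not the paper's parallel scheme and does not model UARA, the wireless channel, or the reinforcement learning solver. The names `speculative_step`, `generate`, `toy_draft`, and `toy_target` are hypothetical, and the two "models" are toy functions over integer tokens chosen only so the example runs on its own.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round with greedy acceptance (illustrative only)."""
    # Device side (assumed): the lightweight draft model proposes k tokens.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Server side (assumed): the target model checks each drafted token.
    # A real target would score all k positions in a single forward pass;
    # this toy version calls it once per position for clarity.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken, target_next: NextToken,
             k: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Repeat rounds until max_new_tokens are produced; also count rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "powerful" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical "lightweight" model: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=3, max_new_tokens=8)
    print(tokens, rounds)
```

With greedy acceptance the output matches what the target model alone would generate, while the number of device-to-server rounds is smaller than the number of generated tokens. Each round still costs one exchange between device and server, which is the communication overhead the abstract points to.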

Taxonomy

Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications