Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Yeshwanth Venkatesha; Souvik Kundu; Priyadarshini Panda

arXiv:2505.21594 · cs.RO · May 29, 2025

Open Access

TL;DR

This paper introduces a speculative edge-cloud decoding framework that reduces latency and operational costs for deploying large language models on edge devices by using early exits and preemptive drafting, enabling real-time applications.

Contribution

The paper proposes an edge-cloud decoding method that combines early exits in the target model with preemptive token drafting on the device, improving speed and cost-efficiency for LLM deployment on edge devices.

Findings

01. Achieves up to 35% latency reduction compared to traditional cloud decoding.

02. Provides an 11% additional speedup through preemptive token drafting.

03. Demonstrates real-world deployment on a quadruped robot with 21% speed improvement.

Abstract

Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before…
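
The draft-and-verify loop that underlies speculative decoding can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses greedy decoding, stands in toy functions (`toy_draft`, `toy_target`, both hypothetical) for the on-device draft model and the server-side target model, and verifies tokens one position at a time where a real system would use a single batched forward pass on the server. It does not model early exits or preemptive drafting.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function (assumed interface)


def autoregressive_decode(target_next: NextToken, prompt: List[Token],
                          max_new_tokens: int) -> List[Token]:
    """Baseline: the target model produces every token, one call per token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(draft_next: NextToken, target_next: NextToken,
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft up to k tokens, then verify them."""
    out = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(out) < goal:
        # Edge side: the small model drafts up to k tokens.
        n = min(k, goal - len(out))
        draft: List[Token] = []
        for _ in range(n):
            draft.append(draft_next(out + draft))

        # Cloud side: the large model checks each drafted token in order.
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, discard the rest.
                accepted.append(expected)
                break
        else:
            # Every drafted token matched; the target adds one more if there is room.
            if len(out) + len(accepted) < goal:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):]


# Toy stand-ins (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after the token 3.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10
```

With greedy decoding, the speculative loop returns exactly the tokens the target model would have produced alone. The potential saving comes from verifying several drafted tokens per server round trip instead of generating one token per round trip.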

Taxonomy

Topics: Robotics and Automated Systems · IoT and Edge/Fog Computing · Advanced Neural Network Applications