Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits
Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda

TL;DR
This paper introduces a speculative edge-cloud decoding framework that uses early exits and preemptive drafting to reduce the latency and operational cost of serving large language models to edge devices, making real-time applications practical.
Contribution
It proposes a novel edge-cloud decoding method with early exits and token preemption, improving speed and cost-efficiency for on-device LLM deployment.
Findings
Achieves up to 35% latency reduction compared to traditional cloud decoding.
Provides an 11% additional speedup through preemptive token drafting.
Demonstrates real-world deployment on a quadruped robot with 21% speed improvement.
Abstract
Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before…
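The draft-and-verify loop behind the abstract can be illustrated with a minimal sketch. It assumes greedy decoding and uses two toy next-token functions in place of real models; every name here (`toy_target`, `toy_draft`, `draft_tokens`, `verify`, `speculative_decode`) is hypothetical and not taken from the paper. It models only the basic speculative step, in which the device drafts `k` tokens and the server keeps the longest prefix the target agrees with. The early exits and preemptive drafting that the paper adds on top are not modeled.

```python
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token


def toy_target(prefix: List[Token]) -> Token:
    """Stand-in for the large server-side model (assumption: counts up mod 10)."""
    return (prefix[-1] + 1) % 10


def toy_draft(prefix: List[Token]) -> Token:
    """Stand-in for the small on-device model: agrees with the target except after a 4."""
    return 0 if prefix[-1] == 4 else (prefix[-1] + 1) % 10


def draft_tokens(draft: Model, prefix: List[Token], k: int) -> List[Token]:
    """Device side: propose k tokens autoregressively with the draft model."""
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))
    return proposed


def verify(target: Model, prefix: List[Token], proposed: List[Token]) -> List[Token]:
    """Server side: accept drafted tokens until the first disagreement.

    A real target model would score all positions in one forward pass;
    this sketch calls the toy target once per position for clarity.
    """
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong draft token
            return accepted
        accepted.append(tok)
    accepted.append(target(prefix + accepted))  # all drafts accepted: one extra token
    return accepted


def speculative_decode(draft: Model, target: Model, prompt: List[Token],
                       k: int, max_new: int) -> List[Token]:
    """Alternate drafting and verification until max_new tokens are produced."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        seq.extend(verify(target, seq, draft_tokens(draft, seq, k)))
    return seq[:len(prompt) + max_new]


def greedy_decode(target: Model, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: the target model alone, one token per step."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target(seq))
    return seq


result = speculative_decode(toy_draft, toy_target, [0], k=3, max_new=8)
baseline = greedy_decode(toy_target, [0], max_new=8)
```

Under greedy acceptance the speculative output matches what the target model would produce on its own, so the draft model affects only how many server round trips are needed, not the generated text.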
Taxonomy
Topics: Robotics and Automated Systems · IoT and Edge/Fog Computing · Advanced Neural Network Applications
