GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference
Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, and Bo Ai

TL;DR
GELATO is a novel adaptive token offloading framework for device-edge LLM inference that maximizes throughput under energy constraints by dynamically managing resource allocation and generative uncertainty.
Contribution
GELATO introduces an online decision-making framework that combines entropy and Lyapunov mechanisms to optimize token offloading in resource-limited device-edge settings.
Findings
Achieves 64.98% higher token throughput than state-of-the-art methods.
Reduces energy consumption by 47.47% while maintaining decoding quality.
Provides a theoretical performance bound for the proposed approach.
Abstract
The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. Speculative Decoding (SD), in which a lightweight draft model rapidly generates candidate tokens that a powerful target model then verifies, is a promising and increasingly adopted architecture for this setting. However, a fundamental challenge lies in achieving per-token resource scheduling that effectively adapts the SD paradigm to resource-constrained edge environments. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing the long-term energy-throughput trade-off. Further, a nested entropy-driven…
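To make the outer loop described above more concrete, the following is a minimal sketch of a generic drift-plus-penalty rule that picks a per-round drafting budget against a virtual energy queue. It is not the authors' algorithm: the linear energy model, the fixed acceptance probability, and every name and parameter (`round_energy`, `choose_budget`, `V`, `e_draft`, `e_tx`, `energy_budget`) are assumptions made for illustration. The expected-token expression is the standard speculative-decoding result for i.i.d. token acceptance. The nested entropy-driven mechanism is not sketched, because the abstract is truncated before describing it.

```python
import math


def expected_tokens(k, accept_prob):
    """Expected tokens emitted per round when k tokens are drafted.

    Standard speculative-decoding expression under the assumption that
    each drafted token is accepted i.i.d. with probability accept_prob.
    """
    if accept_prob >= 1.0:
        return float(k + 1)
    return (1.0 - accept_prob ** (k + 1)) / (1.0 - accept_prob)


def round_energy(k, e_draft=1.0, e_tx=0.5):
    """Hypothetical device energy per round: drafting cost plus one upload."""
    return k * e_draft + e_tx


def choose_budget(Q, V, accept_prob, k_max=8):
    """Drift-plus-penalty choice: maximize V * reward - Q * energy over k."""
    best_k, best_score = 1, -math.inf
    for k in range(1, k_max + 1):
        score = V * expected_tokens(k, accept_prob) - Q * round_energy(k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k


def update_queue(Q, energy_used, energy_budget):
    """Virtual queue update: accumulate energy spent above the budget."""
    return max(Q + energy_used - energy_budget, 0.0)


def simulate(rounds=200, V=10.0, accept_prob=0.8, energy_budget=3.0):
    """Run the loop and return the chosen budgets and the average energy."""
    Q, total_energy, budgets = 0.0, 0.0, []
    for _ in range(rounds):
        k = choose_budget(Q, V, accept_prob)
        energy = round_energy(k)
        Q = update_queue(Q, energy, energy_budget)
        total_energy += energy
        budgets.append(k)
    return budgets, total_energy / rounds


if __name__ == "__main__":
    budgets, avg_energy = simulate()
    print("first budgets:", budgets[:10])
    print("average energy per round:", round(avg_energy, 3))
```

In this sketch, a larger `V` weights throughput more heavily and lets the virtual queue grow further before the budget is cut, while a growing queue `Q` pushes the chosen budget toward smaller drafts so that average energy stays close to `energy_budget`.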
