GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

Zengzipeng Tang, Yuxuan Sun, Wei Chen, Jianwen Ding, and Bo Ai

arXiv:2605.10124 · cs.NI · May 12, 2026

TL;DR

GELATO is an adaptive token offloading framework for device-edge speculative LLM inference. It maximizes decoding throughput under energy constraints by scheduling resources online and accounting for generative uncertainty.

Contribution

The paper introduces an online decision-making framework that combines a Lyapunov drift-plus-penalty loop with an entropy-driven mechanism to schedule token offloading in resource-constrained device-edge settings.
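
As a point of reference for the entropy mechanism mentioned above, the following is a minimal sketch of how generative uncertainty can be quantified as the Shannon entropy of a next-token distribution. The summary does not specify how GELATO uses entropy, so the function name `shannon_entropy` and the example distributions are illustrative assumptions, not the authors' method.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution.

    Illustrative helper, not taken from the paper. Zero-probability
    entries are skipped because p * log(p) tends to 0 as p tends to 0.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical next-token distributions from a draft model.
peaked = [0.97, 0.01, 0.01, 0.01]   # draft model is confident
flat = [0.25, 0.25, 0.25, 0.25]     # draft model is uncertain

# A peaked distribution has lower entropy than a flat one.
assert shannon_entropy(peaked) < shannon_entropy(flat)
```

Low entropy indicates a confident draft prediction and high entropy an uncertain one; how such a signal would feed into offloading decisions is left to the full paper.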

Findings

01. Achieves 64.98% higher token throughput than state-of-the-art methods.

02. Reduces energy consumption by 47.47% while maintaining decoding quality.

03. Provides a theoretical performance bound for the proposed approach.

Abstract

The recent growth of on-device Large Language Model (LLM) inference has driven significant interest in device-edge collaborative LLM inference. Speculative Decoding (SD), in which a lightweight draft model rapidly generates candidate tokens to be verified by a powerful target model, is increasingly adopted as a promising architecture. However, a fundamental challenge lies in achieving per-token resource scheduling that effectively adapts the SD paradigm to resource-constrained edge environments. This paper proposes a Generative Entropy- and Lyapunov-based Adaptive Token Offloading framework, named GELATO, to maximize decoding throughput under energy constraints in a device-edge collaborative SD system. Specifically, an outer drift-plus-penalty loop makes online decisions to establish a reference drafting budget, managing the long-term energy-throughput trade-off. Further, a nested entropy-driven…
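
The outer drift-plus-penalty loop described in the abstract can be illustrated with a generic Lyapunov-optimization sketch. This is not the paper's algorithm: the virtual energy queue, the i.i.d. acceptance model in `expected_tokens`, the linear energy model in `draft_energy`, and all constants are assumptions chosen only to show how a drafting budget could be picked online under a long-term energy constraint.

```python
from dataclasses import dataclass


@dataclass
class EnergyQueue:
    """Virtual queue tracking energy spent above a long-term budget."""
    budget_per_slot: float  # hypothetical average energy budget per slot (J)
    backlog: float = 0.0

    def update(self, energy_used: float) -> None:
        # Standard virtual-queue update: Q(t+1) = max(Q(t) + e(t) - E_avg, 0).
        self.backlog = max(self.backlog + energy_used - self.budget_per_slot, 0.0)


def expected_tokens(k: int, accept_rate: float) -> float:
    """Expected tokens per verification round when k tokens are drafted.

    Toy model: each drafted token is accepted independently with
    probability accept_rate, giving (1 - a^(k+1)) / (1 - a).
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def draft_energy(k: int, joules_per_token: float = 0.5, joules_uplink: float = 1.0) -> float:
    """Hypothetical device energy: linear drafting cost plus a fixed uplink cost."""
    return joules_per_token * k + joules_uplink


def choose_budget(queue: EnergyQueue, accept_rate: float, v: float, candidates=range(1, 9)) -> int:
    """Pick the drafting budget maximizing V * throughput - Q(t) * energy.

    A larger v favors throughput; a larger backlog favors saving energy.
    """
    return max(
        candidates,
        key=lambda k: v * expected_tokens(k, accept_rate) - queue.backlog * draft_energy(k),
    )


# One simulated slot: decide, spend energy, then update the virtual queue.
queue = EnergyQueue(budget_per_slot=2.0)
k = choose_budget(queue, accept_rate=0.8, v=10.0)
queue.update(draft_energy(k))
```

With an empty queue the rule drafts aggressively; as energy overuse accumulates in the backlog, the energy term dominates and the chosen budget shrinks. The weight `v` sets where the long-term energy-throughput trade-off settles.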
