Accuracy-Delay Trade-Off in LLM Offloading via Token-Level Uncertainty

Yumin Kim; Hyeonsu Lyu; Minjae Lee; Hyun Jong Yang

arXiv:2602.07958 · eess.SY · February 10, 2026

Open Access

TL;DR

This paper introduces a token-level uncertainty-aware offloading framework for LLMs in mobile edge computing, balancing inference accuracy and latency by dynamically deciding between local computation and offloading based on uncertainty metrics.

Contribution

It proposes a novel margin-based token-level uncertainty metric and a greedy offloading algorithm that improves delay-accuracy trade-offs in MEC environments.
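As a rough illustration of what a margin-based token-level uncertainty score can look like, the sketch below uses one common formulation: one minus the gap between the two most probable next-token probabilities. This listing does not give the paper's exact definition, so the formula, the mean aggregation over tokens, and the function names `token_margin_uncertainty` and `query_uncertainty` are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_margin_uncertainty(logits):
    """Assumed metric: 1 - (p_top1 - p_top2) for one token's logits.

    A small margin between the two most likely tokens gives a score
    near 1 (uncertain); a large margin gives a score near 0 (confident).
    """
    probs = sorted(softmax(logits), reverse=True)
    return 1.0 - (probs[0] - probs[1])


def query_uncertainty(per_token_logits):
    """Assumed aggregation: mean token-level score over the sequence."""
    scores = [token_margin_uncertainty(l) for l in per_token_logits]
    return sum(scores) / len(scores)
```

With this formulation, a token whose top two candidates are equally likely scores 1.0, while a token dominated by a single candidate scores close to 0.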

Findings

01. GOA outperforms baseline strategies in accuracy and latency.

02. The uncertainty metric correlates well with model accuracy.

03. The framework is scalable and practical for real-world MEC settings.

Abstract

Large language models (LLMs) offer significant potential for intelligent mobile services but are computationally intensive for resource-constrained devices. Mobile edge computing (MEC) allows such devices to offload inference tasks to edge servers (ESs), yet introduces latency due to communication and server-side queuing, especially in multi-user environments. In this work, we propose an uncertainty-aware offloading framework that dynamically decides whether to perform inference locally or offload it to the ES, based on token-level uncertainty and resource constraints. We define a margin-based token-level uncertainty metric and demonstrate its correlation with model accuracy. Leveraging this metric, we design a greedy offloading algorithm (GOA) that minimizes delay while maintaining accuracy by prioritizing offloading for high-uncertainty queries. Our experiments show that GOA…
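The idea of prioritizing high-uncertainty queries for offloading can be sketched as a simple greedy selection. This is not the paper's GOA, whose details are not given in this listing. It is a minimal stand-in that assumes each query already carries an uncertainty score, that edge-server load is modeled as a fixed number of slots (`server_slots`), and that queries below a hypothetical `threshold` are served locally.

```python
from dataclasses import dataclass


@dataclass
class Query:
    qid: str
    uncertainty: float  # assumed to be a score in [0, 1], higher = less certain


def greedy_offload(queries, server_slots, threshold=0.5):
    """Pick which queries to offload, most uncertain first.

    Illustrative assumptions: the edge server accepts at most
    `server_slots` queries, and any query whose uncertainty is below
    `threshold` is answered by the local model.
    """
    ranked = sorted(queries, key=lambda q: q.uncertainty, reverse=True)
    offloaded = set()
    for q in ranked:
        # The list is sorted in descending order, so once a query falls
        # below the threshold every remaining query does too.
        if len(offloaded) >= server_slots or q.uncertainty < threshold:
            break
        offloaded.add(q.qid)
    return offloaded


queries = [
    Query("a", 0.9),
    Query("b", 0.2),
    Query("c", 0.7),
    Query("d", 0.6),
]
chosen = greedy_offload(queries, server_slots=2)
```

Here `chosen` contains the two most uncertain queries, `a` and `c`. Raising `server_slots` admits `d` as well, while `b` stays local because it falls below the threshold.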

Taxonomy

Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications