SLA-Aware Distributed LLM Inference Across Device-RAN-Cloud

Hariz Yet, Nguyen Thanh Tam, Mao V. Ngo, Lim Yi Shen, Lin Wei, Jihong Park, Binbin Chen, and Tony Q. S. Quek

arXiv:2602.23722 · cs.NI · March 2, 2026

TL;DR

This paper evaluates the latency and feasibility of distributed large language model inference across device, RAN-edge, and cloud tiers in 5G networks, highlighting model choices and infrastructure strategies to meet real-time constraints.

Contribution

The paper provides empirical measurements of SLA feasibility for distributed LLM inference in 5G RAN environments, with emphasis on model quantization and GPU isolation techniques.

Findings

01

On-device inference remains multi-second and fails to meet sub-second latency budgets.

02

At the RAN edge, quantized models meet the 0.5 s deadline more reliably than unquantized variants, which incur deadline misses due to stalls and queuing.

03

Cloud inference can meet a 1.0 s deadline but struggles with a 0.5 s deadline on the measured WAN path, where at most 32.9% of requests complete within 0.5 s.

Abstract

Embodied AI requires sub-second inference near the Radio Access Network (RAN), but deployments span heterogeneous tiers (on-device, RAN-edge, cloud) and must not disrupt real-time baseband processing. We report measurements from a 5G Standalone (SA) AI-RAN testbed using a fixed baseline policy for repeatability. The setup includes an on-device tier, a three-node RAN-edge cluster co-hosting a containerized 5G RAN, and a cloud tier. We find that on-device execution remains multi-second and fails to meet sub-second budgets. At the RAN edge, SLA feasibility is primarily determined by model variant choice: quantized models concentrate below 0.5 s, while unquantized and some larger quantized models incur deadline misses due to stalls and queuing. In the cloud tier, meeting a 0.5 s deadline is challenging on the measured WAN path (up to 32.9% of requests complete within 0.5 s), but all…
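The feasibility figures quoted in the abstract (for example, the share of requests completing within 0.5 s) can be read as a deadline hit rate per tier. The sketch below shows one way such a rate could be computed from per-request latencies. The function names, the tier labels, and every latency value are hypothetical placeholders chosen for illustration; they are not the paper's tooling or its measurements.

```python
from typing import Dict, Iterable, List, Sequence


def deadline_hit_rate(latencies_s: Iterable[float], deadline_s: float) -> float:
    """Return the fraction of requests whose latency is within the deadline."""
    samples = list(latencies_s)
    if not samples:
        raise ValueError("need at least one latency sample")
    return sum(1 for t in samples if t <= deadline_s) / len(samples)


def feasibility_table(
    latencies_by_tier: Dict[str, List[float]],
    deadlines_s: Sequence[float] = (0.5, 1.0),
) -> Dict[str, Dict[float, float]]:
    """Compute the hit rate for every (tier, deadline) pair."""
    return {
        tier: {d: deadline_hit_rate(samples, d) for d in deadlines_s}
        for tier, samples in latencies_by_tier.items()
    }


# Hypothetical end-to-end latencies in seconds (NOT data from the paper).
example_latencies: Dict[str, List[float]] = {
    "on-device": [2.4, 3.1, 2.8, 2.6],
    "ran-edge": [0.31, 0.42, 0.38, 0.55],
    "cloud": [0.48, 0.62, 0.71, 0.90],
}

if __name__ == "__main__":
    for tier, rates in feasibility_table(example_latencies).items():
        summary = ", ".join(f"<= {d} s: {r:.0%}" for d, r in rates.items())
        print(f"{tier}: {summary}")
```

Reporting a hit rate per deadline, rather than a mean latency, keeps tail behavior such as stalls and queuing visible, which is the effect the abstract attributes to deadline misses at the RAN edge.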

Taxonomy

Topics: Software-Defined Networks and 5G · IoT Networks and Protocols · Advanced MIMO Systems Optimization