SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices

Will Chow

arXiv:2510.18544 · cs.DC · November 19, 2025


TL;DR

SLICE is a scheduling system for edge-based LLM inference that optimizes for diverse SLOs, significantly reducing latency violations and improving task completion times compared to existing methods.

Contribution

SLICE introduces a utility-maximizing scheduling algorithm combined with dynamic control to better meet varied SLOs in edge LLM inference scenarios.
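To make "utility-maximizing scheduling" concrete, here is a minimal greedy sketch. It is not SLICE's algorithm, which this page does not describe. It assumes, hypothetically, that each request carries an estimated service time, an absolute SLO deadline, and a utility earned only if it finishes on time. At each step the scheduler runs the still-feasible request with the highest utility per second of service.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    service_time: float  # hypothetical estimated processing time (s)
    deadline: float      # hypothetical absolute SLO deadline (s)
    utility: float       # hypothetical reward earned only if finished by deadline


def greedy_utility_schedule(requests, now=0.0):
    """Greedy illustration (not SLICE's method): repeatedly run the request
    with the best utility per second among those that can still meet their
    deadline. Returns (execution order, total utility, unscheduled names)."""
    pending = list(requests)
    order, t, total = [], now, 0.0
    while pending:
        # Only requests that would still finish by their deadline earn utility.
        feasible = [r for r in pending if t + r.service_time <= r.deadline]
        if not feasible:
            break  # everything left would miss its SLO
        best = max(feasible, key=lambda r: r.utility / r.service_time)
        pending.remove(best)
        t += best.service_time
        order.append(best.name)
        total += best.utility
    return order, total, [r.name for r in pending]


reqs = [
    Request("chat", service_time=2.0, deadline=10.0, utility=1.0),
    Request("control", service_time=0.5, deadline=1.0, utility=5.0),
    Request("plan", service_time=1.0, deadline=3.0, utility=2.0),
]
print(greedy_utility_schedule(reqs))
```

With these made-up numbers the order is `control`, `plan`, `chat`, and all three meet their deadlines. First-come-first-served in list order would instead finish `control` at 2.5 s, missing its 1.0 s deadline. The gap illustrates why a throughput-only or arrival-order policy can fail latency-critical edge tasks.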

Findings

01

Up to 35x higher SLO attainment compared with state-of-the-art systems.

02

Achieves 3.4x faster task completion times.

03

Effectively handles differentiated latency requirements.

Abstract

Large Language Models (LLMs), as the foundational architecture for next-generation interactive AI applications, not only power intelligent dialogue systems but also drive the evolution of embodied intelligence on edge devices, including humanoid robots, smart vehicles, and other scenarios. The applications running on these edge devices impose differentiated Service Level Objective (SLO) requirements on LLM services, which manifest as distinct constraints on Time to First Token (TTFT), Time Per Output Token (TPOT), and end-to-end latency. Notably, edge devices typically handle real-time tasks that are extremely sensitive to latency, such as machine control and navigation planning. However, existing scheduling service systems still treat output-token throughput as their sole optimization objective, failing to adequately address the diversity of SLO…
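The three SLO dimensions named in the abstract can be computed from a request's arrival time and per-token emission timestamps. The sketch below uses common serving-benchmark definitions, which are an assumption rather than the paper's exact formulas. TTFT is first-token time minus arrival. End-to-end latency is last-token time minus arrival. TPOT is the mean gap over the remaining tokens, (e2e − TTFT) / (n − 1). "SLO attainment" is taken here, also as an assumption, to mean the fraction of requests that satisfy all three of their own limits. The `Trace` and `SLO` types are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    arrival: float            # request arrival time (s)
    token_times: list[float]  # absolute emission time of each output token (s)


@dataclass
class SLO:
    ttft: float  # max time to first token (s)
    tpot: float  # max mean time per output token after the first (s)
    e2e: float   # max end-to-end latency (s)


def latency_metrics(tr):
    """Return (TTFT, TPOT, end-to-end latency) for one request."""
    ttft = tr.token_times[0] - tr.arrival
    e2e = tr.token_times[-1] - tr.arrival
    n = len(tr.token_times)
    # TPOT averages the decode gaps after the first token.
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0
    return ttft, tpot, e2e


def meets_slo(tr, slo):
    ttft, tpot, e2e = latency_metrics(tr)
    return ttft <= slo.ttft and tpot <= slo.tpot and e2e <= slo.e2e


def slo_attainment(traces, slos):
    """Fraction of requests meeting every constraint of their own SLO."""
    hits = sum(meets_slo(t, s) for t, s in zip(traces, slos))
    return hits / len(traces)


fast = Trace(arrival=0.0, token_times=[0.25, 0.5, 0.75, 1.0])
slow_start = Trace(arrival=0.0, token_times=[1.5, 1.75])
tight = SLO(ttft=0.5, tpot=0.3, e2e=2.0)
print(latency_metrics(fast), slo_attainment([fast, slow_start], [tight, tight]))
```

Here `fast` has TTFT 0.25 s, TPOT 0.25 s, and end-to-end latency 1.0 s, so it passes. `slow_start` fails on TTFT alone, giving an attainment of 0.5. Because each request can carry its own `SLO`, the same function covers the differentiated requirements the abstract describes.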


Taxonomy

Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Multimodal Machine Learning Applications