SLICE: SLO-Driven Scheduling for LLM Inference on Edge Computing Devices
Will Chow

TL;DR
SLICE is a scheduling system for edge-based LLM inference that optimizes for diverse SLOs, significantly reducing latency violations and improving task completion times compared to existing methods.
Contribution
SLICE introduces a utility-maximizing scheduling algorithm combined with dynamic control to better meet varied SLOs in edge LLM inference scenarios.
Findings
Up to 35x higher SLO attainment compared to state-of-the-art.
Achieves 3.4x faster task completion times.
Effectively handles differentiated latency requirements.
Abstract
Large Language Models (LLMs), as the foundational architecture for next-generation interactive AI applications, not only power intelligent dialogue systems but also drive the evolution of embodied intelligence on edge devices such as humanoid robots and smart vehicles. The applications running on these edge devices impose differentiated Service Level Objective (SLO) requirements on LLM services, expressed as distinct constraints on Time to First Token (TTFT), Time Per Output Token (TPOT), and end-to-end latency. Notably, edge devices typically handle real-time tasks that are extremely sensitive to latency, such as machine control and navigation planning. However, existing scheduling systems still treat maximizing output token throughput as the sole optimization objective, failing to adequately address the diversity of SLO…
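The three SLO dimensions named in the abstract (TTFT, TPOT, end-to-end latency) can be made concrete with a small sketch. This is not SLICE's implementation. The `SLO` class, the function names, and the choice to define TPOT as the mean gap between tokens after the first one are all assumptions made for illustration, and other definitions of TPOT exist.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SLO:
    """Hypothetical per-request latency targets in seconds (None = unconstrained)."""
    ttft: Optional[float] = None
    tpot: Optional[float] = None
    e2e: Optional[float] = None


def latency_metrics(arrival: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean TPOT, and end-to-end latency from token emission timestamps."""
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - arrival
    e2e = token_times[-1] - arrival
    # Assumption: TPOT is the average inter-token gap after the first token.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}


def meets_slo(metrics: Dict[str, float], slo: SLO) -> bool:
    """A request attains its SLO only if every constrained metric is within its limit."""
    checks = [
        (metrics["ttft"], slo.ttft),
        (metrics["tpot"], slo.tpot),
        (metrics["e2e"], slo.e2e),
    ]
    return all(limit is None or value <= limit for value, limit in checks)


def slo_attainment(results: List[bool]) -> float:
    """Fraction of requests that met their SLO (0.0 for an empty list)."""
    return sum(results) / len(results) if results else 0.0
```

Under these assumptions, a control task might carry a tight `SLO(ttft=0.1, tpot=0.05)` while a chat request only sets `SLO(e2e=10.0)`. Each request is checked with `meets_slo(latency_metrics(arrival, token_times), slo)`, and the per-request booleans are passed to `slo_attainment` to get an attainment rate.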
Taxonomy
Topics: Big Data and Digital Economy · IoT and Edge/Fog Computing · Multimodal Machine Learning Applications
