Topology-aware Preemptive Scheduling for Co-located LLM Workloads
Ping Zhang, Lei Su, Jinjie Yang, Xin Chen

TL;DR
This paper introduces a topology-aware preemptive scheduling method for co-located large language model workloads, significantly improving resource utilization and performance by aligning resource topology with workload priorities.
Contribution
It presents a novel fine-grained topology-aware preemption approach that ensures resource topology preferences are met, enhancing efficiency for co-located LLM workloads.
Findings
Preemption efficiency increased by 55%.
Improved resource utilization in co-located workloads.
Enhanced performance for latency-sensitive LLM services.
Abstract
Hosting diverse large language model workloads in a unified resource pool through co-location is cost-effective. For example, long-running chat services generally follow diurnal traffic patterns, which motivates co-locating batch jobs to fill the resource valleys between successive peaks and thereby saturate resource allocation cluster-wide. These heterogeneous workloads often have different business priorities, so preemption can be leveraged for resource elasticity. However, workloads often have distinct topology preferences as well. The resources released by lower-priority instances may fail to meet the requirements of high-priority online services, which are usually latency-sensitive. The root cause of this mismatch is a lack of topology awareness in the resource scheduler, especially during preemption. To bridge this gap, we develop a fine-grained topology-aware…
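The mismatch described in the abstract can be illustrated with a toy example. The following is a minimal Python sketch, not the paper's algorithm: the `Gpu` record, the notion of a single integer "topology domain" (standing in for something like a NUMA node or interconnect island), and the functions `blind_victims` and `aware_victims` are all hypothetical names introduced here for illustration. It contrasts a topology-blind preemption, which frees the cheapest GPUs wherever they sit, with a topology-aware one, which only accepts a set of GPUs that share a domain.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Gpu:
    gpu_id: int
    domain: int            # hypothetical topology domain (assumption)
    owner: Optional[str]   # None when the GPU is free
    priority: int          # priority of the current owner; unused when free


def _pick(pool: List[Gpu], need: int, prio: int) -> Optional[List[Gpu]]:
    """Take free GPUs first, then evict the lowest-priority owners below `prio`."""
    free = [g for g in pool if g.owner is None]
    evictable = sorted(
        (g for g in pool if g.owner is not None and g.priority < prio),
        key=lambda g: (g.priority, g.gpu_id),
    )
    chosen = free[:need] + evictable[: max(0, need - len(free))]
    return chosen if len(chosen) == need else None


def blind_victims(gpus: List[Gpu], need: int, prio: int) -> Optional[List[Gpu]]:
    """Topology-blind: any `need` GPUs in the pool will do."""
    return _pick(gpus, need, prio)


def aware_victims(gpus: List[Gpu], need: int, prio: int) -> Optional[List[Gpu]]:
    """Topology-aware: all `need` GPUs must come from a single domain."""
    by_domain: Dict[int, List[Gpu]] = {}
    for g in gpus:
        by_domain.setdefault(g.domain, []).append(g)

    best, best_cost = None, None
    for _, members in sorted(by_domain.items()):
        chosen = _pick(members, need, prio)
        if chosen is None:
            continue
        # Cost: number of evictions, then total priority of evicted owners.
        evicted = [g for g in chosen if g.owner is not None]
        cost = (len(evicted), sum(g.priority for g in evicted))
        if best_cost is None or cost < best_cost:
            best, best_cost = chosen, cost
    return best


def same_domain(chosen: List[Gpu]) -> bool:
    return len({g.domain for g in chosen}) == 1


# Toy pool: domain 0 holds one batch job and one online service,
# domain 1 holds two batch jobs.
CLUSTER = [
    Gpu(0, domain=0, owner="batch-a", priority=1),
    Gpu(1, domain=0, owner="online-x", priority=10),
    Gpu(2, domain=1, owner="batch-b", priority=2),
    Gpu(3, domain=1, owner="batch-c", priority=3),
]

if __name__ == "__main__":
    blind = blind_victims(CLUSTER, need=2, prio=5)
    aware = aware_victims(CLUSTER, need=2, prio=5)
    print([g.gpu_id for g in blind], same_domain(blind))
    print([g.gpu_id for g in aware], same_domain(aware))
```

In this toy pool, the topology-blind choice evicts the two lowest-priority jobs, which sit in different domains, so the released GPUs do not satisfy a request that needs both GPUs in one domain. The topology-aware choice evicts slightly higher-priority jobs but releases a usable pair. Real schedulers deal with multi-level topologies and many more constraints; this sketch only shows why priority alone is an insufficient criterion for selecting victims.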
Taxonomy
Topics: Distributed and Parallel Computing Systems · Scheduling and Optimization Algorithms
