Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Mert Yildiz; Pietro Spadaccino; Alexey Rolich; Francesca Cuomo; Andrea Baiocchi

arXiv:2605.19593 · cs.AI · May 20, 2026

TL;DR

This paper is an empirical study of multi-model LLM serving on shared, heterogeneous hardware. It measures how layer offloading and preemption affect throughput across models and platforms, with the aim of guiding scheduler design.

Contribution

The paper characterizes the performance cost of layer offloading and preemption for different LLMs across hardware platforms. These measurements are meant to inform the design of more efficient multi-model schedulers.

Findings

1. Offloading causes non-linear, model-dependent throughput degradation.

2. Preemption overhead is dominated by model state reload, varying across models and hardware.

3. Sequence length and interconnect bandwidth significantly affect data movement and execution efficiency.
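
To give some intuition for why throughput need not fall linearly with the number of offloaded layers, here is a minimal toy sketch. It is not the paper's model or measurement: it assumes per-token decode latency is simply the sum of per-layer latencies, and the function name `decode_throughput` and the latencies `t_gpu_ms` and `t_cpu_ms` are hypothetical values chosen only for illustration.

```python
def decode_throughput(n_layers, n_offloaded, t_gpu_ms=0.5, t_cpu_ms=5.0):
    """Tokens per second under a toy additive latency model.

    Assumption (not from the paper): each layer costs a fixed time per
    token, t_gpu_ms if resident on the GPU and t_cpu_ms if offloaded.
    """
    if not 0 <= n_offloaded <= n_layers:
        raise ValueError("n_offloaded must be between 0 and n_layers")
    per_token_ms = (n_layers - n_offloaded) * t_gpu_ms + n_offloaded * t_cpu_ms
    return 1000.0 / per_token_ms


if __name__ == "__main__":
    n_layers = 32  # hypothetical model depth
    for k in (0, 4, 8, 16, 32):
        print(f"{k:2d} layers offloaded: {decode_throughput(n_layers, k):6.2f} tok/s")
```

Because throughput is the reciprocal of a latency that grows linearly with the number of offloaded layers, the first few offloaded layers cost far more throughput than the last few under these assumptions. Real systems add transfer, caching, and overlap effects that this sketch ignores.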

Abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU…
