LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

Yufei Li; Zexin Li; Yinglun Zhu; Cong Liu

arXiv:2507.21276·cs.AI·July 30, 2025

TL;DR

LeMix is a system that co-locates LLM training and inference workloads on multi-GPU systems, improving resource utilization and response times by dynamically scheduling concurrent tasks based on workload predictions.

Contribution

LeMix introduces a unified scheduling framework that manages simultaneous LLM training and inference, addressing inefficiencies of traditional separate deployments.

Findings

01. Up to 3.53x throughput improvement
02. Inference loss reduced by up to 0.61x
03. Response time SLO attainment increased by up to 2.12x

Abstract

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific…
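
As a rough illustration of the co-location idea described in the abstract, the sketch below places inference requests into the idle windows that pipeline-parallel training leaves on each GPU, using predicted durations. It is not LeMix's actual algorithm: the `Gpu` and `Request` types, the `place_requests` function, the idle-window and latency numbers, and the greedy best-fit policy are all hypothetical, chosen only to show how profiled idle time plus a latency prediction could drive a runtime placement decision.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    idle_ms: float  # assumed: idle window predicted from offline profiling


@dataclass
class Request:
    rid: int
    predicted_ms: float  # assumed: output of a hypothetical latency predictor


def place_requests(gpus, requests):
    """Greedy best-fit placement (illustrative only).

    Each request goes to the GPU whose remaining idle window is the
    smallest one that still fits the request's predicted latency.
    Requests that fit nowhere are deferred.
    """
    remaining = {g.name: g.idle_ms for g in gpus}
    placement = {}
    deferred = []
    # Handle the longest requests first so they get first pick of large windows.
    for req in sorted(requests, key=lambda r: r.predicted_ms, reverse=True):
        fits = [
            (left, name)
            for name, left in remaining.items()
            if left >= req.predicted_ms
        ]
        if not fits:
            deferred.append(req.rid)
            continue
        _, name = min(fits)  # tightest window that still fits
        remaining[name] -= req.predicted_ms
        placement[req.rid] = name
    return placement, deferred, remaining
```

With two GPUs offering 30 ms and 12 ms of idle time and three requests predicted at 10, 25, and 8 ms, this policy places the 25 ms request on the first GPU, the 10 ms request on the second, and defers the 8 ms request because no remaining window is large enough. A real scheduler would also have to account for memory pressure, prediction error, and response-time targets, which this sketch ignores.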
