LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems
Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

TL;DR
LeMix is a system that co-locates LLM training and inference workloads on multi-GPU systems, improving resource utilization and response times by dynamically scheduling concurrent tasks based on workload predictions.
Contribution
LeMix introduces a unified scheduling framework that manages simultaneous LLM training and inference, addressing inefficiencies of traditional separate deployments.
Findings
Up to 3.53x throughput improvement
Inference loss reduced by up to 0.61x
Response time SLO attainment increased by up to 2.12x
Abstract
Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific…
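As a rough illustration of the idea in the abstract (profiled costs feeding a prediction that drives runtime placement), here is a toy sketch. It is not LeMix's algorithm: the `Node` class, the `PROFILE` table, its latency values, and the earliest-predicted-finish rule are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical GPU node shared by training and inference tasks."""
    name: str
    busy_until: float = 0.0  # predicted time at which the node's queue drains
    assigned: list = field(default_factory=list)


# Hypothetical offline profile: estimated seconds per task kind (made-up values).
PROFILE = {"infer": 0.2, "train": 1.0}


def schedule(task_id, kind, arrival, nodes, profile=PROFILE):
    """Place a task on the node with the earliest predicted completion time."""

    def predicted_finish(node):
        # A task starts once it has arrived and the node's queue has drained.
        start = max(arrival, node.busy_until)
        return start + profile[kind]

    best = min(nodes, key=predicted_finish)  # ties go to the first node listed
    best.busy_until = predicted_finish(best)
    best.assigned.append((task_id, kind))
    return best.name, best.busy_until


nodes = [Node("gpu0"), Node("gpu1")]
for task_id, kind, arrival in [("t0", "train", 0.0), ("i0", "infer", 0.1), ("i1", "infer", 0.2)]:
    print(task_id, schedule(task_id, kind, arrival, nodes))
```

In this toy run the training step occupies one node while both inference requests land on the other, which is the kind of mixing a co-located scheduler aims for. A real system would also have to account for pipeline stages, memory limits, and SLO deadlines, none of which are modeled here.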
