Temporal-Aware GPU Resource Allocation for Distributed LLM Inference via Reinforcement Learning
Chengze Du, Zhiwei Yu, Heng Xu, Haojie Wang, Bo Liu, Jialong Li

TL;DR
This paper introduces TORTA, a temporal-aware scheduling framework for distributed GPU inference of large language models, which improves efficiency, responsiveness, and cost-effectiveness by considering workload dynamics over time.
Contribution
It proposes a novel two-layer reinforcement learning-based scheduling architecture that captures long-term workload patterns for better resource allocation.
Findings
Reduces average inference response time by up to 15%
Improves load balance by approximately 4-5%
Cuts total operational cost by 10-20%
Abstract
The rapid growth of large language model (LLM) services imposes increasing demands on distributed GPU inference infrastructure. Most existing scheduling systems follow a reactive paradigm, relying solely on the current system state to make decisions, without considering how task demand and resource availability evolve over time. This lack of temporal awareness in reactive approaches leads to inefficient GPU utilization, high task migration overhead, and poor system responsiveness under dynamic workloads. In this work, we identify the fundamental limitations of these instantaneous-state-only scheduling approaches and propose Temporal Optimal Resource scheduling via Two-layer Architecture (TORTA). TORTA introduces a spatiotemporal scheduling framework that captures both long-term workload patterns and short-term execution constraints. It adopts a two-layer design: a macro-level scheduler…
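The contrast the abstract draws between reactive and temporal-aware scheduling can be illustrated with a toy example. This is only a sketch under stated assumptions, not TORTA's algorithm: the function names, the region names, the load values, and the simple averaging over a forecast window are all hypothetical, and the averaging stands in for the learned reinforcement-learning policy the paper describes.

```python
from typing import Dict, List


def reactive_choice(current_load: Dict[str, float]) -> str:
    """Pick the region with the lowest load right now (instantaneous state only)."""
    return min(current_load, key=current_load.get)


def temporal_choice(
    current_load: Dict[str, float],
    forecast: Dict[str, List[float]],
    task_duration: int,
) -> str:
    """Pick the region with the lowest average load over the task's expected lifetime.

    `forecast[region]` is a hypothetical list of predicted loads for the next
    time slots; only the first `task_duration` slots are considered.
    """
    def expected_load(region: str) -> float:
        window = [current_load[region]] + forecast[region][:task_duration]
        return sum(window) / len(window)

    return min(current_load, key=expected_load)


# Hypothetical data: "east" is less loaded now but is about to hit a demand peak.
current = {"east": 0.40, "west": 0.55}
forecast = {"east": [0.70, 0.90, 0.95], "west": [0.50, 0.45, 0.40]}

print(reactive_choice(current))                             # east
print(temporal_choice(current, forecast, task_duration=3))  # west
```

In this made-up scenario the reactive rule sends the task to the region that is about to become congested, while the forecast-aware rule avoids it. That is the kind of inefficiency the abstract attributes to instantaneous-state-only scheduling.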
Taxonomy
Topics: Distributed and Parallel Computing Systems
