Temporal-Aware GPU Resource Allocation for Distributed LLM Inference via Reinforcement Learning

Chengze Du; Zhiwei Yu; Heng Xu; Haojie Wang; Bo liu; Jialong Li

arXiv:2507.10259 · cs.DC · September 17, 2025

Open Access

TL;DR

This paper introduces TORTA, a temporal-aware scheduling framework for distributed GPU inference of large language models, which improves efficiency, responsiveness, and cost-effectiveness by considering workload dynamics over time.

Contribution

It proposes a novel two-layer reinforcement learning-based scheduling architecture that captures long-term workload patterns for better resource allocation.

Findings

01. Reduces average inference response time by up to 15%

02. Improves load balance by approximately 4-5%

03. Cuts total operational cost by 10-20%

Abstract

The rapid growth of large language model (LLM) services imposes increasing demands on distributed GPU inference infrastructure. Most existing scheduling systems follow a reactive paradigm, relying solely on the current system state to make decisions, without considering how task demand and resource availability evolve over time. This lack of temporal awareness in reactive approaches leads to inefficient GPU utilization, high task migration overhead, and poor system responsiveness under dynamic workloads. In this work, we identify the fundamental limitations of these instantaneous-state-only scheduling approaches and propose Temporal Optimal Resource scheduling via Two-layer Architecture (TORTA). TORTA introduces a spatiotemporal scheduling framework that captures both long-term workload patterns and short-term execution constraints. It adopts a two-layer design: a macro-level scheduler…
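
As a toy illustration of the contrast the abstract draws between reactive and temporally aware scheduling, the sketch below compares a policy that looks only at current load with one that also weighs a forecast of near-future load. It is not TORTA's algorithm, which the abstract describes as a two-layer, reinforcement-learning-based design. The `Region` type, the `forecast_load` field, the `horizon_weight` parameter, and all numbers are hypothetical and chosen only to make the difference visible.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical GPU region; fields are illustrative, not from the paper."""
    name: str
    current_load: float   # work queued right now (arbitrary units)
    forecast_load: float  # assumed prediction of arrivals in the next window


def pick_reactive(regions):
    """Reactive baseline: decide from the instantaneous state only."""
    return min(regions, key=lambda r: r.current_load)


def pick_temporal(regions, horizon_weight=1.0):
    """Forecast-aware choice: add weighted expected near-future load."""
    return min(
        regions,
        key=lambda r: r.current_load + horizon_weight * r.forecast_load,
    )


regions = [
    Region("east", current_load=10.0, forecast_load=50.0),  # quiet now, busy soon
    Region("west", current_load=20.0, forecast_load=5.0),   # busier now, quiet soon
]

reactive_choice = pick_reactive(regions).name
temporal_choice = pick_temporal(regions).name
print(reactive_choice, temporal_choice)
```

In this made-up scenario the reactive policy sends work to the region that is about to become congested, while the forecast-aware policy avoids it. A real system would need an actual demand predictor and would have to account for migration cost, which this sketch ignores.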

Taxonomy

Topics: Distributed and Parallel Computing Systems