TL;DR
This paper develops a PPO-based reinforcement learning approach for queueing network control problems and shows that the learned policies outperform existing heuristics under load conditions ranging from light to heavy traffic.
Contribution
It extends advanced policy gradient (APG) methods such as PPO to queueing network MDPs with infinite state spaces, unbounded costs, and a long-run average cost objective, and incorporates variance reduction techniques for effective control policy learning.
Findings
PPO outperforms state-of-the-art heuristics under load conditions ranging from light to heavy traffic.
The proposed variance reduction techniques improve the stability and accuracy of value function estimation.
In cases where the optimal solution is known, the learned policies are near-optimal.
Abstract
Novel advanced policy gradient (APG) methods, such as trust region policy optimization and proximal policy optimization (PPO), have become the dominant reinforcement learning algorithms because of their ease of implementation and good practical performance. A conventional setup for notoriously difficult queueing network control problems is a Markov decision problem (MDP) that has three features: infinite state space, unbounded costs, and long-run average cost objective. We extend the theoretical framework of these APG methods for such MDP problems. The resulting PPO algorithm is tested on a parallel-server system and large-size multiclass queueing networks. The algorithm consistently generates control policies that outperform state-of-the-art heuristics in the literature in a variety of load conditions from light to heavy traffic. These policies are demonstrated to be near-optimal when the…
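To make the long-run average cost setting in the abstract concrete, the following is a minimal sketch of a generic PPO-style clipped surrogate written for cost minimization. It is not taken from the paper: the function names, the one-step advantage estimator, the clipping parameter `eps=0.2`, and the use of a relative value function `h` with an average-cost estimate `avg_cost` are all assumptions chosen for illustration.

```python
# Illustrative sketch only (assumed names and estimator, not the paper's code).

def one_step_advantages(costs, values, avg_cost):
    """One-step advantage estimates for an average-cost MDP.

    A_t = c_t - avg_cost + h(s_{t+1}) - h(s_t), where `values` holds the
    relative value estimates h(s_0), ..., h(s_T) and so has one more entry
    than `costs`.
    """
    return [c - avg_cost + values[t + 1] - values[t]
            for t, c in enumerate(costs)]


def clipped_cost_surrogate(ratios, advantages, eps=0.2):
    """PPO-style clipped surrogate for a cost (minimization) objective.

    `ratios` are probability ratios pi_new(a|s) / pi_old(a|s). Because costs
    are minimized, the pessimistic bound takes the max of the unclipped and
    clipped terms (the reward-maximizing form of PPO takes the min).
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        total += max(r * a, r_clipped * a)
    return total / len(ratios)


# Hypothetical two-step trajectory: per-step costs (e.g. jobs in system),
# relative value estimates, and an estimate of the long-run average cost.
costs = [3.0, 1.0]
values = [2.0, 1.0, 0.5]
advantages = one_step_advantages(costs, values, avg_cost=2.0)
loss = clipped_cost_surrogate([1.5, 1.5], advantages)
```

In this sketch a policy update would lower `loss`; clipping removes the incentive to push a ratio outside `[1 - eps, 1 + eps]` once doing so no longer tightens the bound.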
Taxonomy
Methods: Entropy Regularization · Proximal Policy Optimization
