Online Resource Allocation in Episodic Markov Decision Processes
Duksang Lee, William Overman, Dabeen Lee

TL;DR
This paper introduces an online resource allocation framework for episodic Markov decision processes with unknown, non-stationary dynamics, proposing algorithms that achieve near-optimal regret bounds in different feedback regimes.
Contribution
It develops an online dual mirror descent algorithm for resource allocation in episodic MDPs, introducing the observe-then-decide feedback regime and improving on the existing decide-then-observe regime.
Findings
Achieves near-optimal regret bounds in both the observe-then-decide and decide-then-observe regimes.
Demonstrates numerical efficiency on inventory management problems.
Provides theoretical analysis for non-stationary, constrained MDPs.
Abstract
This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime; the two settings differ in how the observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde…
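To give a rough feel for the dual-descent idea named in the abstract, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a much simpler single-stage setting with one resource, a finite set of options per round, and a Euclidean mirror map (so the mirror step reduces to a projected subgradient step). The function name `dual_descent_allocation` and its parameters are hypothetical and chosen only for illustration.

```python
def dual_descent_allocation(rounds, budget, step_size):
    """Simplified single-resource online allocation via dual descent.

    rounds: list of rounds; each round is a list of (reward, consumption) options.
    budget: total amount of the resource available over all rounds.
    step_size: learning rate for the dual variable (an assumed tuning parameter).
    """
    rho = budget / len(rounds)   # target per-round consumption
    mu = 0.0                     # dual price of the resource
    remaining = budget
    total_reward = 0.0
    for options in rounds:
        # A "do nothing" option is always allowed; others must fit the budget.
        feasible = [(0.0, 0.0)] + [o for o in options if o[1] <= remaining]
        # Pick the option with the best price-adjusted reward.
        reward, cost = max(feasible, key=lambda o: o[0] - mu * o[1])
        total_reward += reward
        remaining -= cost
        # Projected subgradient step: raise the price when consumption
        # exceeds the per-round target, lower it otherwise, keep mu >= 0.
        mu = max(0.0, mu - step_size * (rho - cost))
    return total_reward, remaining, mu


# Usage: four identical rounds, budget for two unit-cost allocations.
result = dual_descent_allocation([[(1.0, 1.0)]] * 4, budget=2.0, step_size=0.5)
```

The dual variable acts as a price on the resource: options are ranked by reward minus priced consumption, and the price moves toward keeping average consumption near `budget / len(rounds)`. The setting in the abstract is considerably harder, since each period is a multi-stage MDP with unknown, non-stationary transitions.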
Taxonomy
Topics: Optimization and Search Problems · Age of Information Optimization · Advanced Queuing Theory Analysis
