Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
Alireza Masoumian, James R. Wright

TL;DR
This paper introduces MRBEAR, an online model selection algorithm for average reward reinforcement learning. It shows that the regret cost of model selection grows only linearly in the number of candidate model classes, and applies the algorithm to utility maximization in repeated games against opponents with unknown memory limits.
Contribution
The paper proposes MRBEAR, a novel algorithm for model selection in average reward RL, with proven regret bounds and a practical application to repeated game scenarios.
Findings
MRBEAR achieves regret bounds scaling linearly with the number of models M.
In repeated games, the algorithm's regret depends on opponent complexity and is nearly optimal.
Exponential regret dependence on opponent memory m* is proven to be unavoidable.
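To give a feel for why opponent memory matters so much, the sketch below counts the states of a hypothetical model class for an opponent of memory m. It assumes, purely for illustration, that a state is the sequence of the last m joint actions (learner action, opponent action); the paper's exact state encoding may differ, and the function names are invented here.

```python
from itertools import product

def memory_states(learner_actions, opponent_actions, m):
    """Enumerate all histories of the last m joint actions.

    Assumption (not from the paper): a state of the memory-m model
    class is exactly such a history.
    """
    joint = list(product(learner_actions, opponent_actions))
    return list(product(joint, repeat=m))

def num_states(n_learner, n_opponent, m):
    """Closed-form count of those histories: (|A| * |B|) ** m."""
    return (n_learner * n_opponent) ** m

if __name__ == "__main__":
    # Two actions each (e.g. cooperate/defect): the count grows as 4 ** m.
    for m in range(4):
        print(m, num_states(2, 2, m))
```

Under this encoding the state space grows exponentially in m, which is consistent with the finding that an exponential dependence on m* cannot be avoided.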
Abstract
In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process (MDP) whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of M model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose MRBEAR, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is bounded in terms of M, the complexity of the simplest well-specified model class, and that class's corresponding regret bound. This result shows that in average reward RL, as in episodic online RL, the additional cost of model selection scales only linearly in M, the number…
Taxonomy
Topics: Statistical and Computational Modeling
