Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

Akshay Mete; Rahul Singh; Xi Liu; P. R. Kumar

arXiv:2011.07738 · cs.LG · May 18, 2021 · 6 cites

TL;DR

This paper applies Reward Biased Maximum Likelihood Estimation (RBMLE) to reinforcement learning, that is, to the control of unknown Markov Decision Processes. It reports $O(\log T)$ regret and simulations in which RBMLE outperforms UCRL2 and Thompson Sampling.

Contribution

It extends RBMLE to finite-time reinforcement learning, providing theoretical regret bounds and empirical evidence of outperforming existing algorithms.

Findings

01

RBMLE achieves $O(\log T)$ regret for MDPs.

02

Simulation results show RBMLE outperforms UCRL2 and Thompson Sampling.

03

RBMLE exhibits competitive or superior empirical performance.

Abstract

The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed to overcome the central obstacle of what is variously called the fundamental "closed-loop identifiability problem" of adaptive control, the "dual control problem", or, contemporaneously, the "exploration vs. exploitation problem". It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify the closed-loop transition probabilities under a certainty equivalent approach, the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it proposed a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a solution to the fundamental problem alluded to above. It thereby proposed an optimistic approach of favoring…
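
The reward-biasing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a finite set of candidate models, an additive bias of the form `alpha * J*(theta)` on the log-likelihood, and a hand-picked `alpha`. The function names, the toy two-state MDP, and the transition counts are all hypothetical, and the bias schedule used in the paper is not given in this excerpt.

```python
import math

def optimal_average_reward(P, R, iters=500):
    """Relative value iteration for a small MDP with strictly positive
    transition probabilities. P[s][a][s2] is a transition probability,
    R[s][a] a reward. Returns the optimal long-run average reward."""
    n = len(P)
    h = [0.0] * n          # relative values, normalized so that h[0] == 0
    gain = 0.0
    for _ in range(iters):
        q = [max(R[s][a] + sum(P[s][a][s2] * h[s2] for s2 in range(n))
                 for a in range(len(P[s])))
             for s in range(n)]
        gain = q[0]        # gain estimate, valid because h[0] == 0
        h = [v - q[0] for v in q]
    return gain

def log_likelihood(P, counts):
    """Log-likelihood of observed transition counts under model P."""
    total = 0.0
    for s, per_action in enumerate(counts):
        for a, per_next in enumerate(per_action):
            for s2, c in enumerate(per_next):
                if c > 0:
                    total += c * math.log(P[s][a][s2])
    return total

def rbmle_select(candidates, counts, R, alpha):
    """Index of the candidate maximizing
    log-likelihood + alpha * optimal average reward (assumed bias form)."""
    scores = [log_likelihood(P, counts) + alpha * optimal_average_reward(P, R)
              for P in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Toy problem (hypothetical): reward 1 in state 1, reward 0 in state 0.
R = [[0.0, 0.0], [1.0, 1.0]]
# The two candidates differ only in action 1 taken in state 0.
P_A = [[[0.8, 0.2], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
P_B = [[[0.8, 0.2], [0.1, 0.9]], [[0.5, 0.5], [0.5, 0.5]]]
candidates = [P_A, P_B]
# Ten observed transitions from (state 0, action 1): 3 to state 0, 7 to state 1.
counts = [[[0, 0], [3, 7]], [[0, 0], [0, 0]]]

print(rbmle_select(candidates, counts, R, alpha=0.0))  # plain MLE picks 0
print(rbmle_select(candidates, counts, R, alpha=8.0))  # biased estimate picks 1
```

With `alpha=0.0` the selection reduces to plain maximum likelihood and returns the better-fitting model `P_A`. With `alpha=8.0` the bias toward the larger optimal average reward (9/14 under `P_B` versus 1/2 under `P_A`) outweighs the small likelihood gap and the selection switches to `P_B`, which is the optimistic behavior the abstract describes.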

Taxonomy

Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control