On Reward Structures of Markov Decision Processes
Falcon Z. Dai

TL;DR
This paper explores the structure of Markov decision processes focusing on reward functions, introduces new estimators and theoretical insights for reinforcement learning, and proposes methods for safe and multi-objective policy optimization.
Contribution
It presents a novel estimator with instance-specific error bounds, refines key MDP constants for reward-based analysis, and develops algorithms for safe and Pareto-optimal policy planning.
Findings
New estimator with $\tilde{O}(\frac{\tau_s}{n})$ error bound
Theoretical link between reward shaping and learning speed
Modified algorithms for safe and multi-objective reinforcement learning
Abstract
A Markov decision process can be parameterized by a transition kernel and a reward function. Both play essential roles in the study of reinforcement learning, as evidenced by their presence in the Bellman equations. In our inquiry into the various kinds of "costs" associated with reinforcement learning, inspired by the demands of robotic applications, rewards are central to understanding the structure of a Markov decision process, and reward-centric notions can elucidate important concepts in reinforcement learning. Specifically, we study the sample complexity of policy evaluation and develop a novel estimator with an instance-specific error bound of $\tilde{O}(\frac{\tau_s}{n})$ for estimating a single state value. Under the online regret minimization setting, we refine the transition-based MDP constant, diameter, into a reward-based constant, maximum expected hitting cost, and with…
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research
