Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Dongruo Zhou, Jiafan He, Quanquan Gu

TL;DR
This paper introduces a new reinforcement learning algorithm for discounted MDPs with feature mappings, achieving near-optimal regret bounds without requiring a generative model or ergodicity assumptions.
Contribution
The paper gives the first polynomial regret bound for feature-based RL in discounted MDPs that requires neither a generative model nor strong assumptions such as ergodicity, and it proves a nearly matching lower bound.
Findings
Achieves regret of $\tilde O\left(\frac{d\sqrt{T}}{(1-\gamma)^2}\right)$
Provides a lower bound of $\tilde \Omega\left(\frac{d\sqrt{T}}{(1-\gamma)^{1.5}}\right)$
Demonstrates near-optimality of the proposed algorithm
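To make the near-optimality claim concrete: dividing the upper bound by the lower bound cancels the $d\sqrt{T}$ term and leaves a factor of $(1-\gamma)^{-1/2}$, once constants and logarithmic factors are ignored. The short Python sketch below evaluates only the leading terms of the two bounds. The function names are illustrative and do not come from the paper.

```python
import math

def upper_bound(d, T, gamma):
    """Leading term of the regret upper bound, d*sqrt(T)/(1-gamma)^2.
    Constants and logarithmic factors are dropped (an assumption of this sketch)."""
    return d * math.sqrt(T) / (1.0 - gamma) ** 2

def lower_bound(d, T, gamma):
    """Leading term of the lower bound, d*sqrt(T)/(1-gamma)^1.5."""
    return d * math.sqrt(T) / (1.0 - gamma) ** 1.5

def gap(d, T, gamma):
    """Ratio of the two leading terms; algebraically equal to (1-gamma)^(-1/2)."""
    return upper_bound(d, T, gamma) / lower_bound(d, T, gamma)

if __name__ == "__main__":
    for g in (0.5, 0.9, 0.99):
        print(f"gamma={g}: gap ~ {gap(d=10, T=10_000, gamma=g):.3f}")
```

The gap depends only on the discount factor, not on $d$ or $T$, so the two bounds match in their dependence on dimension and horizon.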
Abstract
Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of a certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithms,…
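As an illustration of the setting described in the abstract, the Python sketch below builds a small discounted MDP whose transition kernel is linear in a feature mapping, $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta \rangle$. It assumes one simple way to get such a kernel: the features are $d$ base transition kernels and $\theta$ lies on the probability simplex. It then runs plain value iteration with $\theta$ known. This is not the paper's learning algorithm, which must estimate the parameter from data, and every name and size below is hypothetical.

```python
import numpy as np

def make_linear_kernel(base_kernels, theta):
    """Return P with P[s, a, s'] = <phi(s'|s,a), theta>.

    base_kernels has shape (d, S, A, S); phi(s'|s,a)[i] = base_kernels[i, s, a, s'].
    Assumption of this sketch: each base kernel is a valid transition kernel and
    theta is a probability vector, so the mixture is a valid kernel as well.
    """
    return np.tensordot(theta, base_kernels, axes=1)  # shape (S, A, S)

def value_iteration(P, reward, gamma, iters=500):
    """Discounted value iteration with a known model (planning only)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = reward + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return V, Q

rng = np.random.default_rng(0)
d, S, A, gamma = 3, 4, 2, 0.9              # hypothetical sizes
base = rng.random((d, S, A, S))
base /= base.sum(axis=-1, keepdims=True)   # normalise each base kernel
theta = np.array([0.5, 0.3, 0.2])          # hypothetical true parameter
P = make_linear_kernel(base, theta)
reward = rng.random((S, A))                # rewards in [0, 1)
V, Q = value_iteration(P, reward, gamma)
print(V)
```

With rewards in $[0, 1]$, every value is at most $1/(1-\gamma)$, which is one source of the horizon-like $1/(1-\gamma)$ factors in the bounds.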
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Machine Learning and Algorithms
