Optimistic Q-learning for average reward and episodic reinforcement learning
Priyank Agrawal, Shipra Agrawal

TL;DR
This paper introduces an optimistic Q-learning algorithm for average reward reinforcement learning whose setting strictly generalizes the episodic one, proves a regret bound of $\tilde{O}(H^5 S \sqrt{AT})$, and builds on a novel averaged Bellman operator that is a strict contraction in span.
Contribution
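It presents a new optimistic Q-learning method for average reward RL under a relaxed assumption (some frequent state is reached within bounded time under every policy), introduces the $\overline{L}$ operator, which is a strict contraction in span, and gives a unified analysis of episodic and non-episodic settings.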
Findings
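Regret bound of $\tilde{O}(H^5 S \sqrt{AT})$
The $\overline{L}$ operator is a strict contraction in span, even with discount factor 1
The setting strictly generalizes the episodic setting, and the analysis covers both episodic and average reward cases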
Abstract
We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time for all states made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S \sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^{H} L^h v$, where $L$ denotes the Bellman operator.…
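The averaged operator mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm: it assumes the operator has the form $\overline{L} v = \frac{1}{H} \sum_{h=1}^{H} L^h v$ with $L$ the undiscounted Bellman optimality operator, and it uses a hypothetical 3-state, 2-action MDP in which every action reaches state 0 with probability at least 0.5. That condition is much stronger than the paper's assumption and is chosen only so that the span contraction is easy to verify. All names (`P`, `R`, `bellman`, `averaged_bellman`, `contraction_ratio`) are illustrative.

```python
import random

# Hypothetical toy MDP (not from the paper): 3 states, 2 actions.
# P[s][a][s2] is the probability of moving from s to s2 under action a.
# Every (state, action) pair reaches state 0 with probability >= 0.5,
# so state 0 plays the role of the frequently visited state.
P = [
    [[0.5, 0.5, 0.0], [0.6, 0.0, 0.4]],
    [[0.7, 0.3, 0.0], [0.5, 0.0, 0.5]],
    [[0.5, 0.25, 0.25], [0.9, 0.1, 0.0]],
]
# R[s][a] is the reward for taking action a in state s.
R = [[1.0, 0.0], [0.0, 0.5], [0.2, 0.8]]


def bellman(v):
    """Undiscounted Bellman optimality operator:
    (L v)(s) = max_a [ R[s][a] + sum_s2 P[s][a][s2] * v[s2] ]."""
    return [
        max(
            R[s][a] + sum(p * x for p, x in zip(P[s][a], v))
            for a in range(len(R[s]))
        )
        for s in range(len(P))
    ]


def averaged_bellman(v, H):
    """Assumed form of the averaged operator: (1/H) * sum_{h=1..H} L^h v."""
    total = [0.0] * len(v)
    current = list(v)
    for _ in range(H):
        current = bellman(current)  # current now holds L^h v
        total = [t + c for t, c in zip(total, current)]
    return [t / H for t in total]


def span(v):
    """Span seminorm: max(v) - min(v)."""
    return max(v) - min(v)


def contraction_ratio(u, v, H):
    """span(Lbar u - Lbar v) / span(u - v); requires span(u - v) > 0."""
    diff_in = [a - b for a, b in zip(u, v)]
    out_u, out_v = averaged_bellman(u, H), averaged_bellman(v, H)
    diff_out = [a - b for a, b in zip(out_u, out_v)]
    return span(diff_out) / span(diff_in)


if __name__ == "__main__":
    rng = random.Random(0)
    u = [rng.uniform(-10, 10) for _ in range(3)]
    v = [rng.uniform(-10, 10) for _ in range(3)]
    # A ratio strictly below 1 indicates contraction in span for this pair.
    print("span contraction ratio:", contraction_ratio(u, v, H=4))
```

In this toy MDP any two transition rows share at least 0.5 probability mass on state 0, so each application of `bellman` shrinks the span of a difference by a factor of at most 0.5, and the averaged operator inherits a ratio of at most 0.5. Under the paper's weaker assumption a single Bellman step need not contract, which is the motivation for averaging over $H$ steps.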
Taxonomy
Topics: Mental Health Research Topics
Methods: Q-Learning
