Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation
Woojin Chae, Dabeen Lee

TL;DR
This paper introduces a computationally efficient algorithm for infinite-horizon average-reward linear MDPs and linear mixture MDPs that achieves the best-known regret upper bound for linear MDPs.
Contribution
It presents the first computationally tractable algorithm with provable regret bounds for linear MDPs under the Bellman optimality condition.
Findings
Achieves regret of \( \widetilde{O}(d^{3/2} \, \mathrm{sp}(v^*) \sqrt{T}) \) for linear MDPs.
Attains regret of \( \widetilde{O}(d \, \mathrm{sp}(v^*) \sqrt{T}) \) for linear mixture MDPs.
Uses novel techniques to control the covering number of the value function class and the span of the optimistic estimators of the value function.
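To make the gap between the two regret rates concrete, here is a minimal Python sketch that evaluates only their leading terms, reading the bounds as d^{3/2} sp(v^*) sqrt(T) and d sp(v^*) sqrt(T). Constants and logarithmic factors are dropped, and the function names and the numeric inputs are hypothetical, chosen for illustration rather than taken from the paper.

```python
import math


def linear_mdp_leading_term(d: int, sp: float, T: int) -> float:
    """Leading term d^{3/2} * sp * sqrt(T); constants and log factors are ignored."""
    return d ** 1.5 * sp * math.sqrt(T)


def linear_mixture_leading_term(d: int, sp: float, T: int) -> float:
    """Leading term d * sp * sqrt(T); constants and log factors are ignored."""
    return d * sp * math.sqrt(T)


# Hypothetical values: feature dimension, span of the optimal bias, horizon.
d, sp, T = 16, 2.5, 10_000

mdp_term = linear_mdp_leading_term(d, sp, T)
mixture_term = linear_mixture_leading_term(d, sp, T)

# For the same d, sp, and T the two leading terms differ by a factor of sqrt(d).
ratio = mdp_term / mixture_term
print(mdp_term, mixture_term, ratio)
```

Both terms grow as sqrt(T), so the average regret per step vanishes as T grows. The two settings use different feature maps, so the sqrt(d) ratio is a comparison of rates only, not a statement that one problem is easier than the other.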
Abstract
This paper proposes a computationally tractable algorithm for learning infinite-horizon average-reward linear Markov decision processes (MDPs) and linear mixture MDPs under the Bellman optimality condition. While guaranteeing computational efficiency, our algorithm for linear MDPs achieves the best-known regret upper bound of \( \widetilde{O}(d^{3/2} \, \mathrm{sp}(v^*) \sqrt{T}) \) over \( T \) time steps, where \( \mathrm{sp}(v^*) \) is the span of the optimal bias function \( v^* \) and \( d \) is the dimension of the feature mapping. For linear mixture MDPs, our algorithm attains a regret bound of \( \widetilde{O}(d \, \mathrm{sp}(v^*) \sqrt{T}) \). The algorithm applies novel techniques to control the covering number of the value function class and the span of optimistic estimators of the value function, which are of independent interest.
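For reference, the quantities named in the abstract are usually defined as follows in the average-reward literature. The symbols \( J^* \) (optimal average reward), \( r \) (reward function), \( P \) (transition kernel), and \( (s_t, a_t) \) (the state-action pair visited at step \( t \)) are standard notation assumed here, not copied from the paper, whose exact formulation may differ.

```latex
\begin{align*}
\mathrm{sp}(v) &= \sup_{s} v(s) - \inf_{s} v(s)
  && \text{(span of a function } v \text{ over states)} \\
J^* + v^*(s) &= \max_{a} \Big\{ r(s,a) + \mathbb{E}_{s' \sim P(\cdot \mid s,a)} \big[ v^*(s') \big] \Big\}
  && \text{(Bellman optimality equation)} \\
\mathrm{Regret}(T) &= T J^* - \sum_{t=1}^{T} r(s_t, a_t)
  && \text{(regret over } T \text{ steps)}
\end{align*}
```

Under this reading, the Bellman optimality condition asks that some pair \( (J^*, v^*) \) solve the second equation, and the regret bounds scale with \( \mathrm{sp}(v^*) \) rather than with a diameter or mixing-time parameter.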
Taxonomy
Topics: Traffic control and management · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics
