Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic
Wesley A. Suttle, Amrit Singh Bedi, Bhrij Patel, Brian M. Sadler, Alec Koppel, Dinesh Manocha

TL;DR
This paper introduces MAC, a multi-level Monte Carlo actor-critic algorithm for average-reward RL that adapts to unknown mixing times, enabling stable learning in complex, slow-mixing environments with sparse rewards.
Contribution
The paper proposes MAC, an actor-critic method that does not require prior knowledge of mixing times, improving stability and performance in challenging RL settings.
Findings
Achieves convergence rates comparable to state-of-the-art algorithms.
Demonstrates superior practical performance in sparse reward environments.
Removes the need for exponential mixing assumptions.
Abstract
Many existing reinforcement learning (RL) methods employ stochastic gradient iteration on the back end, whose stability hinges upon a hypothesis that the data-generating process mixes exponentially fast with a rate parameter that appears in the step-size selection. Unfortunately, this assumption is violated for large state spaces or settings with sparse rewards, and the mixing time is unknown, making the step size inoperable. In this work, we propose an RL methodology attuned to the mixing time by employing a multi-level Monte Carlo estimator for the critic, the actor, and the average reward embedded within an actor-critic (AC) algorithm. This method, which we call Multi-level Actor-Critic (MAC), is developed especially for infinite-horizon average-reward settings and neither relies on oracle knowledge of the mixing time in its parameter selection nor assumes…
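The abstract does not spell out the multi-level Monte Carlo (MLMC) estimator, so the following is only a minimal sketch of one common MLMC construction, not the paper's exact method. It assumes a randomly drawn level J with P(J = j) = 2^-j, a rollout of 2^J consecutive samples, and a truncation threshold; the names `sample_level`, `mlmc_estimate`, `draw`, and `t_max` are hypothetical and do not come from the paper.

```python
import random
from statistics import fmean


def sample_level(rng):
    """Draw J with P(J = j) = 2**-j for j = 1, 2, ... using fair coin flips."""
    level = 1
    while rng.random() < 0.5:
        level += 1
    return level


def mlmc_estimate(draw, rng, t_max=64):
    """Return (estimate, samples_used) for one MLMC draw.

    draw(n) is an assumed callback returning a list of n consecutive
    scalar samples (e.g. per-step gradient or TD terms) from the trajectory.
    """
    n = 2 ** sample_level(rng)
    if n > t_max:
        # Level too deep: fall back to the single-sample base estimate.
        return draw(1)[0], 1
    xs = draw(n)
    base = xs[0]                    # level-0 estimate (one sample)
    fine = fmean(xs)                # average over all 2^J samples
    coarse = fmean(xs[: n // 2])    # average over the first 2^(J-1) samples
    # Reweighting by 2^J cancels the 2^-J probability of this level, so the
    # level differences telescope in expectation up to the truncation level.
    return base + n * (fine - coarse), n
```

Under these assumptions, the estimate has the same expectation as a plain average over the longest permitted rollout (the largest power of two not exceeding `t_max`), while the expected number of samples per call grows only logarithmically in `t_max`. This is one way an estimator can benefit from long rollouts without a mixing-time-dependent step size.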
Taxonomy
Topics: Reinforcement Learning in Robotics
