Policy Mirror Descent Inherently Explores Action Space
Yan Li, Guanghui Lan

TL;DR
This paper proves that online policy gradient methods can achieve optimal sample complexity without explicit exploration by introducing new evaluation operators and analysis techniques, improving efficiency and safety in reinforcement learning.
Contribution
The paper introduces two novel on-policy evaluation operators and a new analysis of stochastic policy mirror descent, enabling exploration-free policy gradient methods with optimal sample complexity.
Findings
Achieves $\tilde{O}(1/\epsilon^2)$ sample complexity without explicit exploration.
Introduces value-based estimation tailored to KL divergence.
Develops truncated on-policy Monte Carlo with strong convergence guarantees.
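The two ingredients named in the findings above can be illustrated with a minimal Python sketch. It is not the paper's operators, which this summary does not specify. It assumes a cost-minimization convention, and the names `kl_mirror_step`, `truncated_return`, `step_size`, and `horizon` are hypothetical. The first function is the standard closed-form mirror descent step under the Kullback-Leibler divergence, where the new policy is proportional to the old one times `exp(-step_size * q)`. The second is a generic truncated discounted return along a single on-policy trajectory.

```python
import math


def kl_mirror_step(policy, q_values, step_size):
    """One KL-divergence mirror descent step for a single state.

    policy:    action probabilities under the current policy (sums to 1)
    q_values:  estimated cost-to-go per action (assumption: lower is better)
    step_size: positive step size (hypothetical parameter name)
    """
    # Work in log space for numerical stability.
    # Actions with zero probability keep zero probability.
    logits = [
        math.log(p) - step_size * q if p > 0 else float("-inf")
        for p, q in zip(policy, q_values)
    ]
    shift = max(logits)
    weights = [math.exp(logit - shift) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def truncated_return(costs, discount, horizon):
    """Discounted sum of the first `horizon` costs of one trajectory."""
    return sum((discount ** t) * c for t, c in enumerate(costs[:horizon]))


if __name__ == "__main__":
    updated = kl_mirror_step([0.5, 0.5], [1.0, 0.0], step_size=1.0)
    print(updated)
    print(truncated_return([1.0, 1.0, 1.0], discount=0.5, horizon=2))
```

The update in this sketch mixes in no explicit exploration term such as an epsilon-greedy perturbation. The probability of each action changes only through the estimated values.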
Abstract
Explicit exploration in the action space was assumed to be indispensable for online policy gradient methods to avoid a drastic degradation in sample complexity, for solving general reinforcement learning problems over finite state and action spaces. In this paper, we establish for the first time an $\tilde{O}(1/\epsilon^2)$ sample complexity for online policy gradient methods without incorporating any exploration strategies. The essential development consists of two new on-policy evaluation operators and a novel analysis of the stochastic policy mirror descent method (SPMD). SPMD with the first evaluation operator, called value-based estimation, is tailored to the Kullback-Leibler divergence. Provided the Markov chains on the state space of generated policies are uniformly mixing with non-diminishing minimal visitation measure, an sample…
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Machine Learning and Algorithms
