Linear Convergence for Natural Policy Gradient with Log-linear Policy Parametrization
Carlo Alfano, Patrick Rebeschini

TL;DR
This paper proves that the natural policy gradient algorithm with log-linear policies converges linearly in both the deterministic and sample-based settings for infinite-horizon discounted MDPs, extending earlier results for tabular softmax policies.
Contribution
It establishes linear convergence guarantees for natural policy gradient with log-linear policies, including in the presence of estimation and bias errors.
Findings
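Linear convergence in the deterministic case, where the Q-value is known, using a geometrically increasing step size.
Linear convergence in the sample-based case, up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix.
Extension of previous tabular softmax results to log-linear policies.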
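For readers unfamiliar with the terms, the two notions used in the findings can be written in generic form. The symbols below ($\theta$, $\phi$, $C$, $\rho$, $\varepsilon$) are notation assumed here for illustration; the paper's exact constants and rate are not reproduced on this page.

```latex
% Log-linear policy over a known feature map \phi(s,a) \in \mathbb{R}^d:
\pi_\theta(a \mid s)
  = \frac{\exp\big(\theta^\top \phi(s,a)\big)}
         {\sum_{a'} \exp\big(\theta^\top \phi(s,a')\big)}

% Generic shape of a linear convergence guarantee up to an error term,
% with some constant C > 0, a rate \rho \in (0,1), and an error \varepsilon \ge 0:
V^{\star} - V^{\pi_{\theta_T}} \;\le\; C\,\rho^{T} + \varepsilon
```

The tabular softmax policy is the special case in which $\phi(s,a)$ is a one-hot indicator of the pair $(s,a)$.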
Abstract
We analyze the convergence rate of the unregularized natural policy gradient algorithm with log-linear policy parametrizations in infinite-horizon discounted Markov decision processes. In the deterministic case, when the Q-value is known and can be approximated by a linear combination of a known feature function up to a bias error, we show that a geometrically-increasing step size yields a linear convergence rate towards an optimal policy. We then consider the sample-based case, when the best representation of the Q-value function among linear combinations of a known feature function is known up to an estimation error. In this setting, we show that the algorithm enjoys the same linear guarantees as in the deterministic case up to an error term that depends on the estimation error, the bias error, and the condition number of the feature covariance matrix. Our results build upon the…
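The deterministic setting described in the abstract can be illustrated with a minimal sketch on a small random MDP. This is not the authors' implementation: the function names (`policy`, `q_values`, `q_npg`), the uniform weighting of the least-squares fit, the step-size growth factor `1 / gamma`, and the toy MDP are all assumptions made for illustration. The features are chosen full-rank so that the Q-value is exactly representable, meaning zero bias error.

```python
import numpy as np

def policy(theta, phi):
    """Log-linear policy: pi(a|s) proportional to exp(theta . phi(s, a))."""
    logits = phi @ theta                          # shape (S, A)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def q_values(pi, P, r, gamma):
    """Exact policy evaluation on a tabular MDP (the 'known Q-value' case)."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)         # state-to-state transitions under pi
    r_pi = (pi * r).sum(axis=1)                   # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                         # shape (S, A)
    return Q, V

def q_npg(P, r, phi, gamma, eta0, iters):
    """Sketch of an NPG-style update: fit Q linearly, then step along the fit."""
    S, A, d = phi.shape
    X = phi.reshape(S * A, d)
    theta, eta, values = np.zeros(d), eta0, []
    for _ in range(iters):
        pi = policy(theta, phi)
        Q, V = q_values(pi, P, r, gamma)
        values.append(V.mean())
        # Assumption: unweighted least squares over all state-action pairs.
        w, *_ = np.linalg.lstsq(X, Q.reshape(S * A), rcond=None)
        theta = theta + eta * w
        eta = eta / gamma                         # assumed geometric growth factor
    return theta, values

# Toy problem (hypothetical): 4 states, 3 actions, random full-rank features.
rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
phi = rng.standard_normal((S, A, S * A))
gamma = 0.9
theta, values = q_npg(P, r, phi, gamma, eta0=1.0, iters=30)
print(values[0], values[-1])
```

With fewer features than state-action pairs, the least-squares fit would generally leave a residual, which corresponds to the bias error mentioned in the abstract. Replacing the exact `q_values` with a sampled estimate would correspond to the sample-based case.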
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Age of Information Optimization
Methods: Softmax
