A model-free first-order method for linear quadratic regulator with $\tilde{O}(1/\varepsilon)$ sampling complexity
Caleb Ju, Georgios Kotsalis, Guanghui Lan

TL;DR
This paper introduces a model-free first-order policy gradient method for stochastic LQR that achieves a sampling complexity of $\tilde{O}(1/\varepsilon)$ without assuming that every policy generated by the algorithm is stable, matching the best-known model-based rate up to logarithmic factors.
Contribution
It presents a novel actor-critic algorithm for stochastic LQR with an improved $\tilde{O}(1/\varepsilon)$ sample complexity, matching the best-known model-based rate up to logarithmic factors without assuming that all generated policies are stable almost surely.
Findings
Achieves $\tilde{O}(1/\varepsilon)$ sample complexity for stochastic LQR.
Utilizes a variational inequality formulation and a stochastic primal-dual critic.
Demonstrates optimal convergence rates with a multi-epoch scheme.
Abstract
We consider the classic stochastic linear quadratic regulator (LQR) problem under an infinite horizon average stage cost. By leveraging recent policy gradient methods from reinforcement learning, we obtain a first-order method that finds a stable feedback law whose objective function gap to the optima is at most $\varepsilon$ with high probability using $\tilde{O}(1/\varepsilon)$ samples, where $\tilde{O}$ hides polylogarithmic dependence on $\varepsilon$. Our proposed method seems to have the best dependence on $\varepsilon$ within the model-free literature without the assumption that all policies generated by the algorithm are stable almost surely, and it matches the best-known rate from the model-based literature, up to logarithmic factors. The improved dependence on $\varepsilon$ is achieved by showing the accuracy scales with the variance rather than the standard deviation of the…
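To make the problem setting in the abstract concrete, here is a minimal Python sketch of what "average stage cost" and "stable feedback law" mean for a linear policy. It is not the paper's algorithm. The dynamics $x_{t+1} = A x_t + B u_t + w_t$ with noise covariance $W$, the stage cost $x^\top Q x + u^\top R u$, the policy $u = -Kx$, and all function names are assumptions introduced only for illustration. The first routine uses full model knowledge (a Lyapunov equation); the second estimates the same quantity purely from sampled transitions, which is the kind of access a model-free method has.

```python
import numpy as np


def is_stable(A, B, K):
    """True when u = -K x makes the closed loop A - B K Schur stable."""
    return np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0


def average_cost(A, B, Q, R, W, K):
    """Exact average stage cost of u = -K x, using full model knowledge."""
    if not is_stable(A, B, K):
        return np.inf  # simplification: an unstable law is treated as infinite cost
    Acl = A - B @ K
    Qk = Q + K.T @ R @ K
    n = A.shape[0]
    # Solve P = Qk + Acl^T P Acl, using vec(Acl^T P Acl) = (Acl^T kron Acl^T) vec(P).
    M = np.eye(n * n) - np.kron(Acl.T, Acl.T)
    P = np.linalg.solve(M, Qk.reshape(-1)).reshape(n, n)
    return float(np.trace(P @ W))


def sampled_cost(A, B, Q, R, W, K, steps=50_000, seed=0):
    """Monte Carlo estimate of the same cost from one simulated trajectory."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.linalg.cholesky(W)  # assumes W is positive definite
    x = np.zeros(n)
    total = 0.0
    for _ in range(steps):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + L @ rng.standard_normal(n)
    return total / steps
```

For a stable gain the two functions should agree up to sampling noise, and that noise shrinks as `steps` grows; how many samples are needed to reach a target accuracy $\varepsilon$ is the question that the sampling complexity bound addresses.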
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Adaptive Dynamic Programming Control
