Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator
Karl Krauth, Stephen Tu, Benjamin Recht

TL;DR
This paper provides a finite-time analysis of approximate policy iteration for the Linear Quadratic Regulator, highlighting the sample complexity of policy evaluation and proposing an adaptive exploration method with improved regret bounds.
Contribution
The paper quantifies the sample complexity of approximate policy iteration for LQR and introduces an adaptive $\varepsilon$-greedy exploration procedure with better regret performance.
Findings
Policy evaluation dominates sample complexity with $(n+d)^3/\varepsilon^2$ samples per step.
Only $\log(1/\varepsilon)$ policy improvement steps are needed.
The adaptive procedure achieves $T^{2/3}$ regret, improving on previous results.
Abstract
We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\log(1/\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$. We…
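To make the alternation between policy evaluation and policy improvement concrete, here is a minimal sketch of policy iteration on a scalar LQR problem. It is an illustration under simplifying assumptions, not the paper's algorithm: the dynamics `x_next = a * x + b * u` and cost `q * x**2 + r * u**2` are assumed known, so evaluation is exact, whereas the paper estimates the value function from samples. All function and parameter names are hypothetical.

```python
def evaluate_policy(a, b, q, r, k):
    """Exact policy evaluation for the linear policy u = -k * x.

    Returns P such that the infinite-horizon cost from state x is P * x**2.
    Assumption: the model (a, b) is known; a sample-based method would
    estimate this quantity instead of computing it in closed form.
    """
    closed_loop = a - b * k
    if abs(closed_loop) >= 1.0:
        raise ValueError("policy does not stabilize the system")
    # Scalar Lyapunov equation: P = q + r*k^2 + closed_loop^2 * P
    return (q + r * k * k) / (1.0 - closed_loop * closed_loop)


def improve_policy(a, b, r, p):
    """Policy improvement: the gain that is greedy with respect to P."""
    return a * b * p / (r + b * b * p)


def policy_iteration(a, b, q, r, k0, num_steps=20):
    """Alternate evaluation and improvement, starting from a stabilizing k0.

    Returns a list of (P, k) pairs, one per iteration.
    """
    k = k0
    history = []
    for _ in range(num_steps):
        p = evaluate_policy(a, b, q, r, k)
        k = improve_policy(a, b, r, p)
        history.append((p, k))
    return history


if __name__ == "__main__":
    for step, (p, k) in enumerate(policy_iteration(1.2, 1.0, 1.0, 1.0, k0=1.0, num_steps=6)):
        print(step, p, k)
```

With exact evaluation the value `P` reaches the fixed point of the scalar Riccati equation after a handful of improvement steps. This is in line with the abstract's claim that few improvement steps are needed and that the cost lies in evaluating each policy accurately from data.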
Taxonomy
Topics: Advanced Bandit Algorithms Research · Reinforcement Learning in Robotics · Machine Learning and Algorithms
