Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling
L.A. Prashanth, Nathaniel Korda, Rémi Munos

TL;DR
This paper introduces a stochastic approximation method for policy evaluation with linear function approximation that reduces computational complexity and maintains convergence rates, making it suitable for large-scale data applications.
Contribution
It proposes a randomized sample-based SA method for LSTD, providing non-asymptotic bounds and demonstrating comparable convergence rates with lower complexity.
Findings
Achieves an $O(d)$ improvement in complexity over LSTD, where $d$ is the dimension of the data.
Provides finite-time bounds in high probability and expectation.
Demonstrates practical efficiency in traffic control and news recommendation tasks.
Abstract
We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method results in an $O(d)$ improvement in complexity in comparison to LSTD, where $d$ is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. The latter assumption can be easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to…
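The scheme described in the abstract, TD learning with linear function approximation run on transitions drawn uniformly from a fixed dataset, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names `batch_td0` and `lstd_solution`, the step-size schedule `c0 * c / (c + n)`, and its default constants are assumptions made here for the example. Each SA update uses only inner products and a vector addition, so it costs $O(d)$ operations, whereas the direct LSTD solve shown for comparison works with a $d \times d$ matrix.

```python
import numpy as np


def lstd_solution(phi, phi_next, rewards, gamma):
    """Direct batch LSTD: solve A theta = b with the d x d matrix A."""
    n_samples = phi.shape[0]
    A = phi.T @ (phi - gamma * phi_next) / n_samples
    b = phi.T @ rewards / n_samples
    return np.linalg.solve(A, b)


def batch_td0(phi, phi_next, rewards, gamma,
              n_iters=5000, c0=1.0, c=100.0, seed=0):
    """TD(0) with linear features on uniformly sampled batch transitions.

    phi, phi_next: (n_samples, d) arrays of features for s_i and s'_i.
    rewards:       (n_samples,) array of rewards r_i.
    The step size c0 * c / (c + n) is a hypothetical choice for this sketch.
    """
    rng = np.random.default_rng(seed)
    n_samples, d = phi.shape
    theta = np.zeros(d)
    for n in range(1, n_iters + 1):
        i = rng.integers(n_samples)  # pick one transition uniformly at random
        td_error = rewards[i] + gamma * (phi_next[i] @ theta) - phi[i] @ theta
        step = c0 * c / (c + n)
        theta += step * td_error * phi[i]  # O(d) work per update
    return theta
```

On a dataset whose LSTD matrix is positive definite, the iterate returned by `batch_td0` should approach the output of `lstd_solution` as `n_iters` grows; how fast depends on the step-size constants, which would need tuning on real data.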
Taxonomy
Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Age of Information Optimization
Methods: Stochastic Gradient Descent
