Concentration bounds for temporal difference learning with linear function approximation: The case of batch data and uniform sampling

L.A. Prashanth; Nathaniel Korda; Rémi Munos

arXiv:1306.2557 · cs.LG · January 27, 2020 · 1 citation

TL;DR

This paper introduces a stochastic approximation method for policy evaluation with linear function approximation. The method lowers the computational complexity of LSTD by a factor of $O(d)$ without affecting the convergence rate of the approximate value function, which makes it suitable for large datasets.

Contribution

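The paper proposes a stochastic approximation (SA) scheme for LSTD that draws samples uniformly at random from a fixed dataset. It proves non-asymptotic error bounds for this scheme and shows that it matches the convergence rate of LSTD at a lower computational cost.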
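The two estimators can be written down for comparison. This is a sketch under assumed notation; the paper's own symbols may differ. It uses a batch of $T$ transitions $(s_i, r_i, s_i')$, a feature map $\phi$, a discount factor $\gamma$, and step sizes $\alpha_n$. LSTD solves a $d \times d$ linear system built from the whole batch. The randomized scheme instead runs TD(0) steps on transitions indexed by $i_n$, drawn uniformly from $\{1,\dots,T\}$:

```latex
\begin{aligned}
\bar A_T &= \frac{1}{T}\sum_{i=1}^{T} \phi(s_i)\big(\phi(s_i) - \gamma\,\phi(s_i')\big)^{\top},
\qquad
\bar b_T = \frac{1}{T}\sum_{i=1}^{T} r_i\,\phi(s_i),
\qquad
\hat\theta_T = \bar A_T^{-1}\,\bar b_T, \\[4pt]
\theta_{n+1} &= \theta_n + \alpha_n\big(r_{i_n} + \gamma\,\theta_n^{\top}\phi(s'_{i_n}) - \theta_n^{\top}\phi(s_{i_n})\big)\,\phi(s_{i_n}),
\qquad i_n \sim \mathrm{Uniform}\{1,\dots,T\}, \\[4pt]
\mathbb{E}\big[\theta_{n+1} - \theta_n \,\big|\, \theta_n\big] &= \alpha_n\big(\bar b_T - \bar A_T\,\theta_n\big).
\end{aligned}
```

The last line follows by averaging the TD increment over the uniform index. It shows that the only fixed point of the mean dynamics is the LSTD solution $\hat\theta_T$ whenever $\bar A_T$ is invertible.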

Findings

1. Achieves an $O(d)$ improvement in computational complexity over regular LSTD, where $d$ is the feature dimension.
2. Provides finite-time bounds, both in high probability and in expectation.
3. Demonstrates practical efficiency on traffic control and news recommendation tasks.
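The sketch below contrasts the per-sample cost of the two update rules, which is where the complexity gap comes from. One common way to run LSTD incrementally is to maintain $\bar A^{-1}$ with Sherman–Morrison rank-one updates. Each update costs $O(d^2)$ work, while a linear TD(0) step touches only a few length-$d$ vectors.

The following are assumptions for illustration, not the paper's implementation:

- the regularised start $\delta I$,
- the bounded synthetic feature path,
- the function names `sherman_morrison_update` and `td_update`.

```python
import numpy as np


def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} from A^{-1}: matrix-vector products and an
    outer product, so O(d^2) work per sample."""
    Au = A_inv @ u                      # A^{-1} u
    vA = v @ A_inv                      # v^T A^{-1}
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)


def td_update(theta, phi_s, phi_next, reward, gamma, step):
    """One linear TD(0) step: two inner products and a scaled vector, O(d)."""
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    return theta + step * td_error * phi_s


rng = np.random.default_rng(1)
n, d, gamma, delta = 50, 4, 0.9, 1.0
# Hypothetical sample path of n + 1 feature vectors with ||phi|| <= 1; together
# with delta = 1 this keeps every intermediate A positive definite (invertible).
feats = rng.uniform(-1.0, 1.0, size=(n + 1, d)) / np.sqrt(d)

A = delta * np.eye(d)                   # assumed regularised initialisation
A_inv = np.eye(d) / delta
for k in range(n):
    u, v = feats[k], feats[k] - gamma * feats[k + 1]
    A += np.outer(u, v)                 # explicit A, kept only for checking
    A_inv = sherman_morrison_update(A_inv, u, v)

theta = td_update(np.zeros(d), feats[0], feats[1], reward=1.0, gamma=gamma, step=0.1)
print(np.allclose(A_inv, np.linalg.inv(A)))
```

The explicit matrix `A` is built only to check the recursive inverse against `np.linalg.inv`. A real incremental LSTD would store only `A_inv`, which still needs $O(d^2)$ memory and time per sample.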

Abstract

We propose a stochastic approximation (SA) based method with randomization of samples for policy evaluation using the least squares temporal difference (LSTD) algorithm. Our proposed scheme is equivalent to running regular temporal difference learning with linear function approximation, albeit with samples picked uniformly from a given dataset. Our method yields an $O(d)$ improvement in complexity compared with LSTD, where $d$ is the dimension of the data. We provide non-asymptotic bounds for our proposed method, both in high probability and in expectation, under the assumption that the matrix underlying the LSTD solution is positive definite. This assumption can be easily satisfied for the pathwise LSTD variant proposed in [23]. Moreover, we also establish that using our method in place of LSTD does not impact the rate of convergence of the approximate value function to…
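The abstract's central claim is that TD(0) run on uniformly resampled transitions tracks the batch LSTD solution. The sketch below checks this on a hypothetical synthetic dataset.

Everything in it is an illustrative assumption, not the authors' code or experimental setup:

- the data generator,
- the step-size schedule $\alpha_n = \alpha_0 / (1 + n/n_0)$,
- the function names `lstd_solution` and `randomized_td`.

Next-state features are a cyclic shift of the current ones, as along a single sample path. This guarantees the positive-definiteness assumption the paper relies on: $x^\top \bar A_T x \ge (1-\gamma)\lVert\Phi x\rVert^2 / T > 0$.

```python
import numpy as np


def lstd_solution(phi, phi_next, rewards, gamma):
    """Batch LSTD: solve A theta = b with A, b averaged over all T transitions."""
    T = phi.shape[0]
    A = phi.T @ (phi - gamma * phi_next) / T  # d x d, O(T d^2) to build
    b = phi.T @ rewards / T                   # length-d vector
    return np.linalg.solve(A, b)              # O(d^3) dense solve


def randomized_td(phi, phi_next, rewards, gamma, n_iters, alpha0, n0, rng):
    """TD(0) with linear features; each step uses one transition drawn
    uniformly at random (with replacement) from the fixed batch, O(d) per step."""
    T, d = phi.shape
    theta = np.zeros(d)
    indices = rng.integers(T, size=n_iters)   # i_n ~ Uniform{0, ..., T-1}
    for n, i in enumerate(indices):
        step = alpha0 / (1.0 + n / n0)        # assumed decaying step-size schedule
        td_error = rewards[i] + gamma * phi_next[i] @ theta - phi[i] @ theta
        theta += step * td_error * phi[i]
    return theta


# Hypothetical synthetic batch of T transitions with d features. Next-state
# features are a cyclic shift of the current ones (a sample path that wraps
# around), which keeps A positive definite: x^T A x > 0 for x != 0.
rng = np.random.default_rng(0)
T, d, gamma = 200, 5, 0.9
phi = rng.normal(size=(T, d))
phi_next = np.roll(phi, -1, axis=0)
w_true = np.ones(d)
rewards = (phi - gamma * phi_next) @ w_true + 0.1 * rng.normal(size=T)

theta_lstd = lstd_solution(phi, phi_next, rewards, gamma)
theta_td = randomized_td(phi, phi_next, rewards, gamma,
                         n_iters=50_000, alpha0=0.05, n0=1_000, rng=rng)
print(np.linalg.norm(theta_td - theta_lstd))
```

With a decaying step size the TD iterate only approaches the LSTD fixed point, so the printed distance should be small but not zero. The LSTD path pays $O(d^2)$ per sample to build $\bar A_T$, whereas each TD step costs $O(d)$.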

Taxonomy

Topics: Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques · Age of Information Optimization

Methods: Stochastic Gradient Descent