Two Time-scale Off-Policy TD Learning: Non-asymptotic Analysis over Markovian Samples
Tengyu Xu, Shaofeng Zou, Yingbin Liang

TL;DR
This paper provides the first non-asymptotic convergence analysis of the two time-scale TDC algorithm under Markovian samples, establishes convergence rates for diminishing and constant stepsizes, and proposes a new blockwise diminishing stepsize method.
Contribution
It introduces a non-asymptotic analysis of two time-scale TDC with Markovian data and proposes a blockwise diminishing stepsize algorithm for improved convergence.
Findings
With diminishing stepsize, TDC converges at a rate of O(log t / t^(2/3)).
With constant stepsize, TDC converges exponentially fast, but only up to a non-vanishing error.
The proposed blockwise diminishing stepsize TDC converges at a linear rate to within an arbitrarily small error of the optimum.
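The three stepsize regimes named in the findings can be sketched as simple schedules. This is only an illustration: the function names, the constants, the decay exponent, the fixed block length, and the halving factor are all assumptions made for the example, not values taken from the paper.

```python
def diminishing_stepsize(t, c=1.0, exponent=2.0 / 3.0):
    """Stepsize that shrinks polynomially with the iteration index t.

    The constant c and the exponent are illustrative placeholders.
    """
    return c / (1 + t) ** exponent


def constant_stepsize(t, c=0.1):
    """Stepsize that ignores t; fast initial progress, residual error remains."""
    return c


def blockwise_stepsize(t, c0=0.5, block_len=100, decay=0.5):
    """Stepsize held constant inside each block and reduced between blocks.

    Assumes equal-length blocks and a geometric decay factor, both
    hypothetical choices made to keep the sketch short.
    """
    block = t // block_len  # index of the block containing iteration t
    return c0 * decay ** block
```

The blockwise schedule tries to combine the two other regimes: within a block it behaves like a constant stepsize, and the reduction between blocks shrinks the error that a constant stepsize would leave behind.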
Abstract
Gradient-based temporal difference (GTD) algorithms are widely used in off-policy learning scenarios. Among them, the two time-scale TD with gradient correction (TDC) algorithm has been shown to have superior performance. In contrast to previous studies that characterized the non-asymptotic convergence rate of TDC only under independent and identically distributed (i.i.d.) data samples, we provide the first non-asymptotic convergence analysis for two time-scale TDC under a non-i.i.d. Markovian sample path and linear function approximation. We show that the two time-scale TDC can converge as fast as O(log t / t^(2/3)) under diminishing stepsize, and can converge exponentially fast under constant stepsize, but at the cost of a non-vanishing error. We further propose a TDC algorithm with blockwise diminishing stepsize, and show that it asymptotically converges with an arbitrarily small…
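For readers unfamiliar with TDC, one update step under linear function approximation can be sketched as follows. This is a minimal sketch of a commonly used form of the off-policy TDC update, not the paper's exact algorithm: it omits any projection step, the function and argument names are hypothetical, and the exact placement of the importance ratio rho varies between presentations.

```python
import numpy as np


def tdc_step(theta, w, phi, phi_next, reward, rho, alpha, beta, gamma):
    """One two time-scale TDC update with linear value estimate phi @ theta.

    theta    : main parameter, updated with stepsize alpha
    w        : auxiliary parameter, updated with stepsize beta
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance sampling ratio (target policy / behavior policy)
    """
    # Temporal difference error under the current value estimate.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)
    # Main iterate: TD step plus the gradient correction term involving w.
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)
    # Auxiliary iterate: tracks a least-squares fit of the TD error on phi.
    w_new = w + beta * (rho * delta * phi - (phi @ w) * phi)
    return theta_new, w_new


# Usage on a single hand-made transition (all values are illustrative).
theta0 = np.zeros(2)
w0 = np.zeros(2)
theta1, w1 = tdc_step(
    theta0, w0,
    phi=np.array([1.0, 0.0]),
    phi_next=np.array([0.0, 1.0]),
    reward=1.0, rho=1.0, alpha=0.1, beta=0.5, gamma=0.9,
)
```

The two stepsizes alpha and beta are what make the scheme "two time-scale": the two iterates are updated at different rates, which is the setting the analysis above addresses.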
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Age of Information Optimization
