Backstepping Temporal Difference Learning

Han-Dong Lim; Donghwan Lee

arXiv:2302.09875 · cs.LG · April 21, 2025

TL;DR

This paper introduces a new off-policy TD-learning algorithm derived with the backstepping technique from nonlinear control theory. The algorithm is convergent, and experiments confirm convergence in environments where standard TD-learning is known to be unstable.

Contribution

It presents a unified control-theoretic framework for off-policy TD algorithms and proposes a new convergent method based on backstepping techniques.

Findings

01

The proposed algorithm is convergent under off-policy learning with linear function approximation, the setting in which standard TD-learning can diverge.

02

Experiments confirm convergence in environments where standard TD-learning is known to be unstable.

03

A unified control-theoretic view covers existing off-policy TD algorithms such as GTD and TDC and clarifies why they are stable.

Abstract

Off-policy learning ability is an important feature of reinforcement learning (RL) for practical applications. However, even one of the most elementary RL algorithms, temporal-difference (TD) learning, is known to suffer from divergence when the off-policy scheme is used together with linear function approximation. To overcome this divergent behavior, several off-policy TD-learning algorithms have been developed, including gradient-TD learning (GTD) and TD-learning with correction (TDC). In this work, we provide a unified view of such algorithms from a purely control-theoretic perspective and propose a new convergent algorithm. Our method relies on the backstepping technique, which is widely used in nonlinear control theory. Finally, convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
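The divergence mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's backstepping algorithm and not one of its experiments. It uses a well-known two-state counterexample in which only the transition from a state with feature 1 to a state with feature 2 is ever updated, with zero reward. The function name, step sizes, discount factor, and step count are illustrative assumptions, and the TDC update is written in its commonly stated form with an auxiliary weight `w`.

```python
def run_two_state_example(gamma=0.9, alpha=0.1, beta=0.1, steps=2000):
    """Compare off-policy TD(0) and TDC on a two-state counterexample.

    Assumed setup (illustrative, not taken from the paper): a single
    scalar weight, features phi=1 for the current state and phi_next=2
    for the next state, zero reward, and updates applied only on this
    one transition.
    """
    phi, phi_next, reward = 1.0, 2.0, 0.0

    theta_td = 1.0            # weight learned by standard TD(0)
    theta_tdc, w = 1.0, 0.0   # TDC weight and its auxiliary weight

    for _ in range(steps):
        # Standard TD(0): theta <- theta + alpha * delta * phi
        delta = reward + gamma * phi_next * theta_td - phi * theta_td
        theta_td += alpha * delta * phi

        # TDC: same TD error, plus a correction term driven by w.
        delta = reward + gamma * phi_next * theta_tdc - phi * theta_tdc
        theta_new = theta_tdc + alpha * (delta * phi - gamma * phi_next * (phi * w))
        w += beta * (delta - phi * w) * phi
        theta_tdc = theta_new

    return theta_td, theta_tdc


if __name__ == "__main__":
    theta_td, theta_tdc = run_two_state_example()
    print("TD(0) weight magnitude:", abs(theta_td))
    print("TDC weight magnitude:  ", abs(theta_tdc))
```

With these assumed settings the TD(0) weight is multiplied by 1 + alpha * (2 * gamma - 1) at every step, so it grows without bound whenever gamma exceeds 0.5, while the TDC weight shrinks toward the correct value of zero.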

Videos

Backstepping Temporal Difference Learning · SlidesLive
