Regularized Off-Policy TD-Learning

Bo Liu; Sridhar Mahadevan; Ji Liu

arXiv:2006.05314·cs.LG·June 11, 2020·19 cites

TL;DR

This paper introduces RO-TD, an $l_{1}$-regularized off-policy TD-learning algorithm that learns sparse value function representations at low computational cost, converges under off-policy sampling, and performs feature selection as it learns.

Contribution

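RO-TD combines off-policy convergent gradient TD methods, such as TDC, with a convex-concave saddle-point formulation of non-smooth convex regularization. This combination enables first-order solvers, sparse feature selection, and low computational complexity.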

Findings

01 RO-TD converges off-policy

02 RO-TD effectively selects sparse features

03 RO-TD demonstrates low computational cost

Abstract

We present a novel $l_{1}$-regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments illustrate the off-policy convergence, sparse feature selection capability, and low computational cost of the RO-TD algorithm.
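To make the two ingredients in the abstract concrete, the following is a minimal sketch that pairs one common form of the TDC update with an $l_{1}$ soft-thresholding (proximal) step on the value weights. It is not the paper's algorithm: RO-TD uses a convex-concave saddle-point solver, whereas this sketch substitutes plain proximal shrinkage as a simpler stand-in. The function names, the step sizes `alpha` and `beta`, the penalty `lam`, and the placement of the importance ratio `rho` are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1: shrink each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def tdc_l1_step(theta, w, phi, phi_next, reward,
                gamma=0.9, alpha=0.05, beta=0.1, lam=0.01, rho=1.0):
    """One TDC-style update followed by l1 shrinkage (illustrative only).

    theta    : value-function weights (the vector we want to be sparse)
    w        : auxiliary weights used by TDC's gradient correction
    phi      : feature vector of the current state
    phi_next : feature vector of the next state
    rho      : importance-sampling ratio (assumed; 1.0 means on-policy)
    """
    # TD error under the current value estimate.
    delta = reward + gamma * (phi_next @ theta) - (phi @ theta)

    # TDC update: TD step plus a correction term along phi_next.
    theta_new = theta + alpha * rho * (delta * phi - gamma * (phi @ w) * phi_next)

    # Auxiliary weights track the expected TD error given the features.
    w_new = w + beta * rho * (delta - phi @ w) * phi

    # Stand-in for the l1 regularizer: proximal shrinkage scaled by the step size.
    theta_new = soft_threshold(theta_new, alpha * lam)
    return theta_new, w_new


if __name__ == "__main__":
    theta, w = np.zeros(3), np.zeros(3)
    phi, phi_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
    theta, w = tdc_l1_step(theta, w, phi, phi_next, reward=1.0)
    print(theta, w)
```

A larger `lam` drives more entries of `theta` exactly to zero, which is the sense in which an $l_{1}$ penalty performs feature selection. Each step costs time linear in the number of features, in line with the low per-step cost the abstract emphasizes.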

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Sparse and Compressive Sensing Techniques

Methods: Feature Selection