Regularized Off-Policy TD-Learning
Bo Liu, Sridhar Mahadevan, Ji Liu

TL;DR
This paper introduces RO-TD, a regularized off-policy TD-learning algorithm that efficiently learns sparse value function representations with proven convergence and feature selection capabilities.
Contribution
RO-TD combines off-policy convergent gradient TD methods with online convex regularization, which allows it to learn sparse value function representations at low computational complexity.
Findings
RO-TD converges in the off-policy setting.
It selects sparse features effectively.
It has low computational cost.
Abstract
We present a novel regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.
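The two ingredients named in the abstract, a gradient TD update such as TDC and a sparsity-inducing convex regularizer, can be illustrated with a small sketch. This is not the RO-TD algorithm itself. RO-TD uses a convex-concave saddle-point formulation, whereas the code below swaps in a simpler proximal step (l1 soft-thresholding) applied after one common form of the TDC update. The function names, the parameter names, and the placement of the importance-sampling ratio `rho` are assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def soft_threshold(x, t):
    """Proximal operator of t * |x|: shrink x toward zero by t."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def tdc_l1_step(theta, w, phi, phi_next, reward,
                gamma, alpha, beta, lam, rho=1.0):
    """One TDC update followed by l1 soft-thresholding of theta.

    theta    : value-function weights (the vector we want sparse)
    w        : auxiliary TDC weights
    phi      : features of the current state
    phi_next : features of the next state
    rho      : importance-sampling ratio (assumed form; 1.0 is on-policy)
    lam      : l1 regularization strength (hypothetical parameter name)
    """
    # TD error under the current weights.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    w_phi = dot(w, phi)

    # TDC gradient-correction step on theta, then shrink each coordinate.
    new_theta = [
        soft_threshold(
            th + alpha * rho * (delta * p - gamma * pn * w_phi),
            alpha * lam,
        )
        for th, p, pn in zip(theta, phi, phi_next)
    ]

    # Auxiliary weights track the expected TD error given the features.
    new_w = [wi + beta * (rho * delta - w_phi) * p for wi, p in zip(w, phi)]
    return new_theta, new_w


if __name__ == "__main__":
    theta, w = [0.0, 0.0], [0.0, 0.0]
    theta, w = tdc_l1_step(theta, w, phi=[1.0, 0.0], phi_next=[0.0, 1.0],
                           reward=1.0, gamma=0.9, alpha=0.1, beta=0.1, lam=0.5)
    print(theta, w)
```

The soft-thresholding step sets small coordinates of `theta` exactly to zero, which is one simple way that l1 regularization can yield feature selection. Each step costs time linear in the number of features.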
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Reinforcement Learning in Robotics · Sparse and Compressive Sensing Techniques
Methods: Feature Selection
