Sufficient Exploration for Convex Q-learning
Fan Lu, Prashant Mehta, Sean Meyn, Gergely Neu

TL;DR
This paper studies convex Q-learning, the dual counterpart of logistic Q-learning within the linear programming (LP) approach to optimal control. It shows that the method can succeed in cases where standard Q-learning fails, and it addresses numerical challenges through regularization and state-dependent sampling.
Contribution
It establishes the structure of the dual of convex Q-learning, provides a sufficient condition for a bounded solution to the Q-learning LP, and demonstrates success in cases such as LQR where standard Q-learning diverges.
Findings
Convex Q-learning can succeed where standard Q-learning diverges.
Regularization is necessary to prevent over-fitting in convex Q-learning.
State-dependent sampling mitigates numerical challenges in continuous-time models.
Abstract
In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The…
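For orientation, the LP formulation that the abstract refers to can be sketched in its standard textbook form. The setting below is an assumption made for illustration and is not taken from the paper: a finite state space X and action space U, a cost function c, a transition kernel P, a discount factor γ in (0,1), and strictly positive weights μ. The paper's own formulation (cost criterion, function approximation, regularization) may differ.

```latex
% Assumed setting (not from the paper): finite X and U, discounted cost,
% discount factor \gamma \in (0,1), weights \mu > 0.

% LP characterization of the optimal value function J^*:
\begin{align*}
  \max_{J} \quad & \sum_{x \in X} \mu(x)\, J(x) \\
  \text{s.t.} \quad & J(x) \le c(x,u) + \gamma \sum_{x' \in X} P(x' \mid x,u)\, J(x'),
  \qquad x \in X,\ u \in U .
\end{align*}

% Q-function analogue. The constraint is convex in Q because
% \min_{u'} Q(x',u') is a concave function of Q:
\begin{align*}
  \max_{Q} \quad & \sum_{x \in X} \sum_{u \in U} \mu(x,u)\, Q(x,u) \\
  \text{s.t.} \quad & Q(x,u) \le c(x,u)
     + \gamma \sum_{x' \in X} P(x' \mid x,u)\, \min_{u' \in U} Q(x',u'),
  \qquad x \in X,\ u \in U .
\end{align*}
```

In this assumed setting, every feasible Q lies below the optimal Q-function, which is itself feasible, so the maximizer coincides with it. The boundedness question raised in contribution (ii) arises once Q is restricted to a parameterized family and the constraints are enforced only on sampled data.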
Taxonomy
Topics: Advanced Control Systems Optimization · Control Systems and Identification · Adaptive Dynamic Programming Control
Methods: Q-Learning
