
TL;DR
This paper introduces novel two-step Q-learning algorithms that improve convergence and reduce bias without importance sampling, demonstrating superior performance on benchmark problems.
Contribution
It proposes off-policy two-step Q-learning algorithms and their smooth variants, with convergence proofs and empirical validation showing improved results.
Findings
Algorithms are robust and easy to implement.
Proposed methods outperform existing algorithms on benchmark problems.
Convergence to optimal Q-values is theoretically established.
Abstract
Q-learning is a stochastic approximation version of classical value iteration. The literature has established that Q-learning suffers from both maximization bias and slow convergence. Recently, multi-step algorithms have shown practical advantages over existing methods. This paper proposes novel off-policy two-step Q-learning algorithms that do not require importance sampling. Under suitable assumptions, it is shown that the iterates of the proposed two-step Q-learning are bounded and converge almost surely to the optimal Q-values. This study also addresses the convergence analysis of the smooth version of two-step Q-learning, obtained by replacing the max function with the log-sum-exp function. The proposed algorithms are robust and easy to implement. Finally, we test the proposed algorithms on benchmark problems such as the roulette problem, the maximization bias problem, and randomly generated Markov…
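To make the "smooth version" concrete, the sketch below shows how a log-sum-exp function can stand in for the max operator inside a standard one-step Q-learning update. It is an illustration only, not the paper's two-step rule, whose exact form is not given in this summary. The names smooth_max, q_update, and the temperature parameter tau are hypothetical, and the table layout (a list of per-state action-value lists) is an assumption.

```python
import math


def smooth_max(values, tau=1.0):
    """Log-sum-exp approximation of max(values).

    tau is a hypothetical temperature parameter: the result lies between
    max(values) and max(values) + log(len(values)) / tau, so a larger tau
    tracks the hard max more closely.
    """
    m = max(values)  # shift by the max for numerical stability
    return m + math.log(sum(math.exp(tau * (v - m)) for v in values)) / tau


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9, tau=None):
    """Standard one-step Q-learning update on a tabular q (list of lists).

    With tau=None the target uses the hard max; otherwise it uses the
    log-sum-exp smooth max. This is NOT the paper's two-step algorithm.
    """
    next_values = q[s_next]
    if tau is None:
        target_max = max(next_values)
    else:
        target_max = smooth_max(next_values, tau)
    q[s][a] += alpha * (r + gamma * target_max - q[s][a])
    return q[s][a]


if __name__ == "__main__":
    q = [[0.0, 0.0], [1.0, 2.0]]
    print(q_update(q, s=0, a=1, r=1.0, s_next=1, alpha=0.5))
    print(smooth_max([1.0, 2.0, 3.0], tau=50.0))
```

Because the log-sum-exp value never falls below the hard max, the smooth target is an upper bound on the usual one, with a gap that shrinks as tau grows.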
Taxonomy
Topics: Text and Document Classification Technologies
Methods: Q-Learning
