Borrowing From the Future: An Attempt to Address Double Sampling
Yuhua Zhu, Lexing Ying

TL;DR
This paper introduces new algorithms to address the double sampling problem in model-free reinforcement learning by borrowing future randomness, demonstrating their effectiveness through theoretical analysis and numerical experiments.
Contribution
The paper proposes novel algorithms that mitigate the double sampling issue in stochastic Bellman residual minimization by leveraging future randomness, with theoretical guarantees and empirical validation.
Findings
The training trajectory of the proposed algorithms stays close to that of unbiased stochastic gradient descent when the transition kernel varies slowly with respect to the state.
Numerical results confirm the theoretical advantages in tabular and neural network settings.
The approach effectively reduces bias caused by double sampling in reinforcement learning.
Abstract
For model-free reinforcement learning, one of the main difficulties of stochastic Bellman residual minimization is the double sampling problem: only a single sample of the next state is available in the model-free setting, whereas two independent samples of the next state are required to perform unbiased stochastic gradient descent. We propose new algorithms for addressing this problem based on the idea of borrowing extra randomness from the future. When the transition kernel varies slowly with respect to the state, it is shown that the training trajectory of the new algorithms is close to that of unbiased stochastic gradient descent. Numerical results for policy evaluation in both tabular and neural network settings are provided to confirm the theoretical findings.
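To make the double sampling problem concrete, here is a minimal sketch on a toy one-dimensional chain. Everything in it is an illustrative assumption rather than the authors' setup: the linear value function V(s) = THETA * s, the reward r(s) = s, the Gaussian transition, and all constants. The surrogate next state s + (s2 - s1) is one plausible reading of "borrowing extra randomness from the future" (reusing the increment of the following step); the abstract itself does not spell out the construction.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
GAMMA = 0.9   # discount factor
THETA = 0.5   # parameter of the toy linear value function V(s) = THETA * s
SIGMA = 1.0   # noise scale of the transition
SLOPE = 0.05  # state dependence of the increment (small = "slowly varying")


def step(s, rng):
    """One transition of the toy chain: s' = s + SLOPE * s + SIGMA * noise."""
    return s + SLOPE * s + SIGMA * rng.gauss(0.0, 1.0)


def residual(s, s_next):
    """Bellman residual r(s) + GAMMA * V(s') - V(s), with toy reward r(s) = s."""
    return s + GAMMA * THETA * s_next - THETA * s


def residual_grad(s, s_next):
    """Derivative of the Bellman residual with respect to THETA."""
    return GAMMA * s_next - s


def compare_estimators(s=1.0, n=100_000, seed=0):
    """Return (exact, single_sample, borrowed) gradient values at state s."""
    rng = random.Random(seed)

    # Target: E[residual | s] * E[residual_grad | s]. Both factors are affine
    # in s', so each expectation equals the function evaluated at E[s' | s].
    mean_next = s + SLOPE * s
    exact = residual(s, mean_next) * residual_grad(s, mean_next)

    single = 0.0
    borrowed = 0.0
    for _ in range(n):
        s1 = step(s, rng)   # the one next-state sample that is observed
        s2 = step(s1, rng)  # the state after that, further along the trajectory
        # Hypothetical surrogate for a second sample of s': replay the
        # increment of the following step starting from the current state s.
        s1_surrogate = s + (s2 - s1)
        # Reusing s1 in both factors adds a covariance term (the bias).
        single += residual(s, s1) * residual_grad(s, s1)
        # Using the surrogate in the gradient factor nearly removes it.
        borrowed += residual(s, s1) * residual_grad(s, s1_surrogate)
    return exact, single / n, borrowed / n


if __name__ == "__main__":
    exact, single, borrowed = compare_estimators()
    print(f"exact gradient:        {exact:.4f}")
    print(f"single-sample average: {single:.4f}")
    print(f"borrowed average:      {borrowed:.4f}")
```

In this toy model the single-sample estimator is off by roughly GAMMA**2 * THETA * SIGMA**2, the covariance between the two factors that share s1. The surrogate estimator's leading bias term is that same quantity scaled by SLOPE, so it shrinks as the increment depends less on the state, which is in the spirit of the "slowly varying transition kernel" condition stated in the abstract.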
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Stochastic Gradient Optimization Techniques
