
TL;DR
This paper demonstrates that Deep Q-Networks (DQN) can fail to improve upon initial policies even with unlimited data, and provides a theoretical explanation for this sub-optimality using linear DQN analysis.
Contribution
It shows empirically that DQN can perform worse than the initial policy and offers a theoretical framework explaining why linear DQN converges to sub-optimal fixed points.
Findings
DQN often returns a policy that performs worse than the initial one.
Linear DQN's limit points are fixed points of projected Bellman operators.
The greedy policies at these fixed points need not be near-optimal, or even good.
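For readers who want the fixed-point statement in symbols, here is the standard textbook form of a projected Bellman fixed point under linear function approximation. The notation (\Phi, \mu, D_\mu, \theta^\ast) is assumed for illustration and is not taken from the paper; in particular, under ε-greedy exploration the weighting distribution \mu would itself depend on \theta, which is presumably why the finding speaks of projected Bellman operators in the plural.

```latex
% Linear parameterization (notation assumed for illustration)
Q_\theta(s,a) = \phi(s,a)^\top \theta, \qquad Q_\theta = \Phi\theta

% Bellman optimality operator
(TQ)(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a')

% Projection onto the span of \Phi, weighted by a state-action distribution \mu
\Pi_\mu = \Phi \left(\Phi^\top D_\mu \Phi\right)^{-1} \Phi^\top D_\mu,
\qquad D_\mu = \operatorname{diag}(\mu)

% Projected Bellman fixed-point condition
\Phi\theta^\ast = \Pi_\mu\, T(\Phi\theta^\ast)
```

Because \Pi_\mu T is generally not a contraction, such a \theta^\ast can sit far from the optimal Q-value function, which is consistent with the third finding.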
Abstract
A primary requirement for any reinforcement learning method is that it should produce policies that improve upon the initial guess. In this work, we show that the widely used Deep Q-Network (DQN) fails to satisfy this minimal criterion -- even when it gets to see all possible states and actions infinitely often (a condition under which tabular Q-learning is guaranteed to converge to the optimal Q-value function). Our specific contributions are twofold. First, we numerically show that DQN often returns a policy that performs worse than the initial one. Second, we offer a theoretical explanation for this phenomenon in linear DQN, a simplified version of DQN that uses linear function approximation in place of neural networks while retaining the other key components such as ε-greedy exploration, experience replay, and target network. Using tools from differential inclusion theory,…
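The linear DQN described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the random toy MDP, the feature tensor, and every hyperparameter below are assumptions chosen only to show how ε-greedy exploration, experience replay, and a target network fit together around a linear Q-function.

```python
import numpy as np

def linear_dqn(P, R, Phi, gamma=0.9, eps=0.1, alpha=0.05,
               buffer_size=500, batch_size=16, target_period=50,
               steps=2000, seed=0):
    """Linear DQN sketch with Q(s, a) = Phi[s, a] @ theta.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    Phi: (S, A, d) features. All hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    S, A, d = Phi.shape
    theta = np.zeros(d)           # online weights
    theta_target = theta.copy()   # target-network weights
    buffer = []                   # replay buffer of (s, a, r, s_next)
    s = int(rng.integers(S))
    for t in range(steps):
        # epsilon-greedy action selection from the online weights
        if rng.random() < eps:
            a = int(rng.integers(A))
        else:
            a = int(np.argmax(Phi[s] @ theta))
        s_next = int(rng.choice(S, p=P[s, a]))
        buffer.append((s, a, R[s, a], s_next))
        if len(buffer) > buffer_size:
            buffer.pop(0)         # drop the oldest transition
        # semi-gradient step on a uniformly sampled minibatch
        grad = np.zeros(d)
        for i in rng.integers(len(buffer), size=batch_size):
            bs, ba, br, bn = buffer[i]
            target = br + gamma * np.max(Phi[bn] @ theta_target)
            td_error = target - Phi[bs, ba] @ theta
            grad += td_error * Phi[bs, ba]
        theta += alpha * grad / batch_size
        # periodically copy the online weights into the target network
        if (t + 1) % target_period == 0:
            theta_target = theta.copy()
        s = s_next
    return theta

def greedy_policy(Phi, theta):
    """Greedy action per state under Q = Phi @ theta."""
    return np.argmax(Phi @ theta, axis=1)

# Hypothetical toy problem: 4 states, 2 actions, 3 random features.
rng = np.random.default_rng(1)
S, A, d = 4, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S)
R = rng.random((S, A))
Phi = rng.standard_normal((S, A, d))
theta = linear_dqn(P, R, Phi)
policy = greedy_policy(Phi, theta)
```

With fewer features than state-action pairs (here 3 versus 8), the learned weights cannot represent every Q-function, so the greedy policy returned by this sketch need not be optimal for the toy MDP.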
Taxonomy
Topics: Reinforcement Learning in Robotics · Advancements in Semiconductor Devices and Circuit Design · Evolutionary Algorithms and Applications
Methods: Sarsa
