Learning Dynamics and Generalization in Reinforcement Learning
Clare Lyle, Mark Rowland, Will Dabney, Marta Kwiatkowska, Yarin Gal

TL;DR
This paper analyzes the learning dynamics of temporal difference (TD) algorithms in reinforcement learning. It shows that TD learning favors fitting non-smooth components of the value function early in training and discourages generalization between states, with implications for robustness and transferability.
Contribution
It provides a theoretical and empirical analysis of how temporal difference learning affects generalization and proposes policy distillation as a solution.
Findings
TD learning encourages fitting non-smooth value components early.
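On dense-reward tasks, networks trained with TD algorithms generalize less between states than randomly initialized networks or networks trained with policy gradient methods.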
Policy distillation improves generalization and robustness.
Abstract
Solving a reinforcement learning (RL) problem poses two competing challenges: fitting a potentially discontinuous value function, and generalizing well to new observations. In this paper, we analyze the learning dynamics of temporal difference algorithms to gain novel insight into the tension between these two objectives. We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training, and at the same time induces the second-order effect of discouraging generalization. We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. Finally, we investigate how…
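To make "generalization between states" concrete, the following is a minimal sketch and not the paper's method or experimental setup. It assumes a linear value function and a standard semi-gradient TD(0) update, and measures how much a single update at one state changes the value estimate at another state. The function names, feature vectors, step size, and discount are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: semi-gradient TD(0) with a linear value function.
# Every name and number here is a hypothetical choice, not taken from the paper.

def value(w, phi):
    """Linear value estimate V(s) = w . phi(s)."""
    return sum(wi * xi for wi, xi in zip(w, phi))

def td0_update(w, phi_s, reward, phi_next, alpha=0.1, gamma=0.9):
    """One semi-gradient TD(0) step; returns the new weight vector."""
    td_error = reward + gamma * value(w, phi_next) - value(w, phi_s)
    return [wi + alpha * td_error * xi for wi, xi in zip(w, phi_s)]

def interference(w, phi_s, reward, phi_next, phi_other, **kw):
    """Change in V(other) caused by a single TD(0) update at state s."""
    w_new = td0_update(w, phi_s, reward, phi_next, **kw)
    return value(w_new, phi_other) - value(w, phi_other)

w0 = [0.0, 0.0]
phi_s, phi_next = [1.0, 0.0], [0.0, 1.0]

# A state whose features overlap with s is affected by the update at s;
# a state with orthogonal features is not.
shared = interference(w0, phi_s, 1.0, phi_next, [0.5, 0.5])
isolated = interference(w0, phi_s, 1.0, phi_next, [0.0, 1.0])
```

In this toy setting the effect of an update on another state is governed by the overlap between feature vectors. With a neural network the features themselves change during training, which is the setting the abstract's claims about weaker generalization refer to.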
Taxonomy
Topics: Reinforcement Learning in Robotics
