Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory
Yufeng Zhang, Qi Cai, Zhuoran Yang, Yongxin Chen, Zhaoran Wang

TL;DR
Using mean-field theory, this paper analyzes how the feature representation of an overparameterized neural network evolves under temporal-difference and Q-learning. It shows that the representation converges to the optimal one and that the Bellman error is globally minimized.
Contribution
It provides a novel mean-field theoretical framework demonstrating convergence and optimality of feature representations in neural network-based reinforcement learning.
Findings
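With an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate.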
The learned feature representation converges to the optimal one.
The analysis extends to soft Q-learning, which is further connected to policy gradient.
Abstract
Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into the latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following questions: When the function approximator is a neural network, how does the associated feature representation evolve? If it converges, does it converge to the optimal one?…
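As a point of reference for the abstract's remark that temporal-difference learning converges when the approximator is linear in a fixed feature representation, the following is a minimal sketch of semi-gradient TD(0) on a small random Markov reward process. It is not the paper's method or experimental setup: the helper names (make_chain, true_values, linear_td0), the toy chain, the constant step size, and the one-hot features are all assumptions made for illustration.

```python
import numpy as np


def make_chain(n_states=5, seed=0):
    """Hypothetical toy Markov reward process: random transitions and rewards."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)  # normalize each row into a distribution
    r = rng.random(n_states)           # deterministic reward for each state
    return P, r


def true_values(P, r, gamma):
    """Solve the Bellman equation V = r + gamma * P V exactly."""
    return np.linalg.solve(np.eye(len(r)) - gamma * P, r)


def linear_td0(P, r, features, gamma=0.9, lr=0.05, steps=20000, seed=0):
    """Semi-gradient TD(0) with V(s) = features[s] @ w and FIXED features."""
    rng = np.random.default_rng(seed)
    n = len(r)
    w = np.zeros(features.shape[1])
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n, p=P[s])
        # TD error: one-step bootstrapped target minus the current estimate.
        td_error = r[s] + gamma * features[s_next] @ w - features[s] @ w
        # The features never change; only the linear weights are updated.
        w += lr * td_error * features[s]
        s = s_next
    return w


if __name__ == "__main__":
    gamma = 0.9
    P, r = make_chain()
    features = np.eye(len(r))  # assumption: one-hot (tabular) features
    w = linear_td0(P, r, features, gamma=gamma)
    gap = np.max(np.abs(features @ w - true_values(P, r, gamma)))
    print("max abs error vs. exact values:", gap)
```

In this sketch the feature matrix is held fixed and only the weight vector w moves. The setting the paper studies differs in that the features themselves are parameters of a neural network and evolve during training.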
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
Methods: Q-Learning
