Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory

Yufeng Zhang; Qi Cai; Zhuoran Yang; Yongxin Chen; Zhaoran Wang

arXiv:2006.04761 · cs.LG · April 2, 2024 · 5 cites

TL;DR

This paper uses mean-field theory to analyze how the feature representation of an overparameterized neural network evolves under temporal-difference and Q-learning. It shows that the representation converges to the optimal one and that the Bellman error is globally minimized.

Contribution

The paper provides a mean-field theoretical framework showing that the feature representations learned in neural network-based reinforcement learning converge and that the limiting representation is optimal.

Findings

01. Neural networks minimize Bellman error at a sublinear rate.

02. Feature representations converge to the optimal one.

03. Analysis extends to soft Q-learning and policy gradient methods.

Abstract

Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into the latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following questions: When the function approximator is a neural network, how does the associated feature representation evolve? If it converges, does it converge to the optimal one?…
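The abstract's remark that temporal-difference learning converges when the function approximator is linear in a fixed feature representation can be illustrated with a minimal sketch. This is not the paper's neural network setting or its analysis. The function name `td0_linear`, the three-state chain, the rewards, and the step size are all hypothetical choices made for illustration, and the features are simply one-hot (tabular) so that the exact value function is representable.

```python
import numpy as np

def td0_linear(features, transition, rewards, gamma=0.9, lr=0.05,
               steps=20000, seed=0):
    """Semi-gradient TD(0) with a value function linear in FIXED features.

    features:   (n_states, d) feature matrix, held fixed during learning
    transition: (n_states, n_states) row-stochastic matrix of the policy
    rewards:    (n_states,) reward received on leaving each state
    All inputs are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_states, d = features.shape
    w = np.zeros(d)                      # weights of the linear value function
    s = rng.integers(n_states)           # random initial state
    for _ in range(steps):
        s_next = rng.choice(n_states, p=transition[s])
        # TD error: bootstrapped target minus current estimate
        td_error = rewards[s] + gamma * features[s_next] @ w - features[s] @ w
        # Semi-gradient step: only the current estimate is differentiated
        w += lr * td_error * features[s]
        s = s_next
    return w

# Hypothetical three-state cycle 0 -> 1 -> 2 -> 0 with one-hot features.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
r = np.array([1.0, 0.0, 2.0])
w = td0_linear(np.eye(3), P, r)

# Reference solution of the Bellman equation V = r + gamma * P V.
v_exact = np.linalg.solve(np.eye(3) - 0.9 * P, r)
print("TD(0) estimate:", w)
print("Exact values:  ", v_exact)
```

Because the features never change, each update is a linear stochastic-approximation step. The question the abstract raises is what happens when the features themselves are learned by a neural network, which this sketch does not attempt to model.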

Videos

Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory · slideslive

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning

Methods: Q-Learning