An Analysis of Action-Value Temporal-Difference Methods That Learn State Values
Brett Daley, Prabhat Nagarajan, Martha White, Marlos C. Machado

TL;DR
This paper analyzes the convergence and sample efficiency of action-value TD methods that learn state values as an intermediate step, and introduces a new algorithm, RDQ, that significantly outperforms Dueling DQN on the MinAtar benchmark.
Contribution
The paper provides a theoretical comparison of the QV-learning and AV-learning families and introduces RDQ, a novel AV-learning algorithm.
Findings
AV-learning offers major benefits over Q-learning in control tasks.
Both families (QV-learning and AV-learning) outperform Expected Sarsa in prediction tasks.
RDQ significantly outperforms Dueling DQN in MinAtar benchmarks.
Abstract
The hallmark feature of temporal-difference (TD) learning is bootstrapping: using value predictions to generate new value predictions. The vast majority of TD methods for control learn a policy by bootstrapping from a single action-value function (e.g., Q-learning and Sarsa). Significantly less attention has been given to methods that bootstrap from two asymmetric value functions: i.e., methods that learn state values as an intermediate step in learning action values. Existing algorithms in this vein can be categorized as either QV-learning or AV-learning. Though these algorithms have been investigated to some degree in prior work, it remains unclear if and when it is advantageous to learn two value functions instead of just one -- and whether such approaches are theoretically sound in general. In this paper, we analyze these algorithmic families in terms of convergence and sample…
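To make the idea of bootstrapping from two asymmetric value functions concrete, here is a minimal tabular sketch of a QV-learning-style update as it is commonly described in earlier literature. It is not taken from this paper: the function name `qv_learning_update`, the shared step size `alpha`, and the toy table sizes are assumptions made for illustration, and the paper's own formulation may differ (for example, by using separate step sizes for the two tables).

```python
import numpy as np

def qv_learning_update(Q, V, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular QV-learning-style step in place; return the shared target.

    Hypothetical helper for illustration only, not the paper's implementation.
    """
    # Both value functions bootstrap from the state value of the next state.
    target = r + (0.0 if done else gamma * V[s_next])
    # State values are learned with a TD(0)-style update ...
    V[s] += alpha * (target - V[s])
    # ... and action values regress toward the same V-based target.
    Q[s, a] += alpha * (target - Q[s, a])
    return target

# Toy problem (assumed sizes): 3 states, 2 actions, all values start at zero.
Q = np.zeros((3, 2))
V = np.zeros(3)
target = qv_learning_update(Q, V, s=0, a=1, r=1.0, s_next=2, done=False)
```

The contrast with Q-learning or Sarsa is that the action-value table never bootstraps from itself here; the state-value table is the intermediate that supplies the bootstrap target for both updates.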
Taxonomy
Topics: Complex Systems and Decision Making
