Foundations of Multivariate Distributional Reinforcement Learning
Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland

TL;DR
This paper develops the first oracle-free, computationally tractable algorithms for provably convergent multivariate distributional reinforcement learning, and it shows how the fidelity of approximate return distribution representations depends on the reward dimension.
Contribution
The paper introduces oracle-free, computationally tractable algorithms for multivariate distributional dynamic programming and temporal difference learning, with convergence guarantees and new analysis techniques for multivariate return distributions.
Findings
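The convergence rates of the multivariate algorithms match the familiar rates from the scalar reward setting.
The standard analysis of categorical TD learning fails when the reward dimension is larger than 1; the paper resolves this with a novel projection onto the space of mass-1 signed measures.
Tradeoffs between distribution representations influence the performance of multivariate distributional RL in practice.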
Abstract
In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate distributional dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than 1, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-1 signed measures. Finally, with the aid of our technical results and simulations, we identify…
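As a rough illustration of what a projection onto mass-1 signed measures can look like, the sketch below fits weights on a fixed grid of atoms in two reward dimensions. It is not taken from the paper: the Gaussian kernel, the MMD-style least-squares objective, the fixed grid, and the names `gaussian_kernel` and `project_to_mass_one_signed_measure` are all assumptions made for illustration. The only constraint imposed is that the weights sum to 1, so individual weights may be negative.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between rows of a (m, d) and rows of b (n, d)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def project_to_mass_one_signed_measure(atoms, samples, weights, bandwidth=1.0):
    """Hypothetical helper: weights p on `atoms` minimizing the kernel (MMD)
    distance to the weighted-sample target, subject only to sum(p) == 1.
    Entries of p are allowed to be negative (a signed measure)."""
    K = gaussian_kernel(atoms, atoms, bandwidth)               # (m, m) Gram matrix
    b = gaussian_kernel(atoms, samples, bandwidth) @ weights   # kernel mean at atoms
    ones = np.ones(len(atoms))
    # Equality-constrained least squares: p = K^{-1}(b + lam * 1),
    # with lam chosen so that the weights sum to 1.
    K_inv_b = np.linalg.solve(K, b)
    K_inv_1 = np.linalg.solve(K, ones)
    lam = (1.0 - K_inv_b.sum()) / K_inv_1.sum()
    return K_inv_b + lam * K_inv_1

# Usage: a 3x3 grid of atoms and a two-point target distribution (assumed data).
grid = np.array([[x, y] for x in (0.0, 1.0, 2.0) for y in (0.0, 1.0, 2.0)])
target_samples = np.array([[0.5, 0.5], [1.5, 1.0]])
target_weights = np.array([0.5, 0.5])
p = project_to_mass_one_signed_measure(grid, target_samples, target_weights)
```

Because only the total mass is constrained, the solve is a single linear system rather than a constrained optimization over the probability simplex. Whether this matches the paper's construction in detail is not something the abstract confirms.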
Taxonomy
Topics: Food Supply Chain Traceability · Statistical and Computational Modeling
