An Online Multiobjective Policy Gradient for Long-run Average-reward Markov Decision Process
Rahul Misra, Manuela L. Bujorianu, Rafał Wisniewski

TL;DR
This paper introduces an online multi-objective policy gradient algorithm for Markov Decision Processes that guarantees convergence of the long-run average reward vector to a target set using Blackwell's approachability theorem.
Contribution
It develops a novel RL framework with a dynamic scalarization mechanism based on approachability theory for multi-objective optimization.
Findings
Under an ergodicity assumption, the time-averaged reward vector converges asymptotically to the target set.
The method effectively handles multiple objectives in long-run average reward settings.
Numerical validation demonstrates the approach's practical viability.
Abstract
We propose a reinforcement learning (RL) framework for multi-objective decision-making, where the agent seeks to optimize a vector of rewards rather than a single scalar value. The objective is to ensure that the time-averaged reward vector converges asymptotically to a predefined target set. Since standard RL algorithms operate on scalar rewards, we introduce a dynamic scalarization mechanism guided by Blackwell's Approachability Theorem. This theorem enables adaptive updates of the scalarization vector to guarantee convergence toward the target set. Assuming ergodicity, the Markov chain induced by the learned policies admits a stationary distribution, ensuring all states recur with finite return times. Our algorithm exploits this property by defining an inner loop that applies a policy gradient method (with baseline) between successive visits to a designated recurrent state, enforcing…
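The dynamic scalarization described in the abstract can be illustrated with a minimal sketch of a Blackwell-style update. This is not the authors' implementation: the box-shaped target set, the sign convention (the scalar reward is the inner product with the unit vector pointing from the running average toward its projection onto the set), and all function names are assumptions made for illustration only.

```python
import math
from typing import List, Sequence


def project_onto_box(x: Sequence[float], lower: Sequence[float],
                     upper: Sequence[float]) -> List[float]:
    """Euclidean projection of x onto the box [lower, upper] (assumed target set)."""
    return [min(max(xi, lo), hi) for xi, lo, hi in zip(x, lower, upper)]


def scalarization_direction(avg_reward: Sequence[float], lower: Sequence[float],
                            upper: Sequence[float]) -> List[float]:
    """Unit vector from the running average toward its projection on the target set.

    Returns the zero vector when the average already lies inside the set.
    """
    proj = project_onto_box(avg_reward, lower, upper)
    diff = [p - a for p, a in zip(proj, avg_reward)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return [0.0] * len(diff)
    return [d / norm for d in diff]


def scalarize(reward: Sequence[float], direction: Sequence[float]) -> float:
    """Scalar reward handed to a standard RL update: inner product with the direction."""
    return sum(l * r for l, r in zip(direction, reward))


def update_average(avg: Sequence[float], reward: Sequence[float], t: int) -> List[float]:
    """Running mean of the reward vector after t samples (t >= 1)."""
    return [a + (r - a) / t for a, r in zip(avg, reward)]


# Hypothetical example: two objectives, target set [0.5, 1.0] x [0.0, 0.5].
lower, upper = [0.5, 0.0], [1.0, 0.5]
avg = [0.2, 0.9]  # current time-averaged reward vector, outside the set
direction = scalarization_direction(avg, lower, upper)
scalar_reward = scalarize([1.0, 0.0], direction)
```

In this sketch the direction would be recomputed each time the running average changes, and a scalar policy gradient step would then use `scalarize` in place of the vector reward. How often the paper refreshes the scalarization vector relative to its inner loop is not specified in the excerpt above.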
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference
