An Online Multiobjective Policy Gradient for Long-run Average-reward Markov Decision Process

Rahul Misra; Manuela L. Bujorianu; Rafał Wisniewski

arXiv:2511.13034·eess.SY·November 18, 2025

Open Access

TL;DR

This paper introduces an online multi-objective policy gradient algorithm for Markov Decision Processes that guarantees convergence of the long-run average reward vector to a target set using Blackwell's approachability theorem.

Contribution

It develops a novel RL framework with a dynamic scalarization mechanism based on approachability theory for multi-objective optimization.

Findings

01 The long-run average reward vector converges to the target set under an ergodicity assumption.

02 The method handles multiple objectives in the long-run average-reward setting.

03 Numerical validation demonstrates the approach's practical viability.

Abstract

We propose a reinforcement learning (RL) framework for multi-objective decision-making, where the agent seeks to optimize a vector of rewards rather than a single scalar value. The objective is to ensure that the time-averaged reward vector converges asymptotically to a predefined target set. Since standard RL algorithms operate on scalar rewards, we introduce a dynamic scalarization mechanism guided by Blackwell's Approachability Theorem. This theorem enables adaptive updates of the scalarization vector to guarantee convergence toward the target set. Assuming ergodicity, the Markov chain induced by the learned policies admits a stationary distribution, ensuring all states recur with finite return times. Our algorithm exploits this property by defining an inner loop that applies a policy gradient method (with baseline) between successive visits to a designated recurrent state, enforcing…
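The dynamic scalarization described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the box-shaped target set, the two-action reward model, and the names `project_onto_box`, `scalarization_direction`, and `run_demo` are all assumptions made for illustration, and a greedy action choice stands in for the policy gradient inner loop. The sketch only shows the approachability idea of steering the running average toward its projection onto the target set.

```python
import numpy as np


def project_onto_box(x, lower, upper):
    """Euclidean projection of x onto the box [lower, upper] (assumed target set)."""
    return np.clip(x, lower, upper)


def scalarization_direction(avg_reward, lower, upper, eps=1e-12):
    """Unit vector from the running average toward its projection on the box.

    Returns the zero vector when the average already lies inside the box.
    """
    gap = project_onto_box(avg_reward, lower, upper) - avg_reward
    norm = np.linalg.norm(gap)
    return gap / norm if norm > eps else np.zeros_like(gap)


def run_demo(num_steps=5000, seed=0):
    """Hypothetical two-action, two-objective problem (not from the paper)."""
    rng = np.random.default_rng(seed)
    means = np.array([[1.0, 0.0], [0.0, 1.0]])  # mean reward vector per action
    lower = np.array([0.4, 0.4])
    upper = np.array([0.6, 0.6])
    avg = np.zeros(2)
    for t in range(1, num_steps + 1):
        lam = scalarization_direction(avg, lower, upper)
        # Greedy choice on the scalarized mean reward; this replaces the
        # policy gradient inner loop purely to keep the sketch short.
        action = int(np.argmax(means @ lam))
        reward = means[action] + 0.1 * rng.standard_normal(2)
        avg += (reward - avg) / t  # incremental time average
    distance = np.linalg.norm(project_onto_box(avg, lower, upper) - avg)
    return avg, distance


if __name__ == "__main__":
    avg, distance = run_demo()
    print("average reward vector:", avg)
    print("distance to target box:", distance)
```

The scalar reward seen by the learner at each step is the inner product of the current direction with the reward vector, so the direction changes whenever the running average moves relative to the target set.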

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Gaussian Processes and Bayesian Inference