Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning

Swetha Ganesh; Vaneet Aggarwal

arXiv:2603.08518·cs.LG·March 10, 2026

Open Access

TL;DR

This paper addresses gradient bias in policy gradient methods for concave multi-objective reinforcement learning. It proposes an algorithm with optimal sample complexity guarantees that controls this bias with a multi-level Monte Carlo estimator.

Contribution

It introduces a Natural Policy Gradient algorithm with a multi-level Monte Carlo estimator to achieve optimal sample complexity in concave multi-objective RL.
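The multi-level Monte Carlo idea can be illustrated with a minimal sketch. It uses a generic truncated, randomized MLMC construction and is not necessarily the estimator analyzed in the paper. The scalar return, the log utility, and the names `sample_returns`, `grad_log`, `mlmc_gradient`, and `average_estimate` are all assumptions made for illustration.

```python
import numpy as np

def sample_returns(n, rng):
    """Hypothetical return sampler: n noisy returns with true mean 2."""
    return rng.uniform(1.0, 3.0, size=n)

def grad_log(x):
    """Derivative of the illustrative concave utility f(x) = log(x)."""
    return 1.0 / x

def mlmc_gradient(sampler, grad_f, max_level, rng):
    """Truncated randomized MLMC estimate of grad_f(E[return])."""
    # Random level Q with P(Q = q) = 2**-q for q = 1, 2, ...
    level = int(rng.geometric(0.5))
    # Level-0 term: plug-in estimate from a single sample.
    base = grad_f(sampler(1, rng).mean())
    if level > max_level:
        return base
    # Correction term: the difference between plug-in estimates built from
    # 2**level samples and from the first half of those same samples,
    # reweighted by 1 / P(Q = level) = 2**level.
    draws = sampler(2 ** level, rng)
    fine = grad_f(draws.mean())
    coarse = grad_f(draws[: 2 ** (level - 1)].mean())
    return base + (2 ** level) * (fine - coarse)

def average_estimate(n_trials=40_000, max_level=6, seed=0):
    """Average many MLMC estimates to inspect their mean empirically."""
    rng = np.random.default_rng(seed)
    estimates = [
        mlmc_gradient(sample_returns, grad_log, max_level, rng)
        for _ in range(n_trials)
    ]
    return float(np.mean(estimates))

if __name__ == "__main__":
    print("true gradient:", grad_log(2.0))
    print("mean MLMC estimate:", average_estimate())
```

In expectation, the correction terms telescope, so this estimator has the same mean as a plug-in estimate built from `2 ** max_level` samples, while each call draws only about `max_level + 1` samples on average.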

Findings

01

Achieves $\tilde{O}(1/\epsilon^{2})$ sample complexity for $\epsilon$-optimal policies.

02

Identifies the bias barrier in existing methods and overcomes it with a multi-level Monte Carlo (MLMC) estimator.

03

Shows that second-order smoothness of $f$ cancels the bias, enabling simpler algorithms to reach the optimal rate.
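One standard way to see why extra smoothness helps is a Taylor expansion of the plug-in gradient. The following is a sketch under stated assumptions, not the paper's own derivation: $g = \partial f$, the estimate $\hat{J}$ from $n$ trajectories is unbiased with $\mathbb{E}\|\hat{J} - J\|^{2} \le \sigma^{2}/n$, and the constants $L_1$ (Lipschitz constant of $g$) and $L_2$ (Lipschitz constant of $\nabla g$) are hypothetical.

```latex
% First-order smoothness only: the bias is controlled by the estimation error.
\bigl\| \mathbb{E}[g(\hat{J})] - g(J) \bigr\|
  \le L_1 \, \mathbb{E}\|\hat{J} - J\|
  \le \frac{L_1 \sigma}{\sqrt{n}}

% Second-order smoothness: expand g around J with remainder R.
g(\hat{J}) = g(J) + \nabla g(J)\,(\hat{J} - J) + R,
\qquad \|R\| \le \frac{L_2}{2}\,\|\hat{J} - J\|^{2}

% The linear term vanishes in expectation because E[\hat{J}] = J.
\bigl\| \mathbb{E}[g(\hat{J})] - g(J) \bigr\|
  = \bigl\| \mathbb{E}[R] \bigr\|
  \le \frac{L_2}{2}\,\mathbb{E}\|\hat{J} - J\|^{2}
  \le \frac{L_2 \sigma^{2}}{2n}
```

Under these assumptions the bias bound improves from order $1/\sqrt{n}$ to order $1/n$ without any change to the estimator.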

Abstract

While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility $f(J_{1}^{\pi}, \dots, J_{M}^{\pi})$ over multiple objectives, where each $J_{m}^{\pi}$ denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on $\partial f(J^{\pi})$, while in practice only empirical return estimates $\hat{J}$ are available. Because $f$ is nonlinear, the plug-in estimator is biased ($\mathbb{E}[\partial f(\hat{J})] \neq \partial f(\mathbb{E}[\hat{J}])$), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized…
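The plug-in bias described above can be checked numerically with a small simulation. This is a sketch under assumptions, not the paper's setup: a single objective, the concave utility $f(x) = \log x$, and returns drawn uniformly from $[1, 3]$. The function name `plug_in_gradient_bias` is hypothetical.

```python
import numpy as np

def plug_in_gradient_bias(n_samples, n_trials=200_000, seed=0):
    """Monte Carlo estimate of E[f'(J_hat)] - f'(J) for f(x) = log(x)."""
    rng = np.random.default_rng(seed)
    true_return = 2.0  # assumed expected return J
    # Each trial averages n_samples noisy returns (uniform on [1, 3], mean 2).
    returns = rng.uniform(1.0, 3.0, size=(n_trials, n_samples))
    j_hat = returns.mean(axis=1)
    # f'(x) = 1/x is nonlinear, so E[1 / J_hat] differs from 1 / E[J_hat].
    return float((1.0 / j_hat).mean() - 1.0 / true_return)

if __name__ == "__main__":
    for n in (1, 4, 16):
        print(n, plug_in_gradient_bias(n))
```

Because $1/x$ is convex on the positive reals, Jensen's inequality makes the bias positive in this toy setting. It shrinks as more returns are averaged but does not vanish for any finite sample size.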

Taxonomy

Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research