Breaking the Bias Barrier in Concave Multi-Objective Reinforcement Learning
Swetha Ganesh, Vaneet Aggarwal

TL;DR
This paper addresses gradient bias in policy gradient methods for concave multi-objective reinforcement learning and proposes an algorithm that achieves optimal sample complexity by controlling that bias with a multi-level Monte Carlo estimator.
Contribution
It introduces a Natural Policy Gradient algorithm with a multi-level Monte Carlo (MLMC) estimator to achieve optimal sample complexity in concave multi-objective RL.
Findings
Achieves $\tilde{O}(\frac{1}{\epsilon^2})$ sample complexity for $\epsilon$-optimal policies.
Identifies the bias barrier in existing methods and overcomes it with an MLMC estimator.
Shows that second-order smoothness cancels the bias, enabling simpler algorithms to reach optimal rates.
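To make the MLMC idea concrete, here is a minimal sketch of a generic truncated multi-level Monte Carlo estimator for the derivative of a concave utility at an expected return. It is not taken from the paper: the utility `log(J)`, the uniform return distribution, and the names `grad_f`, `sample_return`, and `mlmc_gradient` are all assumptions chosen for illustration, and the paper's estimator inside Natural Policy Gradient may differ in its details.

```python
import random


def grad_f(j):
    # Assumed utility f(J) = log(J), so f'(J) = 1 / J.
    return 1.0 / j


def sample_return(rng):
    # Hypothetical noisy return with true mean 1.0 (an assumption).
    return rng.uniform(0.5, 1.5)


def mlmc_gradient(rng, max_level=5):
    """One draw of a truncated MLMC estimate of f'(E[return]).

    A level j >= 1 is drawn with probability 2**-j. In expectation the
    estimate matches the plug-in estimate built from 2**max_level samples,
    while the expected number of samples grows only linearly in max_level.
    """
    level = 1
    while rng.random() < 0.5:
        level += 1

    n = 2 ** level if level <= max_level else 1
    samples = [sample_return(rng) for _ in range(n)]

    base = grad_f(samples[0])  # level-0 term: plug-in with one sample
    if level > max_level:
        return base

    fine = grad_f(sum(samples) / n)                        # 2**level samples
    coarse = grad_f(sum(samples[: n // 2]) / (n // 2))     # first half only
    # Dividing by the level probability 2**-level equals multiplying by n.
    return base + n * (fine - coarse)


def average(fn, trials, seed=0):
    rng = random.Random(seed)
    return sum(fn(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    true_grad = 1.0  # f'(1.0) for the assumed setup
    naive = average(lambda r: grad_f(sample_return(r)), 100_000)
    mlmc = average(mlmc_gradient, 100_000)
    print("single-sample plug-in bias:", naive - true_grad)
    print("truncated MLMC bias:       ", mlmc - true_grad)
```

The telescoping sum over levels is what removes most of the bias: each level's correction is reweighted by the inverse of its sampling probability, so the cheap low levels are drawn often and the expensive high levels rarely.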
Abstract
While standard reinforcement learning optimizes a single reward signal, many applications require optimizing a nonlinear utility over multiple objectives, where each objective denotes the expected discounted return of a distinct reward function. A common approach is concave scalarization, which captures important trade-offs such as fairness and risk sensitivity. However, nonlinear scalarization introduces a fundamental challenge for policy gradient methods: the gradient depends on the derivative of the utility at the true expected returns, while in practice only empirical return estimates are available. Because the utility is nonlinear, the plug-in estimator is biased (the expected derivative at the estimated returns differs from the derivative at the expected returns), leading to persistent gradient bias that degrades sample complexity. In this work we identify and overcome this bias barrier in concave-scalarized…
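The plug-in bias described above can be checked numerically with a small sketch. Everything specific here is an assumption for illustration and not taken from the paper: a single objective, the utility `log(J)` (so the derivative is `1/J`), returns drawn uniformly from `[0.5, 1.5]` with true mean 1.0, and the hypothetical helper name `plugin_gradient_bias`.

```python
import random


def plugin_gradient_bias(n, trials=20_000, seed=0):
    """Monte Carlo estimate of E[f'(J_hat)] - f'(J) for f(J) = log(J).

    J_hat is the average of n noisy returns with true mean J = 1.0, so the
    true derivative is f'(1.0) = 1.0. Both the utility and the return
    distribution are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    true_grad = 1.0
    total = 0.0
    for _ in range(trials):
        j_hat = sum(rng.uniform(0.5, 1.5) for _ in range(n)) / n
        total += 1.0 / j_hat  # plug-in derivative at the empirical return
    return total / trials - true_grad


if __name__ == "__main__":
    for n in (4, 32):
        print(f"n={n:>2}  estimated plug-in bias = {plugin_gradient_bias(n):.4f}")
```

Because `1/J` is convex, Jensen's inequality makes the plug-in derivative overshoot on average. The gap shrinks as more returns are averaged, but it does not vanish for any fixed sample size.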
Taxonomy
Topics: Reinforcement Learning in Robotics · Stochastic Gradient Optimization Techniques · Advanced Bandit Algorithms Research
