A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets

Akihiro Kubo; Kosuke Nakanishi; Shin Ishii

arXiv:2605.08946·cs.LG·May 12, 2026

TL;DR

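This paper introduces a single deep preference-conditioned policy for multi-objective reinforcement learning. It is grounded in uniqueness and Lipschitz-continuity results for smooth Tchebycheff scalarization, which support preference sweeping toward dense Pareto-front coverage, and it reports strong empirical performance on MO-Gymnasium and continuous-control tasks.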

Contribution

It provides a theoretical foundation for preference-to-solution correspondence under nonlinear scalarization and develops a deep actor-critic algorithm with policy continuity guarantees.

Findings

1. Achieves the best average hypervolume rank among recent baselines.
2. Demonstrates strong expected-utility performance on MO-Gymnasium tasks.
3. Shows gains in continuous-control experiments.

Abstract

Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an $O(1/k)$ objective-suboptimality rate. We…
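
The abstract does not state the exact form of the smooth Tchebycheff utility, so the following is only a minimal sketch under assumptions: it uses the common log-sum-exp smoothing of the Tchebycheff maximum, a hypothetical reference (ideal) point `reference`, and a smoothing parameter `mu`. The function names and the toy return vectors are illustrative and do not come from the paper. The sketch shows how a preference vector selects one return vector from a candidate set, which is the preference-to-solution correspondence the abstract studies.

```python
import math


def hard_tchebycheff_utility(returns, weights, reference):
    """Negative weighted Chebyshev distance to the reference point (higher is better)."""
    return -max(w * (z - v) for v, w, z in zip(returns, weights, reference))


def smooth_tchebycheff_utility(returns, weights, reference, mu=0.1):
    """Log-sum-exp smoothing of the utility above (an assumed form, not the paper's).

    For positive weights it is concave and increasing in each return,
    and it approaches the hard version as mu goes to 0.
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    scaled = [w * (z - v) / mu for v, w, z in zip(returns, weights, reference)]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -mu * log_sum_exp


def best_for_preference(candidates, weights, reference, mu=0.1):
    """Return the candidate return vector with the highest smooth utility."""
    return max(
        candidates,
        key=lambda v: smooth_tchebycheff_utility(v, weights, reference, mu),
    )


# Toy two-objective example: three hypothetical return vectors.
candidates = [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)]
reference = (1.0, 1.0)

balanced = best_for_preference(candidates, (0.5, 0.5), reference)
skewed = best_for_preference(candidates, (0.9, 0.1), reference)
print(balanced, skewed)
```

In this toy setting a balanced preference selects the compromise vector, while a preference weighted toward the first objective selects the vector that is strongest on that objective. Sweeping the weights over a grid is the simplest form of the preference sweeping mentioned in the abstract.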
