A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
Akihiro Kubo, Kosuke Nakanishi, Shin Ishii

TL;DR
This paper introduces a single deep preference-conditioned policy for multi-objective reinforcement learning, designed for dense Pareto-front coverage via preference sweeping, and reports strong empirical performance on MO-Gymnasium and continuous-control tasks.
Contribution
It provides a theoretical foundation for preference-to-solution correspondence under nonlinear scalarization and develops a deep actor-critic algorithm with policy continuity guarantees.
Findings
Achieves the best average hypervolume rank among recent baselines.
Demonstrates strong expected-utility performance on MO-Gymnasium tasks.
Shows gains in continuous-control experiments.
Abstract
Preference-conditioned multi-objective reinforcement learning aims to learn a single policy that captures trade-offs across preferences, but under nonlinear scalarization the uniqueness and continuity of the preference-to-solution correspondence remain unclear. We study this problem in tabular multi-objective Markov decision processes (MDPs) using smooth Tchebycheff scalarization as a monotone utility. Under mild interior conditions on the preference set, we prove that each preference induces a unique Pareto-optimal return vector and that this vector depends Lipschitz-continuously on the preference, providing a principled foundation for preference sweeping toward dense Pareto-front coverage. To compute these targets, we formulate the problem over occupancy measures and derive Concave Mirror Descent Policy Iteration (CMDPI), which achieves an objective-suboptimality rate. We…
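To make the scalarization in the abstract concrete, here is a minimal Python sketch of a smooth Tchebycheff utility and a preference sweep over a handful of candidate return vectors. It is an illustration under stated assumptions, not the paper's implementation. The log-sum-exp form, the ideal point `ideal`, the smoothing parameter `mu`, and the helper names `smooth_tchebycheff_utility` and `best_return` are all assumptions introduced here, and the candidate returns are made-up numbers.

```python
import math


def smooth_tchebycheff_utility(returns, weights, ideal, mu=0.05):
    """Smooth Tchebycheff utility of a return vector (higher is better).

    Assumed form (not taken from the paper):
        u = -mu * log(sum_i exp(w_i * (z_i - J_i) / mu))
    where J is the return vector, w the preference weights, and z an
    ideal point. As mu -> 0 this approaches -max_i w_i * (z_i - J_i).
    """
    scaled = [w * (z - j) / mu for j, w, z in zip(returns, weights, ideal)]
    shift = max(scaled)  # shift for a numerically stable log-sum-exp
    lse = shift + math.log(sum(math.exp(s - shift) for s in scaled))
    return -mu * lse


def hard_tchebycheff_utility(returns, weights, ideal):
    """Non-smooth Tchebycheff utility, for comparison."""
    return -max(w * (z - j) for j, w, z in zip(returns, weights, ideal))


def best_return(candidates, weights, ideal, mu=0.05):
    """Pick the candidate return vector with the highest smooth utility."""
    return max(
        candidates,
        key=lambda r: smooth_tchebycheff_utility(r, weights, ideal, mu),
    )


# Hypothetical two-objective return vectors and ideal point.
CANDIDATES = [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)]
IDEAL = (1.0, 1.0)

if __name__ == "__main__":
    # Sweep strictly positive (interior) preferences over the candidates.
    for w1 in (0.1, 0.3, 0.5, 0.7, 0.9):
        weights = (w1, 1.0 - w1)
        print(weights, best_return(CANDIDATES, weights, IDEAL))
```

Sweeping the weights moves the selected return vector along the candidate set, which is the idea behind preference sweeping. The utility is strictly increasing in every return component whenever the weights are strictly positive, which is what "monotone utility" means here.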
