Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization
Tanmay Ambadkar, Sourav Panda, Shreyash Kale, Jonathan Dodge, Abhinav Verma

TL;DR
This paper introduces D3PO, a PPO-based framework for multi-objective reinforcement learning that improves Pareto front discovery by decomposing optimization, stabilizing training, and encouraging diversity, outperforming prior methods.
Contribution
D3PO reorganizes multi-objective policy optimization to address gradient interference and representational collapse, enabling reliable and diverse Pareto front discovery with a single policy.
Findings
D3PO outperforms prior methods on standard MORL benchmarks.
D3PO discovers broader and higher-quality Pareto fronts.
D3PO matches or exceeds state-of-the-art hypervolume and utility metrics.
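Hypervolume, the metric named in the findings above, measures the region of objective space dominated by a set of solutions relative to a reference point, so a broader and higher-quality front yields a larger value. The following is a minimal sketch for the two-objective case, assuming both objectives are maximized; the function name `hypervolume_2d` and the choice of reference point are illustrative and are not taken from the paper.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives maximized)."""
    # Keep only points that strictly improve on the reference in both objectives.
    pts = [(x, y) for x, y in points if x > ref[0] and y > ref[1]]
    # Sweep from the largest first objective to the smallest.
    pts.sort(key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        # A point adds area only if it raises the best second objective seen so far;
        # dominated points fail this check and contribute nothing.
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


front = [(3.0, 1.0), (2.0, 2.0), (1.0, 3.0)]
print(hypervolume_2d(front, ref=(0.0, 0.0)))
```

Adding a dominated point such as `(1.0, 1.0)` leaves the value unchanged, while adding a new non-dominated point increases it, which is why the metric rewards both coverage and quality of the front.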
Abstract
Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce D3PO, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. D3PO preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces…
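The contrast between premature scalarization and integrating preferences only after stabilization can be illustrated with a toy example. This is a sketch under stated assumptions, not the authors' implementation: it assumes that "stabilization" means standardizing each objective's advantage estimates separately, and the function names `early_scalarized` and `late_weighted` are hypothetical.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Standardize along the batch axis (per column for 2-D input)."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def early_scalarized(adv, w):
    # Assumed baseline: collapse objectives with the preference weights first,
    # then normalize the single scalar signal.
    return normalize((adv * w).sum(axis=1))


def late_weighted(adv, w):
    # Assumed decomposed variant: normalize each objective separately,
    # then apply the preference weights.
    return (normalize(adv) * w).sum(axis=1)


# Toy batch: objective 0 has a scale 100x larger than objective 1.
rng = np.random.default_rng(0)
adv = rng.normal(size=(1000, 2)) * np.array([100.0, 1.0])
w = np.array([0.5, 0.5])  # equal preference for both objectives

early = early_scalarized(adv, w)
late = late_weighted(adv, w)

# How much of each combined signal reflects the small-scale objective?
corr_early = np.corrcoef(early, adv[:, 1])[0, 1]
corr_late = np.corrcoef(late, adv[:, 1])[0, 1]
print(f"early: {corr_early:.2f}, late: {corr_late:.2f}")
```

In this toy setting the early-scalarized signal is almost uncorrelated with the small-scale objective even though the preference weights are equal, whereas the late-weighted signal reflects both objectives. The paper's actual pipeline may differ in how and where the weighting is applied.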
Peer Reviews
Decision·Submitted to ICLR 2026
The paper clearly identifies two challenges: mode collapse and conflicting objectives. The proposed solutions are technically sound, and the simulations provided demonstrate the effectiveness of the proposed method.
There is limited discussion of the hyperparameters. The method also introduces many parameters, which may make it hard to scale to real MORL tasks. Examples of policy behaviors (e.g., different strategies emerging for different preferences) would help validate “behavioral diversity” more intuitively. The performance gain might be marginal compared to previous methods.
The paper presents a technically sound extension of PPO for multi-objective reinforcement learning through a preference-conditioned framework. The proposed multi-head critic with Late-Stage Weighting (LSW) and scaled diversity regularization are well-motivated and supported by reasonable theoretical intuition. The experimental results are generally good, covering both discrete and continuous multi-objective tasks and showing improvements.
The proposed ideas are promising, but the conceptual structure and method description in Section 4 could be clearer. While Figure 1 appears to illustrate the overall framework, it is never explicitly referenced or discussed in the paper, which makes it harder to connect the algorithmic details to the visual explanation. The section would benefit from clearer guidance and a stronger link between the conceptual figure and the mathematical formulation to help readers follow the proposed mechanisms.
- The paper is clear and provides an algorithm that is well-grounded.
- The experiments are extensive, performed on multiple benchmark environments for discrete and continuous action spaces, with multiple relevant baselines.
- The results are competitive with or outperform the baselines on multiple multi-objective metrics (hypervolume, expected utility, sparsity).
My concern is that the paper claims three contributions for its algorithm: 1) multi-head critic, 2) late-weighting loss, 3) diversity regularization. 1. Using a multi-head critic is common in many multi-policy algorithms (MORL-baselines [1] uses a multi-head critic in MOPPO, and this was already used in early deep MORL work [2]), and is thus not a contribution specific to this paper. 2. As far as I understand (I would appreciate the authors correcting me otherwise), the late-weighting lo
Taxonomy
Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research
