Preference Conditioned Multi-Objective Reinforcement Learning: Decomposed, Diversity-Driven Policy Optimization

Tanmay Ambadkar; Sourav Panda; Shreyash Kale; Jonathan Dodge; Abhinav Verma

arXiv:2602.07764·cs.LG·February 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces D3PO, a PPO-based framework for multi-objective reinforcement learning that improves Pareto front discovery by decomposing optimization, stabilizing training, and encouraging diversity, outperforming prior methods.

Contribution

D3PO reorganizes multi-objective policy optimization to address gradient interference and representational collapse, enabling reliable and diverse Pareto front discovery with a single policy.

Findings

01 D3PO outperforms prior methods on standard MORL benchmarks.

02 D3PO discovers broader and higher-quality Pareto fronts.

03 D3PO matches or exceeds state-of-the-art hypervolume and utility metrics.
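For readers unfamiliar with the metrics named above, the following is a minimal sketch of how hypervolume and expected utility are commonly computed for a two-objective maximization problem. It is not taken from the paper: the function names, the reference point, the sample returns, and the set of preference weights are all illustrative assumptions, and the paper's exact evaluation protocol is not given on this page.

```python
def pareto_front(points):
    """Return the non-dominated points, assuming every objective is maximized."""
    front = []
    for p in points:
        dominated = any(
            all(q[i] >= p[i] for i in range(len(p)))
            and any(q[i] > p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def hypervolume_2d(points, ref):
    """Area dominated by the front and bounded below by the reference point."""
    # Sorted by the first objective (descending), the second objective
    # increases along the front, so the area splits into horizontal slabs.
    front = sorted(set(pareto_front(points)), key=lambda p: p[0], reverse=True)
    hv = 0.0
    prev_y = ref[1]
    for x, y in front:
        if x <= ref[0] or y <= ref[1]:
            continue  # this point does not dominate the reference point
        hv += (x - ref[0]) * (y - prev_y)
        prev_y = y
    return hv


def expected_utility(points, weights):
    """Mean over preference vectors of the best linear utility in the set."""
    best = [max(sum(wi * vi for wi, vi in zip(w, p)) for p in points) for w in weights]
    return sum(best) / len(best)


# Hypothetical per-policy returns (objective 1, objective 2); (1, 1) is dominated.
returns = [(3, 1), (2, 2), (1, 3), (1, 1)]
print(hypervolume_2d(returns, ref=(0, 0)))  # 6.0
print(expected_utility(returns, [(1, 0), (0.5, 0.5), (0, 1)]))
```

A broader front raises both numbers: hypervolume grows with the area the front covers, and expected utility grows when some policy scores well under each sampled preference.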

Abstract

Multi-objective reinforcement learning (MORL) seeks to learn policies that balance multiple, often conflicting objectives. Although a single preference-conditioned policy is the most flexible and scalable solution, existing approaches remain brittle in practice, frequently failing to recover complete Pareto fronts. We show that this failure stems from two structural issues in current methods: destructive gradient interference caused by premature scalarization and representational collapse across the preference space. We introduce $D^3PO$, a PPO-based framework that reorganizes multi-objective policy optimization to address these issues directly. $D^3PO$ preserves per-objective learning signals through a decomposed optimization pipeline and integrates preferences only after stabilization, enabling reliable credit assignment. In addition, a scaled diversity regularizer enforces…
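The contrast between premature scalarization and integrating preferences after stabilization can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the paper's algorithm: it assumes that "stabilization" means normalizing each objective's advantage separately before mixing by the preference vector, and the function names and toy numbers are hypothetical.

```python
import numpy as np


def normalize(x, eps=1e-8):
    """Zero-mean, unit-variance normalization along the batch axis."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)


def early_scalarized_advantage(adv, w):
    """Scalarize first: mix raw per-objective advantages, then normalize once."""
    return normalize(adv @ w)


def late_weighted_advantage(adv, w):
    """Assumed late weighting: normalize each objective separately, then mix."""
    return normalize(adv) @ w


# Toy batch: 4 samples, 2 objectives on very different scales (made-up numbers).
adv = np.array([[100.0, -1.0], [-100.0, 1.0], [100.0, 1.0], [-100.0, -1.0]])
w = np.array([0.5, 0.5])  # equal preference over the two objectives

early = early_scalarized_advantage(adv, w)
late = late_weighted_advantage(adv, w)
print("early:", np.round(early, 3))
print("late: ", np.round(late, 3))
```

In this toy batch the early-scalarized signal follows the sign of the large-scale objective on every sample, so the second objective barely influences the update despite the equal preference. Under the assumed late weighting, both objectives contribute equally, and samples where they disagree cancel out.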

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper clearly identifies two challenges: mode collapse and conflicting objectives. The proposed solutions are technically sound. The simulations provided demonstrate the effectiveness of the proposed method.

Weaknesses

There is limited discussion of the hyperparameters. Also, the method introduces many parameters, which may make it hard to scale to real MORL tasks. Examples of policy behaviors (e.g., different strategies emerging for different preferences) would help validate “behavioral diversity” more intuitively. The performance gain might be marginal compared to previous methods.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper presents a technically sound extension of PPO for multi-objective reinforcement learning through a preference-conditioned framework. The proposed multi-head critic with Late-Stage Weighting (LSW) and scaled diversity regularization are well-motivated and supported by reasonable theoretical intuition. The experimental results are generally good, covering both discrete and continuous multi-objective tasks and showing improvements.

Weaknesses

The proposed ideas are promising, but the conceptual structure and method description in Section 4 could be clearer. While Figure 1 appears to illustrate the overall framework, it is never explicitly referenced or discussed in the paper, which makes it harder to connect the algorithmic details to the visual explanation. The section would benefit from clearer guidance and stronger linkage between the conceptual figure and the mathematical formulation to help readers follow the proposed mechanisms

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper is clear and provides an algorithm that is well-grounded
- The experiments are extensive, performed on multiple benchmark environments for discrete action-spaces and continuous action-spaces, with multiple relevant baselines
- The results are competitive or outperform the baselines on multiple multi-objective metrics (hypervolume, expected utility, sparsity)

Weaknesses

My concern is that the paper claims three contributions for its algorithm: 1) a multi-head critic, 2) a late-weighting loss, and 3) diversity regularization.
1. Using a multi-head critic is common in many multi-policy algorithms (MORL-baselines [1] uses a multi-head critic in MOPPO, and this was already used in early deep MORL work [2]), and is thus not a contribution specific to this paper.
2. As far as I understand (I would appreciate the authors correcting me otherwise), the late-weighting lo

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Multi-Objective Optimization Algorithms · Advanced Bandit Algorithms Research