Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach

Utsav Singh; Souradip Chakraborty; Wesley A. Suttle; Brian M. Sadler; Derrik E. Asher; Anit Kumar Sahu; Mubarak Shah; Vinay P. Namboodiri; Amrit Singh Bedi

arXiv:2411.00361·cs.LG·April 17, 2026

TL;DR

This paper introduces DIPPER, a hierarchical reinforcement learning framework that trains the higher-level policy with direct preference optimization to address non-stationarity and infeasible subgoals, improving performance by up to 40% over baselines.

Contribution

DIPPER formulates goal-conditioned HRL as a bi-level optimization problem and employs preference-based training to enhance stability and feasibility in hierarchical policies.

Findings

01

DIPPER achieves up to 40% performance improvements over baselines.

02

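By learning from stationary preference comparisons over subgoal sequences, the framework mitigates the non-stationarity that an evolving lower-level policy introduces in hierarchical RL.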

03

It reduces infeasible subgoal generation in complex tasks.

Abstract

Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from stationary preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on hierarchical learning. To address infeasible subgoals, DIPPER…
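The preference-based training described in the abstract can be illustrated with the standard DPO loss applied to a pair of subgoal sequences. This is a minimal sketch, not the paper's objective: the abstract is truncated, so DIPPER's exact loss and its choice of reference policy are not shown here. The function names `sequence_log_prob` and `dpo_loss`, the temperature `beta`, and the assumption that a sequence's log-probability is the sum of its per-subgoal log-probabilities are all illustrative choices.

```python
import math


def sequence_log_prob(step_log_probs):
    """Log-probability of a subgoal sequence, assuming it factorizes
    into a sum of per-subgoal log-probabilities (an assumption)."""
    return sum(step_log_probs)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: log-probs of the preferred / rejected subgoal
    sequence under the higher-level policy being trained.
    ref_logp_w / ref_logp_l: the same quantities under a fixed
    reference policy (hypothetical here).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) equals softplus(-margin); this form is
    # numerically stable for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Usage: hypothetical per-subgoal log-probs for two sequences.
preferred = sequence_log_prob([-0.4, -0.3, -0.3])   # policy, preferred
rejected = sequence_log_prob([-1.0, -1.2, -0.8])    # policy, rejected
ref_preferred = sequence_log_prob([-0.7, -0.6, -0.7])
ref_rejected = sequence_log_prob([-0.7, -0.6, -0.7])
loss = dpo_loss(preferred, rejected, ref_preferred, ref_rejected)
```

Because the preference labels and the reference log-probabilities are fixed, the training signal in this sketch does not change as a lower-level policy evolves, which is the property the abstract relies on to mitigate non-stationarity.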

Videos

Direct Preference Optimization for Primitive-Enabled Hierarchical RL: A Bilevel Approach· slideslive