TL;DR
This paper introduces DIPPER, a hierarchical reinforcement learning (HRL) framework that trains the higher-level policy with direct preference optimization (DPO) to address non-stationarity and infeasible subgoals, with reported improvements of up to 40% over baselines.
Contribution
DIPPER formulates goal-conditioned HRL as a bi-level optimization problem and employs preference-based training to enhance stability and feasibility in hierarchical policies.
Findings
DIPPER improves performance by up to 40% over baselines.
The framework effectively mitigates non-stationarity in hierarchical RL.
It reduces infeasible subgoal generation in complex tasks.
Abstract
Hierarchical reinforcement learning (HRL) enables agents to solve complex, long-horizon tasks by decomposing them into manageable sub-tasks. However, HRL methods face two fundamental challenges: (i) non-stationarity caused by the evolving lower-level policy during training, which destabilizes higher-level learning, and (ii) the generation of infeasible subgoals that lower-level policies cannot achieve. To address these challenges, we introduce DIPPER, a novel HRL framework that formulates goal-conditioned HRL as a bi-level optimization problem and leverages direct preference optimization (DPO) to train the higher-level policy. By learning from stationary preference comparisons over subgoal sequences rather than rewards that depend on the evolving lower-level policy, DIPPER mitigates the impact of non-stationarity on hierarchical learning. To address infeasible subgoals, DIPPER…
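To make the preference-based training idea concrete, here is a minimal sketch of the standard DPO loss applied to a single pair of subgoal sequences. It is an illustration under stated assumptions, not DIPPER's actual objective: the truncated abstract does not give the exact loss, so the reference-policy terms, the `beta` temperature, and the helper names `sequence_logprob` and `dpo_loss` are all hypothetical choices made for this example.

```python
import math


def sequence_logprob(step_logprobs):
    """Log-probability of a subgoal sequence, assuming it factorizes
    into a sum of per-step log-probabilities (an assumption of this sketch)."""
    return sum(step_logprobs)


def dpo_loss(logp_preferred, logp_rejected,
             ref_logp_preferred, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    The loss is -log(sigmoid(margin)), where margin is beta times the
    difference between the policy-vs-reference log-ratios of the preferred
    and rejected sequences. DIPPER's own objective may differ.
    """
    margin = beta * ((logp_preferred - ref_logp_preferred)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-step log-probabilities of two subgoal sequences under the
# higher-level policy and under a frozen reference policy.
preferred = sequence_logprob([-0.5, -0.7, -0.4])
rejected = sequence_logprob([-1.2, -1.5, -0.9])
ref_preferred = sequence_logprob([-0.9, -0.9, -0.9])
ref_rejected = sequence_logprob([-0.9, -0.9, -0.9])

loss = dpo_loss(preferred, rejected, ref_preferred, ref_rejected)
```

The point of the sketch is that the loss depends only on which of two subgoal sequences is preferred, not on a reward computed from the current lower-level policy, which is the property the abstract credits for reducing non-stationarity.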