DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning

Utsav Singh; Souradip Chakraborty; Wesley A. Suttle; Brian M. Sadler; Vinay P Namboodiri; Amrit Singh Bedi

arXiv:2406.10892·cs.LG·January 3, 2025

TL;DR

DIPPER introduces a hierarchical reinforcement learning method that uses direct preference optimization and primitive-informed regularization to efficiently learn complex robotics tasks from limited human preferences, outperforming existing methods.

Contribution

The paper presents DIPPER, a novel hierarchical RL approach combining direct preference optimization with primitive-informed regularization, addressing efficiency and subgoal generation issues.

Findings

1. DIPPER outperforms hierarchical and non-hierarchical baselines.

2. It improves computational efficiency over standard preference-based methods.

3. It mitigates non-stationarity and infeasible subgoal issues in hierarchical RL.

Abstract

Learning control policies to perform complex robotics tasks from human preference data presents significant challenges. On the one hand, the complexity of such tasks typically requires learning policies to perform a variety of subtasks, then combining them to achieve the overall goal. At the same time, comprehensive, well-engineered reward functions are typically unavailable in such problems, while limited human preference data often is; making efficient use of such data to guide learning is therefore essential. Methods for learning to perform complex robotics tasks from human preference data must overcome both these challenges simultaneously. In this work, we introduce DIPPER: Direct Preference Optimization to Accelerate Primitive-Enabled Hierarchical Reinforcement Learning, an efficient hierarchical approach that leverages direct preference optimization to learn a higher-level policy…
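
The abstract names direct preference optimization (DPO) but is cut off before giving an objective. As a rough illustration only, the sketch below computes the generic DPO loss for a single preference pair. It assumes we already have log-probabilities of a preferred and a rejected higher-level output under the policy being trained and under a fixed reference policy. The function name, its arguments, and the default `beta` are hypothetical, and DIPPER's actual objective, including its primitive-informed regularization, may differ from this generic form.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) preference pair.

    Hypothetical helper for illustration; not taken from the DIPPER paper.
    All inputs are scalar log-probabilities; beta scales the implicit reward.
    """
    # Implicit reward margin: how much more the trained policy favors the
    # chosen sample over the rejected one, relative to the reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained policy and the reference agree on both samples, the margin is zero and the loss equals log 2. The loss falls as the trained policy shifts probability toward the preferred sample.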

Taxonomy

Topics: Smart Parking Systems Research