Offline Hierarchical Reinforcement Learning via Inverse Optimization

Carolin Schmidt; Daniele Gammelli; James Harrison; Marco Pavone; Filipe Rodrigues

arXiv:2410.07933·cs.LG·March 19, 2025

TL;DR

This paper introduces OHIO, a novel offline hierarchical reinforcement learning framework that recovers unobservable high-level actions via inverse optimization, enabling effective training of hierarchical policies from static datasets.

Contribution

The paper proposes a new inverse optimization-based approach for offline hierarchical RL that addresses unobservable actions and policy structure mismatches.

Findings

1. Outperforms end-to-end RL methods in robotic tasks
2. Enhances robustness of hierarchical policies
3. Effective in network optimization problems

Abstract

Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical…
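
To make the inverse problem in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes a linear one-step dynamics model s' = A s + B u and a proportional low-level controller u = K (g - s), where the goal g plays the role of the unobserved high-level action. The matrices `A`, `B`, `gain` and all function names are hypothetical, chosen only for illustration. Under these assumptions, recovering g from an observed transition (s, s') reduces to a linear least-squares problem, which matches the maximum-likelihood choice when the transition noise is Gaussian.

```python
import numpy as np


def low_level_controller(state, goal, gain):
    """Hypothetical low-level policy: proportional control toward a goal."""
    return gain @ (goal - state)


def dynamics(state, control, A, B):
    """Assumed approximate one-step model: s' = A s + B u."""
    return A @ state + B @ control


def recover_high_level_action(state, next_state, A, B, gain):
    """Solve min_g || dynamics(s, controller(s, g)) - s' ||^2 for the goal g.

    Substituting the controller into the model gives
    s' = A s + M g - M s with M = B @ gain, so M g = s' - A s + M s.
    """
    M = B @ gain
    target = next_state - A @ state + M @ state
    goal, *_ = np.linalg.lstsq(M, target, rcond=None)
    return goal


def relabel_trajectory(states, A, B, gain):
    """Recover one high-level action per observed transition in a trajectory."""
    pairs = zip(states[:-1], states[1:])
    return np.array(
        [recover_high_level_action(s, s_next, A, B, gain) for s, s_next in pairs]
    )


if __name__ == "__main__":
    # Toy 2-D system; all numbers are illustrative assumptions.
    A = np.eye(2)
    B = 0.1 * np.eye(2)
    gain = 2.0 * np.eye(2)
    true_goal = np.array([1.0, -1.0])

    # Roll the low-level controller forward to create state-only data.
    states = [np.array([0.5, -0.3])]
    for _ in range(5):
        u = low_level_controller(states[-1], true_goal, gain)
        states.append(dynamics(states[-1], u, A, B))

    print(relabel_trajectory(np.array(states), A, B, gain))
```

The relabeled goals could then be paired with the observed states to form a dataset for a standard offline RL algorithm. With a nonlinear model or controller, the closed-form solve would have to be replaced by a numerical optimizer.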

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The approach to discovering the high-level actions given a set of state transitions is simple: select the high-level action with the highest likelihood.
- Significant empirical outperformance versus baselines.
- Empirical performance was robust to different controllers and modeling errors.
- Applied the algorithm to domains with very high-dimensional action spaces (i.e., the network optimization problems).

Weaknesses

The approach does appear to have significant limitations.

1. The approach assumes access to an approximate model of the transition dynamics. The authors argue this can be relatively simple to learn because it only requires a single step, but this can be difficult in high-dimensional and stochastic settings.
2. The approach often assumes a pre-trained low-level controller.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

Very well written, clearly motivated, thoroughly validated both empirically and analytically. To the best of my knowledge this work addresses an important gap in the literature, the possibility to learn high-level actions that parameterize arbitrary low-level policies from state-only trajectories (and assuming approximate knowledge about transition dynamics).

Weaknesses

Not much here, to be honest. The only thing that comes to mind is that it was not immediately clear to me how to solve Eq. (6) in practice and in the general case, but this becomes very clear with Appendix A.4 and Algorithms 2 and 3. Perhaps the content of A.4 could be better integrated into the main method section. It seems to me that the method mainly deals with the implicit policy case; I am therefore not sure how much value is added by discussing the explicit policy case.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. The paper presents an interesting approach for effectively leveraging the inherent hierarchical structure for downstream learning.
2. The authors show impressive results on robotic and network optimization environments, with many ablations pertaining to the hierarchical structure.
3. The related work section is good, and the limitations of the method are clearly stated in the discussion section.

Weaknesses

1. The main weaknesses in this paper are the experiments. Although the authors mention prior related hierarchical approaches for offline learning, the comparisons with such sophisticated approaches are missing in the experiment section. This makes it hard to judge the efficacy of the method against prior approaches. A more detailed comparison with such baselines on complex multi-level reasoning tasks is thus necessary.
2. Further, Algorithm 1 provides minimal information for the hierarchical fo

Videos

Offline Hierarchical Reinforcement Learning via Inverse Optimization · SlidesLive

Taxonomy

Topics: Elevator Systems and Control · Reinforcement Learning in Robotics