DHP: Discrete Hierarchical Planning for Hierarchical Reinforcement Learning Agents

Shashank Sharma; Janina Hoffmann; Vinay Namboodiri

arXiv:2502.01956·cs.RO·December 22, 2025

Open Access · 3 Reviews

TL;DR

DHP introduces a discrete, hierarchical planning approach for reinforcement learning that improves long-horizon visual planning, achieves high success rates, and generalizes across tasks with efficient replanning.

Contribution

The paper presents a novel discrete hierarchical planning method that replaces continuous metrics with reachability checks, enhancing planning efficiency and generalization in HRL agents.

Findings

01

Achieves 100% success in 25-room navigation

02

Sets new state-of-the-art on OGBench benchmarks

03

Requires only log N steps for replanning
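
One way to read the log N figure, offered here as an assumption rather than a statement of the paper's procedure: if a plan is a balanced binary tree with N leaf subtasks, the next subgoal can be found by expanding only the leftmost branch, which takes about log2 N expansions. The toy sketch below uses integer states and hypothetical helpers (`reachable`, `propose_subgoal`, `first_subgoal`) that stand in for learned components.

```python
import math


def reachable(start: int, goal: int) -> bool:
    """Hypothetical binary reachability check: one low-level step suffices."""
    return abs(goal - start) <= 1


def propose_subgoal(start: int, goal: int) -> int:
    """Hypothetical stand-in for a learned subgoal policy: pick the midpoint."""
    return (start + goal) // 2


def first_subgoal(start: int, goal: int) -> tuple:
    """Expand only the leftmost branch of the plan tree.

    Returns the next directly reachable subgoal and the number of expansions used.
    """
    expansions = 0
    while not reachable(start, goal):
        goal = propose_subgoal(start, goal)
        expansions += 1
    return goal, expansions


# Compare the expansion count with log2 of the number of leaf subtasks.
for n in (8, 64, 1024):
    subgoal, expansions = first_subgoal(0, n)
    print(n, subgoal, expansions, math.log2(n))
```

In this toy setting the number of expansions matches log2 of the start-to-goal distance, because each expansion halves the remaining gap.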

Abstract

Hierarchical Reinforcement Learning (HRL) agents often struggle with long-horizon visual planning due to their reliance on error-prone distance metrics. We propose Discrete Hierarchical Planning (DHP), a method that replaces continuous distance estimates with discrete reachability checks to evaluate subgoal feasibility. DHP recursively constructs tree-structured plans by decomposing long-term goals into sequences of simpler subtasks, using a novel advantage estimation strategy that inherently rewards shorter plans and generalizes beyond training depths. In addition, to address the data efficiency challenge, we introduce an exploration strategy that generates targeted training examples for the planning modules without needing expert data. Experiments in 25-room navigation environments demonstrate a 100% success rate (vs. 90% baseline). We also present an offline variant that achieves…
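
The recursive decomposition described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: states are integers on a line, `reachable` is a hypothetical stand-in for the learned discrete reachability check, and `propose_subgoal` is a hypothetical stand-in for the learned subgoal policy (here it simply bisects).

```python
from typing import List, Tuple

State = int  # toy assumption: states are integers on a line


def reachable(start: State, goal: State) -> bool:
    """Hypothetical binary reachability check: True if one low-level step suffices."""
    return abs(goal - start) <= 1


def propose_subgoal(start: State, goal: State) -> State:
    """Hypothetical stand-in for a learned subgoal policy: pick the midpoint."""
    return (start + goal) // 2


def plan(start: State, goal: State, max_depth: int) -> List[Tuple[State, State]]:
    """Recursively split (start, goal) into a left-to-right list of leaf subtasks.

    Recursion stops when the pair passes the reachability check or the
    depth budget is exhausted.
    """
    if max_depth == 0 or reachable(start, goal):
        return [(start, goal)]
    mid = propose_subgoal(start, goal)
    return plan(start, mid, max_depth - 1) + plan(mid, goal, max_depth - 1)


leaves = plan(0, 8, max_depth=3)
print(leaves)
```

With a depth budget of 3, the toy task from state 0 to state 8 splits into eight one-step subtasks. A smaller budget leaves some leaf subtasks that fail the reachability check, which is the situation a planner would need to penalize.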

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The motivation is clear and the method outlined is mostly clear (see a few clarification questions below)

Weaknesses

- The training procedure involves quite a few moving components, especially the need for extensive exploration to ensure the CVAE offers enough coverage to select suitable sub-goals. This indicates a dependence on the base Director architecture being good enough to reach rewarding trajectories from which the explorer can further improve coverage, so the method might be critically dependent on the task's reward structure.

- The paper could benefit from a clear pseudocode / pictorial view of various stag

Reviewer 02 · Rating 2 · Confidence 4

Strengths

Improves performance on the 25-room navigation domain from a 90% success rate to 100%, as shown in Table 1. The proposed approach also improves the variance of the average episode length. The idea is simple compared with existing HRL methods: it learns to bisect the path between the initial state and the goal state, using the midpoint state to learn lower-level policy functions.

Weaknesses

The applicability of the proposed approach is limited, and it is not clear how subtask generation returns meaningful subtasks/subgoals. Most of the cited related work dates from 2020 or earlier, with few exceptions. Many HRL approaches from 2020 onward are missing, such as option-based HRL, neuro-symbolic planning and RL, and methods for identifying subtasks/subgoals. Here is a partial list of such approaches (and there are more): * Reward machines: Exploiting reward function structure in reinforcement learning *

Reviewer 03 · Rating 6 · Confidence 4

Strengths

Binary reachability may avoid coupling to brittle distance metrics and naturally handles disconnected regions, as the authors claim. The return operators have a contraction property. The method achieves successful results on the 25-room benchmark with competitive path lengths, and ablations show that training with shallow depths still preserves the advantage of the proposed method.

Weaknesses

The flow and contribution of the paper are easy to understand. The resulting model outperforms current SOTA approaches on the standard 25-room task, but not on others. It would be quite sensitive to model error: the cosine_max similarity check may judge that two similar-looking states are close even though the underlying configurations differ. The paper trains a static-state MLP as an approximation, so planning can be sensitive to that approximation error.
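
The aliasing concern can be made concrete with a small sketch. It assumes that `cosine_max` denotes the max-cosine similarity used in Director-style agents, where both vectors are scaled by the larger of their two norms. The threshold value, the `is_reached` helper, and the example feature vectors are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence


def cosine_max(goal: Sequence[float], state: Sequence[float]) -> float:
    """Max-cosine similarity (assumed form): dot(goal, state) / max(|goal|, |state|)^2."""
    dot = sum(g * s for g, s in zip(goal, state))
    norm_goal = math.sqrt(sum(g * g for g in goal))
    norm_state = math.sqrt(sum(s * s for s in state))
    m = max(norm_goal, norm_state)
    if m == 0.0:
        return 0.0  # both vectors are zero; treat as no similarity
    return dot / (m * m)


def is_reached(goal: Sequence[float], state: Sequence[float], threshold: float = 0.9) -> bool:
    """Hypothetical discrete check: the goal counts as reached above a threshold."""
    return cosine_max(goal, state) >= threshold


# Two hypothetical feature vectors that differ only in a small component,
# standing in for states that look alike but differ in configuration.
state_a = [1.0, 0.0, 0.05]
state_b = [1.0, 0.0, -0.05]
print(cosine_max(state_a, state_b), is_reached(state_a, state_b))
```

Because the two vectors differ only in a small-magnitude component, the check treats them as the same state. Whether that is harmful depends on how much of the task-relevant configuration the learned features actually encode.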

Taxonomy

Topics: Reinforcement Learning in Robotics