Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL

Jinwoo Choi; Sang-Hyun Lee; Seung-Woo Seo

arXiv:2602.03389·cs.LG·February 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces CoGHP, a hierarchical policy framework for offline goal-conditioned reinforcement learning that models decision sequences autoregressively, enabling better handling of complex, long-horizon tasks.

Contribution

It proposes a novel autoregressive hierarchical approach with a unified architecture and MLP-Mixer backbone, advancing long-horizon offline goal-conditioned RL.

Findings

01 Outperforms strong offline baselines on navigation benchmarks

02 Improves performance on long-horizon tasks

03 Effectively models complex decision sequences

Abstract

Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this…
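The generation order described in the abstract (latent subgoals first, then the primitive action, each conditioned on everything produced so far) can be illustrated with a minimal sketch. This is not the authors' implementation: `generate_chain`, `mean_predictor`, and the list-of-floats token representation are hypothetical names chosen for illustration, and the averaging predictor is only a stand-in for the learned network.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Predictor = Callable[[Sequence[Vector]], Vector]


def mean_predictor(context: Sequence[Vector]) -> Vector:
    """Toy stand-in for a learned model: element-wise mean of the context tokens."""
    dim = len(context[0])
    return [sum(tok[d] for tok in context) / len(context) for d in range(dim)]


def generate_chain(
    state: Vector,
    goal: Vector,
    num_subgoals: int,
    predictor: Predictor = mean_predictor,
) -> Tuple[List[Vector], Vector]:
    """Autoregressively produce `num_subgoals` latent subgoals, then one action.

    Assumption: state, goal, subgoals, and action share one vector space so a
    single predictor can emit every token; a real model need not do this.
    """
    context: List[Vector] = [list(state), list(goal)]
    subgoals: List[Vector] = []
    for _ in range(num_subgoals):
        subgoal = predictor(context)  # sees state, goal, and all earlier subgoals
        subgoals.append(subgoal)
        context.append(subgoal)       # feed the prediction back in as context
    action = predictor(context)       # final step conditions on the full chain
    return subgoals, action


if __name__ == "__main__":
    subgoals, action = generate_chain([0.0, 0.0], [4.0, 2.0], num_subgoals=2)
    print(subgoals, action)
```

The point of the sketch is the growing context: every prediction still has access to the state and final goal, and each new subgoal is appended before the next prediction is made.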

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* Originality: Taking inspiration from chain-of-thought to reformulate HRL as autoregressive sequence modeling within a single unified network is novel. The work also shows that this formulation allows for end-to-end optimization, which is practical and efficient.
* Significance: The experiments show that the MLP-Mixer backbone consistently outperforms strong baselines that use a Transformer architecture across various tasks. The experiment benchmark is also complex enough

Weaknesses

* The term chain-of-thought is a little misleading: while it is valid as an inspiration, in practice it does not seem to directly influence the architecture. The subgoals are not explicit, interpretable reasoning steps like their counterparts in LLMs; rather, they are embeddings supervised by fixed-step future states. The work could perhaps make it clearer that its contribution is focused on proposing unified autoregressive generation and highlight that this allows for end-to-en

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The paper targets a pain point in offline goal-conditioned RL for long horizons and argues for a cohesive alternative to two-level hierarchies. Formulating hierarchical control as autoregressive subgoal generation inside one network is conceptually neat and practically appealing, since it preserves access to the final goal and allows gradients to flow through all stages.
* Empirically, the method shows strong results across diverse domains. Notably, it improves success on difficult giant maze

Weaknesses

* The novelty is partly architectural refactoring and framing. The chain-of-thought analogy is evocative, but the subgoals are supervised by fixed k-step future states from trajectories. This is closer to structured imitation with value-based weighting than to learned reasoning.
* The methodology introduces a potential train–test mismatch. Training uses teacher forcing with ground-truth subgoals, while inference relies on the model’s own subgoal predictions. The paper acknowledges teacher forcing bu
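The two mechanisms this review refers to, fixed k-step future states as subgoal targets and teacher forcing during training, can be sketched as follows. This is an illustrative reading of the reviewer's description, not the paper's code: the function names, the integer state representation, the clipping at the end of the trajectory, and the nearest-first ordering of targets are all assumptions.

```python
from typing import List, Sequence


def k_step_targets(
    trajectory: Sequence[int], t: int, k: int, num_subgoals: int
) -> List[int]:
    """Pick the states at t+k, t+2k, ... as subgoal targets.

    Assumption: indices past the end of the trajectory are clipped to the
    final state; the paper may handle this case differently.
    """
    last = len(trajectory) - 1
    return [trajectory[min(t + i * k, last)] for i in range(1, num_subgoals + 1)]


def teacher_forced_contexts(
    state: int, goal: int, targets: Sequence[int]
) -> List[List[int]]:
    """Context available when predicting each subgoal under teacher forcing.

    The i-th prediction conditions on the ground-truth targets before it,
    never on the model's own earlier outputs.
    """
    return [[state, goal, *targets[:i]] for i in range(len(targets))]


if __name__ == "__main__":
    trajectory = list(range(10))  # hypothetical trajectory with states 0..9
    targets = k_step_targets(trajectory, t=2, k=3, num_subgoals=3)
    print(targets)
    print(teacher_forced_contexts(trajectory[2], trajectory[-1], targets))
```

At inference time no ground-truth targets exist, so the same contexts would have to be filled with the model's own predictions; that substitution is the train–test mismatch the reviewer describes.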

Reviewer 03 · Rating 4 · Confidence 3

Strengths

CoGHP demonstrates solid empirical performance on long-horizon tasks vs HIQL, highlighting benefits of multi-subgoal generation for navigation. The unified MLP-Mixer architecture enables end-to-end training, reducing the fragmentation in separate network methods like HIQL, and ablations confirm the causal mixer's role in complex settings.

Weaknesses

1. It seems like the chain-of-thought inspiration is superficial rather than intuitive or practical: latent subgoals are opaque embeddings, not explicit reasoning steps, making the LLM analogy more rhetorical than substantive. The LLM analogy is conceptually motivating but technically superficial. The real contributions are (a) unified autoregressive generation and (b) end-to-end training. Also, there is no ablation to test forward subgoal generation, leaving the reverse-order claim unverified.

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning