Chain-of-Goals Hierarchical Policy for Long-Horizon Offline Goal-Conditioned RL
Jinwoo Choi, Sang-Hyun Lee, Seung-Woo Seo

TL;DR
This paper introduces CoGHP, a hierarchical policy framework for offline goal-conditioned reinforcement learning that models decision sequences autoregressively, enabling better handling of complex, long-horizon tasks.
Contribution
It proposes a novel autoregressive hierarchical approach with a unified architecture and an MLP-Mixer backbone, advancing long-horizon offline goal-conditioned RL.
Findings
Outperforms strong offline baselines on navigation benchmarks
Improves performance on long-horizon tasks
Effectively models complex decision sequences
Abstract
Offline goal-conditioned reinforcement learning remains challenging for long-horizon tasks. While hierarchical approaches mitigate this issue by decomposing tasks, most existing methods rely on separate high- and low-level networks and generate only a single intermediate subgoal, making them inadequate for complex tasks that require coordinating multiple intermediate decisions. To address this limitation, we draw inspiration from the chain-of-thought paradigm and propose the Chain-of-Goals Hierarchical Policy (CoGHP), a novel framework that reformulates hierarchical decision-making as autoregressive sequence modeling within a unified architecture. Given a state and a final goal, CoGHP autoregressively generates a sequence of latent subgoals followed by the primitive action, where each latent subgoal acts as a reasoning step that conditions subsequent predictions. To implement this…
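The generation order described in the abstract (state and final goal in, a chain of latent subgoals out, then one primitive action) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the random linear maps stand in for the learned network, zero-padding stands in for causal masking, and every name here (`make_linear`, `apply_linear`, `generate_chain`, and all dimensions) is hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Hypothetical stand-in for a learned layer: a small random linear map."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Multiply the weight matrix by the input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def generate_chain(state, goal, num_subgoals, latent_dim, action_dim, seed=0):
    """Autoregressively emit latent subgoals, then one primitive action.

    Every step sees the state, the final goal, and all subgoals generated so far.
    The same heads are reused at every step (an assumption made for brevity).
    """
    rng = random.Random(seed)
    max_ctx = len(state) + len(goal) + num_subgoals * latent_dim
    subgoal_head = make_linear(max_ctx, latent_dim, rng)
    action_head = make_linear(max_ctx, action_dim, rng)

    def context(subgoals):
        flat = list(state) + list(goal)
        for sg in subgoals:
            flat.extend(sg)
        # Slots for subgoals not generated yet are zero-padded, so a step
        # can never read information from a later step.
        return flat + [0.0] * (max_ctx - len(flat))

    subgoals = []
    for _ in range(num_subgoals):
        subgoals.append(apply_linear(subgoal_head, context(subgoals)))
    # The primitive action is conditioned on the full chain of subgoals.
    action = apply_linear(action_head, context(subgoals))
    return subgoals, action


subgoals, action = generate_chain(
    state=[0.0, 1.0], goal=[1.0, 0.0], num_subgoals=3, latent_dim=4, action_dim=2
)
```

The point of the sketch is only the control flow: the final goal stays visible at every step, and each latent subgoal conditions everything generated after it.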
Peer Reviews
Decision · Submitted to ICLR 2026
* Originality: Taking inspiration from chain-of-thought to reformulate HRL tasks as autoregressive sequence generation within a single unified network is novel. The work also shows that this formulation allows for end-to-end optimization, which is practical and efficient.
* Significance: The experiments show that an MLP-Mixer backbone consistently performs better than stronger baselines that use a Transformer architecture across various tasks. The experiment benchmark is also complex enough
* Using the term chain-of-thought is a little misleading: while it is valid as an inspiration, in practice it does not seem to directly influence the architecture. The subgoals are not explicit, interpretable reasoning steps like their counterparts in LLMs; rather, they are embeddings supervised by fixed-step future states. The work could perhaps make it clearer that its contribution is focused on proposing a unified autoregressive generation and highlight that this allows for end-to-en
* The paper targets a pain point in offline goal-conditioned RL for long horizons and argues for a cohesive alternative to two-level hierarchies. Formulating hierarchical control as autoregressive subgoal generation inside one network is conceptually neat and practically appealing, since it preserves access to the final goal and allows gradients to flow through all stages.
* Empirically, the method shows strong results across diverse domains. Notably, it improves success on difficult giant maze
* The novelty is partly architectural refactoring and framing. The chain-of-thought analogy is evocative, but the subgoals are supervised by fixed k-step future states from trajectories. This is closer to structured imitation with value-based weighting than to learned reasoning. The methodology introduces potential train–test mismatch. Training uses teacher forcing with ground-truth subgoals, while inference relies on the model’s own subgoal predictions. The paper acknowledges teacher forcing bu
CoGHP demonstrates solid empirical performance on long-horizon tasks versus HIQL, highlighting the benefits of multi-subgoal generation for navigation. The unified MLP-Mixer architecture enables end-to-end training, reducing the fragmentation of separate-network methods like HIQL, and ablations confirm the causal mixer's role in complex settings.
1. It seems like the chain-of-thought inspiration is superficial rather than intuitive or practical: latent subgoals are opaque embeddings, not explicit reasoning steps, making the LLM analogy more rhetorical than substantive. The LLM analogy is conceptually motivating but technically superficial. The real contributions are (a) unified autoregressive generation and (b) end-to-end training. Also, there is no ablation to test forward subgoal generation, leaving the reverse-order claim unverified.
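The train–test mismatch raised in the reviews above (teacher forcing with ground-truth subgoals during training, self-generated subgoals at inference) is a general property of autoregressive models. The toy below illustrates it; it says nothing about how large the effect is in CoGHP. The one-step model with a constant bias, and the names `predict_next`, `teacher_forced`, and `free_running`, are assumptions made purely for illustration.

```python
def predict_next(prev, bias=0.1):
    """Hypothetical imperfect one-step model: the true rule is prev + 1."""
    return prev + 1.0 + bias


def teacher_forced(ground_truth):
    """Each prediction is conditioned on the ground-truth previous element."""
    return [predict_next(prev) for prev in ground_truth[:-1]]


def free_running(first, steps):
    """Each prediction is conditioned on the model's own previous output."""
    outputs, prev = [], first
    for _ in range(steps):
        prev = predict_next(prev)
        outputs.append(prev)
    return outputs


truth = [0.0, 1.0, 2.0, 3.0, 4.0]
# Per-step absolute error under each conditioning scheme.
tf_err = [abs(p - t) for p, t in zip(teacher_forced(truth), truth[1:])]
fr_err = [abs(p - t) for p, t in zip(free_running(truth[0], 4), truth[1:])]
```

Under teacher forcing the per-step error stays at the bias, while in the free-running rollout it accumulates with each step, which is why an evaluation of inference-time subgoal quality is a reasonable thing for reviewers to ask about.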
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
