Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An

TL;DR
This paper introduces Hierarchy-of-Groups Policy Optimization (HGPO), a method for long-horizon agentic tasks. It uses hierarchical grouping to address context inconsistency in stepwise advantage estimation, which improves policy performance.
Contribution
The paper proposes HGPO, a hierarchical grouping approach that improves stepwise advantage estimation in long-horizon RL without extra models or rollouts.
Findings
HGPO outperforms existing methods on ALFWorld and WebShop tasks.
HGPO addresses context inconsistency in stepwise advantage estimation.
HGPO achieves a better bias-variance trade-off in policy optimization.
Abstract
Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO…
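To make the context inconsistency issue concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes GRPO-style normalisation (subtract the group mean, divide by the group standard deviation) and groups steps by current state only. The step tuples, state names, and returns are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Normalise returns within one group: (r - mean) / (std + eps)."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]


# Hypothetical steps: (history, current_state, return).
steps = [
    (("kitchen",), "fridge", 1.0),
    (("kitchen",), "fridge", 0.0),
    (("bedroom", "hall"), "fridge", 0.0),
    (("bedroom", "hall"), "fridge", 0.0),
]

# Step-level grouping keyed on the current state alone (assumed here).
by_state = defaultdict(list)
for history, state, ret in steps:
    by_state[state].append((history, ret))

group = by_state["fridge"]
advantages = group_relative_advantages([ret for _, ret in group])

# All four steps share one baseline even though they come from
# two different histories: this is the inconsistency in question.
distinct_histories = {history for history, _ in group}
print(advantages, len(distinct_histories))
```

In this toy group, steps reached through different histories are compared against a single shared mean, which is the situation the abstract identifies as a source of biased advantage estimates.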
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper presents a well-motivated study that clearly illustrates the limitations of trajectory-level, traditional step-level, and oracle-step advantage computations in Figures 1 and 2, thereby making the proposed method appear both natural and intuitively justified. 2. Both the high-level idea and implementation of HGPO are simple yet well-grounded, making it an effective solution to address the context-inconsistency problem. 3. HGPO introduces almost no additional computational overhead
1. The context inconsistency problem is only partially addressed in this paper rather than being fundamentally resolved. As the value of K increases, the number of grouped samples noticeably decreases. This indicates that for longer-horizon tasks, HGPO becomes less effective, since the higher-level history groups become increasingly sparse under HGPO’s grouping mechanism. 2. Therefore, HGPO is effective only when K is relatively small. In the experiments, K is set to 2 and 4, corresponding to
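The sparsity concern raised in this review can be illustrated with a small Python sketch. It assumes that a level-k group contains the steps that share the current state and the last k history entries. This grouping key is an assumption made for illustration and may differ from the paper's definition. The data and the helper names `context_key` and `mean_group_size` are hypothetical.

```python
from collections import Counter


def context_key(history, state, k):
    """Current state plus the last k history entries (k = 0 means state only)."""
    return (tuple(history[-k:]) if k > 0 else (), state)


def mean_group_size(steps, k):
    """Average number of steps that share the same k-step context."""
    counts = Counter(context_key(history, state, k) for history, state in steps)
    return len(steps) / len(counts)


# Hypothetical steps: (history, current_state).
steps = [
    (("a", "b"), "s"),
    (("c", "b"), "s"),
    (("a", "d"), "s"),
    (("c", "d"), "s"),
]

sizes = [mean_group_size(steps, k) for k in range(3)]
print(sizes)  # [4.0, 2.0, 1.0]
```

As the required shared context grows from 0 to 2 steps, the average group shrinks from 4 steps to 1. A group of size 1 carries no relative signal.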
The paper's primary strength is its elegant and well-motivated mechanism for managing the bias-variance trade-off in advantage estimation. Rather than forcing a binary choice between GiGPO's high-bias/low-variance step-level group and a low-bias/high-variance Oracle group, HGPO proposes a more nuanced solution. By aggregating advantage signals across all K+1 levels of the hierarchy, it provides a principled interpolation that leverages the low-bias signal from high-consistency groups while retai
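The aggregation across the K+1 hierarchy levels described in this review could look roughly like the Python sketch below. It rests on several assumptions that are not confirmed by this summary: uniform weights over levels, mean-centred advantages without standard-deviation scaling, and a level-k group defined by the current state plus the last k history entries. The function name `hierarchical_advantages` and the data are hypothetical.

```python
from collections import defaultdict


def context_key(history, state, k):
    """Current state plus the last k history entries (k = 0 means state only)."""
    return (tuple(history[-k:]) if k > 0 else (), state)


def hierarchical_advantages(steps, K):
    """Average mean-centred advantages over depths k = 0..K (uniform weights assumed)."""
    totals = [0.0] * len(steps)
    for k in range(K + 1):
        # Partition step indices by their k-step context.
        groups = defaultdict(list)
        for i, (history, state, _) in enumerate(steps):
            groups[context_key(history, state, k)].append(i)
        # Within each group, the advantage is the return minus the group mean.
        for members in groups.values():
            baseline = sum(steps[i][2] for i in members) / len(members)
            for i in members:
                totals[i] += steps[i][2] - baseline
    return [t / (K + 1) for t in totals]


# Hypothetical steps: (history, current_state, return).
steps = [
    (("a",), "s", 1.0),
    (("a",), "s", 0.0),
    (("b",), "s", 0.0),
    (("b",), "s", 0.0),
]

adv = hierarchical_advantages(steps, K=1)
print(adv)  # [0.625, -0.375, -0.125, -0.125]
```

In this toy case the two steps with history `("b",)` receive a smaller penalty than under state-only grouping (-0.25 each), because the deeper level compares them only with each other.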
1. Its validity is tightly coupled to the assumption that the agent's "memory" is equivalent to its "raw historical context," as implemented in the paper's prompts (Appendix C.5). This raw-history-as-memory paradigm, while common, is limited. The proposed $C_k$-based grouping would not directly generalize to more advanced agents that utilize summarized memory, as the grouping basis would misalign with the agent's true decision-making state. 2. The claim of minimal additional time cost is unsub
1. The paper is well-written. Figure 2, which illustrates the "context inconsistency" problem, clearly frames the bias-variance dilemma that existing methods face (GiGPO = high-bias, Oracle = high-variance). 2. HGPO is an intuitive and novel solution. 3. HGPO achieves state-of-the-art results, significantly outperforming its baselines. 4. The ablations are comprehensive.
1. The proof of the bias-variance trade-off in Appendix B relies on the assumption $b_k \ge b_{k+1}$ (bias decreases as context depth $k$ increases). This assumption is stated but never justified or proven. The authors must provide a formal argument for this assumption or re-frame Proposition 4.1 as a heuristic analysis. 2. The paper's claim to address "long-horizon" tasks is not supported by the evidence. Appendix C.3 reveals the maximum episode lengths are T=50 (ALFWorld) and T=15 (WebShop).
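For readers following the first point of this review, the role of the monotonicity assumption $b_k \ge b_{k+1}$ can be seen in a short worked bound. This is an illustrative sketch and not the paper's Proposition 4.1. It assumes that $\hat{A}_k$ is the level-$k$ advantage estimator, that $b_k$ bounds its absolute bias, and that the combined estimator uses non-negative weights $w_k$ summing to one. All of this notation is introduced here for illustration.

```latex
\hat{A} = \sum_{k=0}^{K} w_k \hat{A}_k,
\qquad w_k \ge 0,
\qquad \sum_{k=0}^{K} w_k = 1,
\qquad \bigl|\mathbb{E}[\hat{A}_k] - A\bigr| \le b_k .

\bigl|\mathbb{E}[\hat{A}] - A\bigr|
  = \Bigl|\sum_{k=0}^{K} w_k \bigl(\mathbb{E}[\hat{A}_k] - A\bigr)\Bigr|
  \le \sum_{k=0}^{K} w_k b_k
  \le b_0 .
```

The first inequality follows from linearity of expectation and the triangle inequality. The second uses $b_k \le b_0$ for every $k$, which is exactly where the unproven assumption $b_k \ge b_{k+1}$ enters. Without it, the combined estimator is not guaranteed to have lower bias than the level-0 (state-only) group.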
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Data Classification · Multimodal Machine Learning Applications
