HAEPO: History-Aggregated Exploratory Policy Optimization
Gaurish Trivedi, Alakh Sharma, Kartikey Singh Bhandari, Dhruv Kumar, Pratik Narang, Jagat Sesh Challa

TL;DR
HAEPO introduces a history-aware exploratory loss that compresses trajectories into cumulative log-likelihoods, enabling broader exploration and stable learning in long-horizon tasks, outperforming existing methods like PPO, GRPO, and DPO.
Contribution
The paper proposes HAEPO, a novel method that leverages full-trajectory history with a cumulative log-likelihood approach and a Plackett-Luce softmax for improved exploration and stability.
Findings
HAEPO converges quickly and explores thoroughly.
It closely aligns with true rewards across tasks.
It demonstrates robustness comparable to or better than PPO, GRPO, and DPO.
Abstract
Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates to…
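The core computation described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract is truncated here and does not give the exact loss, so the sketch assumes that the softmax is taken over the cumulative log-likelihoods of a batch of trajectories, that the resulting weights scale each trajectory's return, and that the entropy term is the entropy of those weights. The names `haepo_style_loss`, `cumulative_log_likelihood`, and `beta` are hypothetical.

```python
import math


def cumulative_log_likelihood(step_log_probs):
    """Compress one trajectory into the sum of its per-step log-probabilities."""
    return sum(step_log_probs)


def softmax(scores):
    """Numerically stable softmax over a list of trajectory scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def haepo_style_loss(trajectories, returns, beta=0.01):
    """Assumed form of the loss: -(sum_i w_i * R_i + beta * H(w)).

    trajectories: list of lists of per-step log-probabilities under the policy
    returns:      one scalar return per trajectory
    beta:         entropy-regularization coefficient (hypothetical default)
    """
    scores = [cumulative_log_likelihood(t) for t in trajectories]
    weights = softmax(scores)  # normalized across the batch of trajectories
    weighted_return = sum(w * r for w, r in zip(weights, returns))
    entropy = -sum(w * math.log(w) for w in weights if w > 0.0)
    return -(weighted_return + beta * entropy), weights


# Toy batch: three short trajectories with made-up log-probabilities and returns.
trajectories = [[-0.1, -0.2], [-0.5, -0.7], [-0.3, -0.3]]
returns = [1.0, 0.0, 0.5]
loss, weights = haepo_style_loss(trajectories, returns, beta=0.01)
print(loss, weights)
```

In a real training loop the log-probabilities would come from a differentiable policy and the loss would be minimized with an autodiff framework; plain Python is used here only to make the arithmetic explicit.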
