Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Zeyuan Liu; Jeonghye Kim; Xufang Luo; Dongsheng Li; Yuqing Yang

arXiv:2602.23008·cs.LG·March 9, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces EMPO$^2$, a hybrid reinforcement learning framework that uses memory to drive exploration and combines on- and off-policy updates, so that large language model agents perform well with memory and remain robust without it.

Contribution

EMPO$^2$ is a novel hybrid RL approach that integrates memory with on- and off-policy updates to improve exploration and robustness in LLM agents.

Findings

01

Achieves 128.6% and 11.3% improvements over GRPO on ScienceWorld and WebShop, respectively.

02

Demonstrates superior out-of-distribution adaptability to new tasks, requiring only a few trials with memory and no parameter updates.

03

Enhances exploration and generalization in LLM-based agents.
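The percentages in the first finding can be read as multiplicative factors, assuming they are relative improvements over the GRPO score (this page does not list the raw scores). The symbols $s_{\text{EMPO}^2}$ and $s_{\text{GRPO}}$ are introduced here only for illustration:

$$
\text{improvement} = \frac{s_{\text{EMPO}^2} - s_{\text{GRPO}}}{s_{\text{GRPO}}} \times 100\%
\quad\Longrightarrow\quad
s_{\text{EMPO}^2} = \left(1 + \frac{\text{improvement}}{100\%}\right) s_{\text{GRPO}}
$$

Under that reading, 128.6% corresponds to roughly 2.29 times the GRPO score on ScienceWorld, and 11.3% to roughly 1.11 times the GRPO score on WebShop.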

Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO$^{2}$), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO$^{2}$ achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO$^{2}$ demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO$^{2}$ as a promising framework for building more exploratory and generalizable LLM-based agents.
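The hybrid scheme described in the abstract can be pictured with a small sampling sketch. This is not the authors' implementation. It assumes each rollout is prompted with memory with probability ½, and that a memory-prompted rollout is used for an off-policy update (scored under the memory-free prompt) with probability ⅓. The reviews on this page cite ½ and ⅓ as the sampling probabilities, but how they map onto these two branches is an assumption, and every name below is hypothetical.

```python
import random
from collections import Counter

# Assumed values: the reviews mention 1/2 and 1/3, but the exact
# mapping to these two branches is a guess made for illustration.
P_MEMORY_ROLLOUT = 1 / 2  # chance a rollout is prompted with memory
P_OFF_POLICY = 1 / 3      # chance a memory rollout gets an off-policy update


def sample_mode(rng: random.Random) -> tuple[str, str]:
    """Return (rollout_mode, update_mode) for a single rollout."""
    if rng.random() >= P_MEMORY_ROLLOUT:
        # No memory in the prompt, so the update is plainly on-policy.
        return ("no_memory", "on_policy")
    if rng.random() < P_OFF_POLICY:
        # The rollout saw memory, but the update would score the same
        # actions under the memory-free prompt, making the data off-policy.
        return ("memory", "off_policy")
    # The rollout saw memory and the update keeps memory in the prompt.
    return ("memory", "on_policy")


def mode_counts(n: int, seed: int = 0) -> Counter:
    """Count how often each (rollout, update) mode is drawn in n rollouts."""
    rng = random.Random(seed)
    return Counter(sample_mode(rng) for _ in range(n))


if __name__ == "__main__":
    for mode, count in sorted(mode_counts(6000).items()):
        print(mode, count)
```

Under these assumed probabilities, about half of the rollouts are memory-free, about a third are memory-prompted with on-policy updates, and about a sixth are memory-prompted with off-policy updates.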

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

Although each individual part of the proposed method is not original, combining them under one framework is an important contribution that has not been done before, particularly in the important but still early field of RL on LLMs. The quality is good, mainly focusing on showing saturation of two in-distribution benchmarks. The paper also demonstrated signs of life on out-of-distribution benchmarks. The paper also sought to understand each component’s importance by doing ablations,

Weaknesses

I am confused about the hyperparameter choices for sampling between memory and non-memory rollouts and between on-policy and off-policy updates. There is no explanation for these choices (½ and ⅓, respectively), and there are no ablations or sweeps (although Section 6.3 does ablate the entire components). Some of the plots could have better reporting, for example Figure 1B (no error bars across the seeds) or Figure 8. For some of the baselines, I am concerned about the reported numbe

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- Paper is well-motivated and well-written. Justification for improved exploration in RL for LLMs is sound.
- Use of memory in both rollout and update phase is simple yet novel in the context of RL for LLMs.
- Strong results on ScienceWorld which demonstrate the OOD generalization of their method (due to generality of memory).

Weaknesses

- Lack of ablations. The method introduces additional hyperparameters and components, the effects of which are largely undocumented.
- Effect of intrinsic reward component. What is the effect of this component on the performance of the final policy (paper only documents the effect on policy entropy)? How generalizable is this reward term? It seems as if it may require further reward-shaping (i.e. tuning similarity threshold) to generalize to newer domains where naive state similarity may lea

Reviewer 03 · Rating 8 · Confidence 4

Strengths

**Overall, the paper is quite strong and I recommend acceptance of the paper.**

## Novelty

The paper proposes a mechanism for self-generating memory, incorporating memory into the rollout mechanism in order to avoid past mistakes, promote exploration and achieve better rollouts. Moreover, to the best of my knowledge, this paper is the first to use off-policy learning to then distill these **hint-augmented** prompts back into the model’s parametric knowledge. This is remarkable and also wh

Weaknesses

As mentioned above, I really like this paper. However, I would note the following weaknesses:

## Comparison on single turn reasoning tasks

The idea of off-policy updates using previously generated hints can be useful beyond the tasks used in this paper. Particularly, this can help regarding single turn reasoning tasks like math/coding. This is the single most important point where the paper's results can be improved. **If the authors can demonstrate the usefulness of their framework on these

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Multimodal Machine Learning Applications