Memory-Based Advantage Shaping for LLM-Guided Reinforcement Learning
Narjes Nourzad, Carlee Joe-Wong

TL;DR
This paper introduces a memory-based advantage shaping method for LLM-guided reinforcement learning that improves sample efficiency by leveraging a memory graph of subgoals and trajectories, reducing reliance on frequent LLM calls.
Contribution
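The authors propose a memory graph that encodes subgoals and trajectories drawn from both LLM guidance and the agent's own successful rollouts. A utility derived from this graph shapes the advantage function. This reduces the need for continuous LLM interaction and improves sample efficiency.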
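The summary does not specify how the memory graph is stored. The sketch below shows one plausible shape for it. Nodes are subgoal labels. A directed edge a -> b accumulates weight each time a stored successful trajectory moves from subgoal a directly to subgoal b. The class name `MemoryGraph`, the `weight` parameter (for example, counting LLM-proposed plans less than the agent's own rollouts), and the normalization into transition probabilities are all illustrative assumptions, not the authors' design.

```python
from collections import defaultdict


class MemoryGraph:
    """Hypothetical memory graph over subgoals (illustrative, not the paper's code).

    Nodes are subgoal labels; edge a -> b stores the total weight of stored
    successful trajectories in which subgoal b directly followed subgoal a.
    """

    def __init__(self) -> None:
        self.edges = defaultdict(lambda: defaultdict(float))

    def add_trajectory(self, subgoals: list[str], weight: float = 1.0) -> None:
        # Record every consecutive subgoal pair. The weight lets LLM-proposed
        # plans and the agent's own successful rollouts count differently
        # (an assumption made for this sketch).
        for a, b in zip(subgoals, subgoals[1:]):
            self.edges[a][b] += weight

    def transition_probs(self, subgoal: str) -> dict[str, float]:
        # Normalize outgoing edge weights into a distribution over next subgoals.
        # .get() avoids inserting empty entries into the defaultdict.
        out = self.edges.get(subgoal, {})
        total = sum(out.values())
        return {b: w / total for b, w in out.items()} if total > 0 else {}


if __name__ == "__main__":
    graph = MemoryGraph()
    graph.add_trajectory(["start", "get_key", "open_door", "goal"], weight=0.5)  # LLM plan
    graph.add_trajectory(["start", "get_key", "open_door", "goal"])  # agent rollout
    graph.add_trajectory(["start", "get_ball", "goal"])  # agent rollout
    print(graph.transition_probs("start"))
```

In this toy run, "get_key" follows "start" with total weight 1.5 and "get_ball" with weight 1.0. The memory therefore prefers the key route without ruling out the alternative.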
Findings
Improved sample efficiency in benchmark environments.
Faster early learning compared to baseline RL methods.
Final performance comparable to methods with frequent LLM use.
Abstract
In environments with sparse or delayed rewards, reinforcement learning (RL) incurs high sample complexity due to the large number of interactions needed for learning. This limitation has motivated the use of large language models (LLMs) for subgoal discovery and trajectory guidance. While LLMs can support exploration, frequent reliance on LLM calls raises concerns about scalability and reliability. We address these challenges by constructing a memory graph that encodes subgoals and trajectories from both LLM guidance and the agent's own successful rollouts. From this graph, we derive a utility function that evaluates how closely the agent's trajectories align with prior successful strategies. This utility shapes the advantage function, providing the critic with additional guidance without altering the reward. Our method relies primarily on offline input and only occasional online…
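The abstract states that a graph-derived utility shapes the advantage while the reward stays unchanged, but it gives no formula. The following is a minimal sketch under explicit assumptions. The base advantage comes from standard Generalized Advantage Estimation (GAE), a swapped-in choice. The hypothetical `alignment_utility` scores a trajectory by the fraction of its subgoal transitions that also appear as edges in memory. The hypothetical `shaped_advantage` adds `beta` times that score to every step's advantage. None of these names, the trajectory-level utility, or the additive form are confirmed by the source.

```python
import numpy as np


def gae(rewards, values, gamma=0.99, lam=0.95):
    # Standard GAE for one episode; `values` holds V(s_0..s_T), i.e. one more
    # entry than `rewards` (the last entry is the bootstrap value).
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    adv = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


def alignment_utility(subgoals, memory_edges):
    # Hypothetical utility in [0, 1]: share of the agent's consecutive subgoal
    # transitions that also appear as edges in the memory graph.
    # `memory_edges` maps a subgoal to the set of subgoals seen after it.
    pairs = list(zip(subgoals, subgoals[1:]))
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if b in memory_edges.get(a, ()))
    return hits / len(pairs)


def shaped_advantage(rewards, values, subgoals, memory_edges, beta=0.1, **gae_kwargs):
    # The reward signal is left untouched; only the advantage receives the
    # utility bonus (assumed additive and constant across the trajectory).
    adv = gae(rewards, values, **gae_kwargs)
    return adv + beta * alignment_utility(subgoals, memory_edges)


if __name__ == "__main__":
    memory = {"start": {"get_key"}, "get_key": {"open_door"}, "open_door": {"goal"}}
    rewards, values = [0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]
    print(shaped_advantage(rewards, values, ["start", "get_key", "open_door", "goal"],
                           memory, beta=0.1, gamma=1.0, lam=1.0))
```

Because the bonus enters only the advantage, the environment reward and the critic's value targets are unchanged. The shaping only tilts the policy update toward trajectories that resemble remembered successes. In a real implementation, `beta` would likely need tuning or annealing so that the memory does not override the true return signal.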
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling
