SAC-GLAM: Improving Online RL for LLM agents with Soft Actor-Critic and Hindsight Relabeling
Loris Gaven, Clement Romac, Thomas Carta, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer

TL;DR
This paper introduces SAC-GLAM, an off-policy RL method with hindsight relabeling for LLM agents, enabling more efficient online learning and autonomous goal pursuit in complex environments.
Contribution
It adapts Soft Actor-Critic and hindsight relabeling for LLM agents, enhancing online learning capabilities beyond on-policy methods.
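To illustrate what a Soft Actor-Critic style update involves when the agent picks among a finite set of candidate actions, here is a minimal sketch of the entropy-regularized ("soft") bootstrap target. It assumes a discrete action set and follows the standard discrete-action SAC formulation; this summary does not confirm that the paper computes its targets exactly this way. The function names, `alpha` (entropy temperature), and `gamma` (discount factor) are illustrative choices, not the authors' code.

```python
import math


def softmax(logits):
    """Turn unnormalized action scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_state_value(action_logits, q_values, alpha):
    """Soft value: V(s) = sum_a pi(a|s) * (Q(s, a) - alpha * log pi(a|s)).

    Assumption: the policy is a softmax over a finite list of candidate
    actions, so the expectation is an exact sum rather than a sample.
    """
    probs = softmax(action_logits)
    return sum(
        p * (q - alpha * math.log(p))
        for p, q in zip(probs, q_values)
        if p > 0.0  # skip actions whose probability underflowed to zero
    )


def soft_td_target(reward, done, next_logits, next_q, alpha=0.1, gamma=0.99):
    """Critic regression target: r + gamma * V(s'), with no bootstrap at episode end."""
    bootstrap = 0.0 if done else soft_state_value(next_logits, next_q, alpha)
    return reward + gamma * bootstrap
```

With `alpha=0` the entropy bonus vanishes and the target reduces to an expected-value TD target under the current policy; a larger `alpha` rewards policies that keep their action distribution spread out.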
Findings
Outperforms on-policy methods in multi-goal RL environments.
Enables autonomous intrinsically motivated LLM agents.
Facilitates learning of efficient strategies in complex tasks.
Abstract
The past years have seen Large Language Models (LLMs) thrive not only as generative models but also as agents solving textual sequential decision-making tasks. When facing complex environments where their zero-shot abilities are insufficient, recent work showed online Reinforcement Learning (RL) could be used for the LLM agent to discover and learn efficient strategies interactively. However, most prior work sticks to on-policy algorithms, which greatly reduces the scope of methods such agents could use for both exploration and exploitation, such as experience replay and hindsight relabeling. Yet, such methods may be key for LLM learning agents, and in particular when designing autonomous intrinsically motivated agents sampling and pursuing their own goals (i.e., autotelic agents). This paper presents and studies an adaptation of Soft Actor-Critic and hindsight relabeling to LLM agents.…
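The two off-policy tools named in the abstract, experience replay and hindsight relabeling, can be sketched together as follows. This is a minimal illustration assuming textual goals, a sparse success reward, and relabeling with the goal actually achieved at the end of the episode (the "final" strategy from Hindsight Experience Replay); the abstract does not specify the paper's relabeling scheme. `Transition`, `ReplayBuffer`, and `relabel_with_achieved_goal` are hypothetical names introduced only for this example.

```python
import random
from collections import deque
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Transition:
    goal: str              # textual goal the agent was pursuing
    observation: str       # textual observation before acting
    action: str            # textual action chosen by the agent
    reward: float
    next_observation: str
    done: bool


def relabel_with_achieved_goal(trajectory, achieved_goal, success_reward=1.0):
    """Return a copy of the trajectory as if `achieved_goal` had been the target.

    Assumption: rewards are sparse, so only the final step is rewarded.
    """
    last = len(trajectory) - 1
    return [
        replace(
            t,
            goal=achieved_goal,
            reward=success_reward if i == last else 0.0,
            done=(i == last),
        )
        for i, t in enumerate(trajectory)
    ]


class ReplayBuffer:
    """Fixed-size store of past transitions, reusable by an off-policy learner."""

    def __init__(self, capacity):
        self._storage = deque(maxlen=capacity)  # oldest entries are evicted first

    def add_trajectory(self, trajectory, achieved_goal=None):
        """Store the trajectory, plus a relabeled copy if another goal was reached."""
        if not trajectory:
            return
        self._storage.extend(trajectory)
        if achieved_goal is not None and achieved_goal != trajectory[0].goal:
            self._storage.extend(
                relabel_with_achieved_goal(trajectory, achieved_goal)
            )

    def sample(self, batch_size, rng=random):
        """Draw a uniform random minibatch without replacement."""
        items = list(self._storage)
        return rng.sample(items, min(batch_size, len(items)))

    def __len__(self):
        return len(self._storage)
```

A failed attempt at one goal thus still yields a successful example for the goal that was actually reached, which an on-policy method would discard after a single update.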
Taxonomy
Topics: Advanced Data Storage Technologies · Advanced Malware Detection Techniques · Security and Verification in Computing
Methods: Experience Replay
