True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
Weihao Tan, Wentao Zhang, Shanqi Liu, Longtao Zheng, Xinrun Wang, Bo An

TL;DR
This paper introduces TWOSOME, a framework that uses large language models as decision-making agents in reinforcement learning environments, improving sample efficiency and generalization while preserving the models' original capabilities.
Contribution
TWOSOME is a novel online framework that aligns LLMs with embodied environments using RL, featuring a parameter-efficient training architecture and enhanced policy stability.
Findings
TWOSOME outperforms PPO and SayCan in decision-making tasks.
It demonstrates superior generalization to unseen tasks.
It preserves LLMs' original abilities during training.
Abstract
Despite the impressive performance across numerous tasks, large language models (LLMs) often fail in solving simple decision-making tasks due to the misalignment of the knowledge in LLMs with environments. On the contrary, reinforcement learning (RL) agents learn policies from scratch, which makes them always align with environments but difficult to incorporate prior knowledge for efficient explorations. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL without requiring any prepared datasets or prior knowledge of the environments. Firstly, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt…
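The first step described in the abstract, turning per-action joint probabilities into a behavior policy, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the per-token log-probabilities are made-up numbers standing in for what an LLM would return, the function name `action_policy` is hypothetical, and dividing by token count is only one plausible form of normalization (the abstract is cut off before it names the two methods the paper proposes).

```python
import math

# Hypothetical per-token log-probabilities for each valid action,
# as an LLM might assign them given the current observation prompt.
EXAMPLE_LOGPROBS = {
    "pick up the tomato": [-0.5, -0.5, -0.5, -0.5],  # 4 tokens
    "chop": [-1.0],                                   # 1 token
}

def action_policy(token_logprobs, normalize=True):
    """Map per-token log-probs of each action to a distribution over actions.

    The joint log-probability of an action is the sum of its token log-probs.
    With normalize=True the sum is divided by the token count (an assumed
    length normalization) so longer action strings are not penalized.
    """
    scores = {}
    for action, logprobs in token_logprobs.items():
        joint = sum(logprobs)
        scores[action] = joint / len(logprobs) if normalize else joint

    # Numerically stable softmax over the action scores.
    top = max(scores.values())
    weights = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

raw_policy = action_policy(EXAMPLE_LOGPROBS, normalize=False)
normalized_policy = action_policy(EXAMPLE_LOGPROBS, normalize=True)
```

With these made-up numbers, the unnormalized policy favors the one-token action "chop" purely because it is shorter, while the length-normalized policy favors "pick up the tomato". That length bias is the kind of instability normalization is meant to reduce.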
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Entropy Regularization · ALIGN · Proximal Policy Optimization
