Improving the Data-efficiency of Reinforcement Learning by Warm-starting with LLM
Thang Duong, Minglai Yang, Chicheng Zhang

TL;DR
This paper presents LORO, a method that leverages Large Language Models to generate high-quality initial datasets, significantly improving the sample efficiency and performance of reinforcement learning in classical MDP environments.
Contribution
The paper introduces LORO, a novel approach that combines LLM-generated data with RL to enhance learning efficiency and policy optimality.
Findings
LORO converges to optimal policies in the tested environments.
LORO achieves up to 4 times the cumulative rewards of the pure RL baseline.
Using LLMs for data initialization improves RL sample efficiency.
Abstract
We investigate the use of Large Language Models (LLMs) to collect high-quality data for warm-starting Reinforcement Learning (RL) algorithms in some classical Markov Decision Process (MDP) environments. In this work, we focus on using an LLM to generate an off-policy dataset that sufficiently covers the state-action pairs visited by optimal policies, and then using an RL algorithm to explore the environment and improve the policy suggested by the LLM. Our algorithm, LORO, both converges to an optimal policy and achieves high sample efficiency thanks to the LLM's good starting policy. On multiple OpenAI Gym environments, such as CartPole and Pendulum, we empirically demonstrate that LORO outperforms baseline algorithms such as pure LLM-based policies, pure RL, and a naive combination of the two, achieving up to 4 times the cumulative rewards of the pure RL baseline.
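The two-phase recipe described above (an LLM proposes actions that form an off-policy dataset, then an RL algorithm keeps exploring and improves on the suggested policy) can be illustrated with a toy sketch. This is not the authors' LORO implementation. Several assumptions are introduced for illustration only: the environment is a hypothetical six-state corridor rather than a Gym task, `hypothetical_llm_policy` is a scripted stand-in for LLM-suggested actions, and the learner is plain tabular Q-learning that first replays the offline data (`warm_start`) and then continues online (`q_learning`).

```python
import random
from collections import defaultdict

# Hypothetical toy MDP (not from the paper): corridor of states 0..5,
# actions 0 = left, 1 = right, reward 1.0 only on entering goal state 5.
N_STATES = 6
GOAL = N_STATES - 1
ACTIONS = (0, 1)
MAX_STEPS = 50


def step(state, action):
    """Deterministic transition; returns (next_state, reward, done)."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL


def hypothetical_llm_policy(state, rng):
    """Hypothetical stand-in for LLM-suggested actions: right 80% of the time."""
    return 1 if rng.random() < 0.8 else 0


def collect_offline(policy, episodes, rng):
    """Roll out the suggested policy and store (s, a, r, s2, done) tuples."""
    data = []
    for _ in range(episodes):
        s = 0
        for _ in range(MAX_STEPS):
            a = policy(s, rng)
            s2, r, done = step(s, a)
            data.append((s, a, r, s2, done))
            s = s2
            if done:
                break
    return data


def q_update(q, s, a, r, s2, done, alpha, gamma):
    """One tabular Q-learning update toward the bootstrapped target."""
    target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
    q[(s, a)] += alpha * (target - q[(s, a)])


def greedy(q, s, rng):
    """Greedy action with random tie-breaking."""
    best = max(q[(s, b)] for b in ACTIONS)
    return rng.choice([b for b in ACTIONS if q[(s, b)] == best])


def warm_start(offline, passes=20, alpha=0.5, gamma=0.9):
    """Phase 1: initialize Q by replaying the offline dataset several times."""
    q = defaultdict(float)
    for _ in range(passes):
        for s, a, r, s2, done in offline:
            q_update(q, s, a, r, s2, done, alpha, gamma)
    return q


def q_learning(q, episodes, rng, alpha=0.5, gamma=0.9, eps=0.1):
    """Phase 2: epsilon-greedy online Q-learning; mutates q in place.

    Returns the cumulative reward, which here equals the number of
    episodes that reached the goal within MAX_STEPS.
    """
    total = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(MAX_STEPS):
            a = rng.choice(ACTIONS) if rng.random() < eps else greedy(q, s, rng)
            s2, r, done = step(s, a)
            q_update(q, s, a, r, s2, done, alpha, gamma)
            total += r
            s = s2
            if done:
                break
    return total


rng = random.Random(0)
offline = collect_offline(hypothetical_llm_policy, episodes=20, rng=rng)
q_warm = warm_start(offline)                       # warm-started learner
warm_return = q_learning(q_warm, 30, rng)
cold_return = q_learning(defaultdict(float), 30, rng)  # learner from scratch
print(f"warm-start return: {warm_return}, cold-start return: {cold_return}")
```

Under these assumptions, replaying the dataset should already make "right" the greedy action in every non-goal state, so the online phase starts from a good policy and mainly refines it, while the cold-start learner begins as a random walk over the corridor.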
Taxonomy
Topics: Reinforcement Learning in Robotics · Machine Learning and Algorithms · Machine Learning and Data Classification
