Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning
Siddhant Bhambri, Amrita Bhattacharjee, Durgesh Kalwar, Lin Guan, Huan Liu, Subbarao Kambhampati

TL;DR
This paper explores using Large Language Models to generate heuristics for reward shaping in reinforcement learning, significantly improving sample efficiency across various domains by leveraging LLMs with or without a verifier.
Contribution
It introduces a novel method of extracting heuristics from LLMs to construct reward shaping functions, enhancing RL sample efficiency in sparse reward environments.
Findings
LLM-generated heuristics improve RL sample efficiency
Verifiers help assess heuristic quality
Significant gains across multiple RL algorithms and domains
Abstract
Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is further pronounced in the case of stochastic transitions. To improve the sample efficiency, reward shaping is a well-studied approach to introduce intrinsic rewards that can help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a multitude of natural language tasks, we aim to answer the following question: "Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?" To this end, we aim to leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We…
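The truncated abstract does not say how the shaping function is built from the LLM-generated plan. As a rough illustration only, the sketch below uses potential-based reward shaping, a standard technique that may differ from the paper's actual construction. It assumes the plan is an ordered list of abstract states and that the potential is simply how far along that plan a state lies. The names `make_plan_potential`, `shaped_reward`, and the example plan are hypothetical and do not come from the paper.

```python
from typing import Callable, Hashable, Sequence


def make_plan_potential(plan: Sequence[Hashable]) -> Callable[[Hashable], float]:
    """Build a potential function from a (hypothetical) LLM-generated plan.

    Assumption: the plan is an ordered sequence of abstract states, and a
    state's potential is its 1-based position in the plan. States that do
    not appear in the plan get potential 0.
    """
    position = {state: i + 1 for i, state in enumerate(plan)}

    def potential(abstract_state: Hashable) -> float:
        return float(position.get(abstract_state, 0))

    return potential


def shaped_reward(
    reward: float,
    state: Hashable,
    next_state: Hashable,
    potential: Callable[[Hashable], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r + gamma * phi(s') - phi(s).

    The potential of a terminal next state is treated as 0, a common
    convention in episodic tasks.
    """
    next_phi = 0.0 if done else potential(next_state)
    return reward + gamma * next_phi - potential(state)


# Hypothetical plan for a key-and-door style task.
plan = ["get_key", "open_door", "reach_goal"]
phi = make_plan_potential(plan)

# Moving forward along the plan earns a positive shaping bonus,
# moving backward earns a negative one.
forward = shaped_reward(0.0, "get_key", "open_door", phi, gamma=1.0)
backward = shaped_reward(0.0, "open_door", "get_key", phi, gamma=1.0)
```

In a sparse-reward task the environment reward is zero on most steps, so a bonus of this form gives the agent a learning signal whenever it advances along the plan. How well that works depends on the quality of the plan, which is where the verifier mentioned in the summary would come in.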
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: A2C · Q-Learning · Entropy Regularization · Proximal Policy Optimization
