Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning

Siddhant Bhambri; Amrita Bhattacharjee; Durgesh Kalwar; Lin Guan; Huan Liu; Subbarao Kambhampati

arXiv:2405.15194·cs.LG·October 10, 2024

Open Access

TL;DR

This paper explores using Large Language Models to generate heuristics for reward shaping in reinforcement learning, significantly improving sample efficiency across various domains by leveraging LLMs with or without a verifier.

Contribution

It introduces a novel method of extracting heuristics from LLMs to construct reward shaping functions, enhancing RL sample efficiency in sparse reward environments.

Findings

01. LLM-generated heuristics improve RL sample efficiency

02. Verifiers help assess heuristic quality

03. Significant gains across multiple RL algorithms and domains

Abstract

Reinforcement Learning (RL) suffers from sample inefficiency in sparse-reward domains, and the problem is more pronounced when transitions are stochastic. Reward shaping is a well-studied way to improve sample efficiency: it introduces intrinsic rewards that help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a wide range of natural language tasks, we aim to answer the following question: "Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?" To this end, we aim to leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We…
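To make the idea of reward shaping from a plan concrete, here is a minimal sketch using standard potential-based reward shaping, where the shaped reward is r + γΦ(s') − Φ(s). This is not necessarily the construction used in the paper: the abstract is truncated here, so the potential function, the example plan, and the names `potential_from_plan` and `shaped_reward` are all illustrative assumptions.

```python
from typing import Dict, Hashable, List


def potential_from_plan(plan: List[Hashable]) -> Dict[Hashable, float]:
    """Map each abstract state in a plan to a potential in (0, 1].

    Assumption: states later in the plan are closer to the goal, so they
    get a higher potential. If a state repeats, its last position wins.
    """
    n = len(plan)
    return {state: (i + 1) / n for i, state in enumerate(plan)}


def shaped_reward(
    env_reward: float,
    s: Hashable,
    s_next: Hashable,
    potential: Dict[Hashable, float],
    gamma: float = 0.99,
) -> float:
    """Return env_reward + gamma * phi(s_next) - phi(s).

    States that do not appear in the plan get potential 0.0.
    """
    phi = potential.get(s, 0.0)
    phi_next = potential.get(s_next, 0.0)
    return env_reward + gamma * phi_next - phi


# Hypothetical plan over abstract states, standing in for LLM output.
plan = ["start", "has_key", "door_open", "goal"]
potential = potential_from_plan(plan)

# Moving forward along the plan earns a positive shaping bonus even
# though the sparse environment reward is still zero.
bonus = shaped_reward(0.0, "start", "has_key", potential)
print(bonus)
```

Because the shaping term is a difference of potentials, it rewards progress along the plan without changing which policies are optimal in the underlying MDP, which is the usual reason for choosing this form.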

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: A2C · Q-Learning · Entropy Regularization · Proximal Policy Optimization