Uncertainty-aware Reward Design Process
Yang Yang, Xiaolu Zhou, Bosong Ding, Miao Xin

TL;DR
This paper introduces URDP, a framework that combines large language models with Bayesian optimization to design reward functions for reinforcement learning, aiming for higher-quality rewards at lower computational cost.
Contribution
URDP integrates uncertainty quantification and bi-level optimization to automate reward design, targeting two limitations of earlier LLM-based methods: weak numerical optimization and inefficient use of simulation resources by evolutionary search.
Findings
URDP produces higher-quality reward functions across diverse tasks.
It significantly reduces the computational resources needed for reward design.
URDP outperforms existing reward engineering approaches in efficiency and effectiveness.
Abstract
Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging process due to the inefficiencies and inconsistencies inherent in conventional reward engineering methodologies. Recent advances have explored leveraging large language models (LLMs) to automate reward function design. However, their suboptimal performance in numerical optimization often yields unsatisfactory reward quality, while the evolutionary search paradigm demonstrates inefficient utilization of simulation resources, resulting in prohibitively lengthy design cycles with disproportionate computational overhead. To address these challenges, we propose the Uncertainty-aware Reward Design Process (URDP), a novel framework that integrates large language models to streamline reward function design and evaluation in RL environments. URDP quantifies candidate reward function…
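The bi-level idea named under Contribution can be illustrated with a toy sketch. This is not the authors' implementation: the `simulate` function, the component names, and the "ideal" weights are all hypothetical, random search stands in for Bayesian optimization, and the spread of repeated noisy evaluations is only one assumed way to express uncertainty (the summary above does not say how URDP computes it).

```python
import random
import statistics

def simulate(structure, weights, rng):
    """Hypothetical stand-in for an RL training run; returns a noisy task score."""
    # Toy ground truth (an assumption for illustration only).
    ideal = {"distance": 1.0, "upright": 0.5, "energy": 0.1}
    error = sum((weights[c] - ideal[c]) ** 2 for c in structure)
    missing = sum(v for c, v in ideal.items() if c not in structure)
    return -error - missing + rng.gauss(0.0, 0.05)

def tune_weights(structure, rng, trials=30, repeats=3):
    """Inner level: random search over weights (stand-in for Bayesian optimization)."""
    best = None
    for _ in range(trials):
        weights = {c: rng.uniform(0.0, 2.0) for c in structure}
        scores = [simulate(structure, weights, rng) for _ in range(repeats)]
        mean = statistics.mean(scores)
        spread = statistics.stdev(scores)  # assumed uncertainty proxy
        if best is None or mean > best["mean"]:
            best = {"weights": weights, "mean": mean, "spread": spread}
    return best

def design_reward(candidates, seed=0):
    """Outer level: compare candidate reward structures by their tuned score."""
    rng = random.Random(seed)
    results = [(s, tune_weights(s, rng)) for s in candidates]
    return max(results, key=lambda r: r[1]["mean"])

# In an LLM-driven pipeline these structures would be proposed by the model;
# here they are hardcoded placeholders.
candidates = [
    ("distance",),
    ("distance", "upright"),
    ("distance", "upright", "energy"),
]
structure, result = design_reward(candidates)
print(structure, round(result["mean"], 3), round(result["spread"], 3))
```

The split mirrors the division of labor the abstract motivates: the outer level handles structural choices, while a numerical optimizer handles the weights, which is the part the abstract says LLMs do poorly.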
