Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning

Shentong Mo

arXiv:2604.13504·cs.LG·April 16, 2026

TL;DR

The paper introduces CoUR, a framework using large language models to improve reward function design in reinforcement learning, reducing evaluation costs and increasing robustness.

Contribution

It presents a novel reward design framework that leverages LLMs, code uncertainty quantification, and Bayesian optimization to enhance efficiency and performance.

Findings

01 CoUR achieves better performance across multiple RL environments.

02 It significantly reduces the cost of reward evaluations.

03 The framework demonstrates robustness in reward function design.

Abstract

Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on…
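
The similarity selection idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the 0.5 weighting, and the 0.8 reuse threshold are all assumptions made for illustration, and the "semantic" score here is a simple cosine over identifier counts standing in for whatever semantic analysis CoUR actually uses. The sketch scores a candidate reward component against already-evaluated ones and proposes reuse only when the combined score is high enough.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def textual_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def token_cosine(a: str, b: str) -> float:
    """Assumed stand-in for semantic similarity: cosine over identifier counts."""
    ta = Counter(re.findall(r"[A-Za-z_]\w*", a))
    tb = Counter(re.findall(r"[A-Za-z_]\w*", b))
    dot = sum(ta[t] * tb[t] for t in ta.keys() & tb.keys())
    norm = math.sqrt(sum(v * v for v in ta.values())) * math.sqrt(
        sum(v * v for v in tb.values())
    )
    return dot / norm if norm else 0.0


def combined_similarity(a: str, b: str, alpha: float = 0.5) -> float:
    """Weighted mix of the two scores; alpha is a hypothetical parameter."""
    return alpha * textual_similarity(a, b) + (1 - alpha) * token_cosine(a, b)


def select_reusable(candidate, evaluated, threshold=0.8, alpha=0.5):
    """Return (name, score) of the closest evaluated component, or None.

    `evaluated` maps a component name to its source code. A None result
    means nothing is similar enough, so the candidate needs a fresh evaluation.
    """
    best = None
    for name, code in evaluated.items():
        score = combined_similarity(candidate, code, alpha)
        if best is None or score > best[1]:
            best = (name, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

In this sketch a match lets the caller reuse a previously computed evaluation instead of running a new one, which is the kind of redundancy reduction the abstract describes. The threshold trades evaluation cost against the risk of reusing a component that only looks similar.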
