RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses
Feiyu Wu, Xu Zheng, Zhuocheng Wang, Yi ming Dai, Hui Li

TL;DR
RHyVE is a protocol for verifying and deploying LLM-generated reward hypotheses in reinforcement learning. It accounts for the competence of the current policy and the phase of training, with the aim of improving reward reliability and policy performance.
Contribution
The paper introduces RHyVE, a competence-aware verification and phase-aware deployment protocol for LLM-generated reward hypotheses. It addresses when such hypotheses can be reliably verified and deployed during policy optimization.
Findings
Reward rankings are unreliable at low policy competence but become informative after task-dependent thresholds.
Phase-aware deployment improves performance in sparse manipulation tasks.
The best-performing candidate within a reward pool can change depending on the training phase.
Abstract
Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focused primarily on generating, evolving, or selecting reward candidates, while paying less attention to when such candidates can be verified and deployed during policy optimization. We study this deployment-time problem by treating generated rewards as reward hypotheses whose utility depends on the competence of the current policy and the phase of training. We propose RHyVE, a competence-aware verification and phase-aware deployment protocol that compares small sets of reward hypotheses from shared policy checkpoints using short-horizon fork verification. Our experiments show that reward rankings are unreliable at low competence but become informative after task-dependent thresholds.…
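
The idea of short-horizon fork verification from a shared checkpoint can be illustrated with a small toy sketch. This is not the authors' implementation: the policy is a plain parameter vector, a simple hill-climber stands in for a few RL updates, and every name here (`short_horizon_train`, `fork_verify`, `rank`, `task_metric`, the two example hypotheses) is a hypothetical chosen for illustration. The sketch only shows the general pattern: fork the same checkpoint once per reward hypothesis, train briefly under each, then compare the forks on a common task metric.

```python
import random
from typing import Callable, Dict, List

Policy = List[float]                      # toy policy: a parameter vector (assumption)
RewardFn = Callable[[Policy], float]      # a reward hypothesis scores a policy


def short_horizon_train(policy: Policy, reward_fn: RewardFn, steps: int,
                        rng: random.Random, step_size: float = 0.1) -> Policy:
    """Toy stand-in for a few RL updates: hill-climb on reward_fn."""
    current = list(policy)                # copy so the shared checkpoint is untouched
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, step_size) for p in current]
        if reward_fn(candidate) > reward_fn(current):
            current = candidate
    return current


def fork_verify(checkpoint: Policy, hypotheses: Dict[str, RewardFn],
                task_metric: RewardFn, steps: int = 50,
                seed: int = 0) -> Dict[str, float]:
    """Fork one checkpoint per hypothesis, train briefly, score on the task metric."""
    scores = {}
    for name, reward_fn in hypotheses.items():
        rng = random.Random(seed)         # identical noise stream for every fork
        forked = short_horizon_train(checkpoint, reward_fn, steps, rng)
        scores[name] = task_metric(forked)
    return scores


def rank(scores: Dict[str, float]) -> List[str]:
    """Order hypotheses from best to worst task-metric score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical example: the task wants every parameter near +1.
def task_metric(policy: Policy) -> float:
    return -sum((p - 1.0) ** 2 for p in policy)


hypotheses = {
    "aligned": lambda policy: -sum((p - 1.0) ** 2 for p in policy),
    "misaligned": lambda policy: -sum((p + 1.0) ** 2 for p in policy),
}

checkpoint = [0.0, 0.0]
scores = fork_verify(checkpoint, hypotheses, task_metric)
print(rank(scores), scores)
```

Forking every hypothesis from the same checkpoint with the same seed keeps the comparison about the reward itself rather than about differences in starting point or noise. The sketch does not model the competence dependence described in the abstract; a fuller version would repeat `fork_verify` at checkpoints of increasing competence and check whether the ranking stabilizes.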
