Discounted Beta--Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Haechan Kim; Soohyun Ryu; Gyouk Chu; Doohyuk Jang; Eunho Yang

arXiv:2603.18444 · cs.LG · March 20, 2026

TL;DR

This paper introduces a novel reward estimation method for reinforcement learning with verifiable rewards, reducing variance and improving sample efficiency in large language model reasoning tasks.

Contribution

It proposes Discounted Beta--Bernoulli (DBB) reward estimation, a biased but variance-stabilizing approach that enhances reward distribution estimation from limited data.
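The phrase "biased but variance-stabilizing" can be illustrated with the textbook Beta-Bernoulli posterior mean. The derivation below is a sketch under assumptions that this summary does not state: $n$ i.i.d. Bernoulli rewards $r_1, \dots, r_n$ with success probability $p$, a fixed $\mathrm{Beta}(\alpha_0, \beta_0)$ prior, and no discounting. The symbols $\alpha_0$, $\beta_0$, $\hat{p}$, and $\tilde{p}$ are introduced here for illustration, and the paper's exact estimator may differ.

```latex
% Point estimate from n rollouts (unbiased, high variance for small n)
\hat{p} = \frac{1}{n}\sum_{i=1}^{n} r_i,
\qquad
\mathbb{E}[\hat{p}] = p,
\qquad
\operatorname{Var}[\hat{p}] = \frac{p(1-p)}{n}

% Beta--Bernoulli posterior mean with prior pseudo-counts \alpha_0, \beta_0
\tilde{p} = \frac{\alpha_0 + \sum_{i=1}^{n} r_i}{\alpha_0 + \beta_0 + n}

% Bias: the estimate is pulled toward the prior mean
\mathbb{E}[\tilde{p}] - p = \frac{\alpha_0 (1-p) - \beta_0\, p}{\alpha_0 + \beta_0 + n}

% Variance: strictly smaller than that of \hat{p} when \alpha_0 + \beta_0 > 0 and 0 < p < 1
\operatorname{Var}[\tilde{p}] = \frac{n\, p(1-p)}{(\alpha_0 + \beta_0 + n)^2}
< \frac{p(1-p)}{n}
```

Under these assumptions, the pseudo-counts introduce a bias of order $1/n$ while shrinking the variance, which is the trade-off the contribution statement describes.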

Findings

01. DBB reduces estimation variance and avoids variance collapse.

02. Experiments show significant accuracy improvements on multiple benchmarks.

03. The method improves sample efficiency without extra computational cost.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta--Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution.…
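The idea of reusing historical reward statistics for a non-stationary reward distribution can be sketched with a generic discounted Beta-Bernoulli update. This is a minimal illustration and not the paper's implementation: the class name, the uniform Beta(1, 1) prior, the `discount` value of 0.9, and the choice to subtract the post-update posterior mean as the baseline are all assumptions made here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscountedBetaBernoulli:
    """Hypothetical per-prompt estimator of the success probability."""
    alpha: float = 1.0     # pseudo-count of successes (assumed uniform prior)
    beta: float = 1.0      # pseudo-count of failures (assumed uniform prior)
    discount: float = 0.9  # assumed forgetting factor in (0, 1]

    def update(self, rewards: List[int]) -> None:
        # Decay old counts so the estimate can track a changing policy,
        # then add the binary rewards from the newest rollouts.
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.discount * self.alpha + successes
        self.beta = self.discount * self.beta + failures

    def mean(self) -> float:
        # Posterior mean of the Beta distribution.
        return self.alpha / (self.alpha + self.beta)


def group_mean_advantages(rewards: List[int]) -> List[float]:
    # Point-estimate baseline: the mean reward of the current group only.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def dbb_advantages(rewards: List[int],
                   estimator: DiscountedBetaBernoulli) -> List[float]:
    # Assumed variant: update first, then use the posterior mean as baseline.
    estimator.update(rewards)
    baseline = estimator.mean()
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    rollouts = [1, 1, 1, 1]  # every rollout in the group was verified correct
    print(group_mean_advantages(rollouts))
    print(dbb_advantages(rollouts, DiscountedBetaBernoulli()))
```

When every rollout in a group receives the same reward, the group-mean baseline yields all-zero advantages, which is the variance collapse the abstract mentions. In this sketch the discounted posterior mean stays strictly between 0 and 1, so the same group still produces a nonzero signal.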

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics