Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Kai Yan; Alexander G. Schwing; Yu-Xiong Wang

arXiv:2605.15012·cs.LG·May 15, 2026

TL;DR

FEST is a few-shot demonstration-guided RLVR method for training large language models. It targets the sample-efficiency problem RLVR has on difficult problems, using only 128 demonstrations randomly selected from an SFT dataset.

Contribution

The paper introduces FEST, which attains compelling results with minimal demonstration data, reducing the need for large supervised fine-tuning datasets.

Findings

1. FEST outperforms baselines while using less demonstration data.
2. FEST matches the performance of models trained on full datasets.
3. Three components are vital: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., conducting Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital to its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training.…
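The three components named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`select_few_shot`, `sft_weight`, `combined_loss`), the cosine decay schedule, and the simple additive combination of losses are all assumptions made for illustration, since the abstract does not specify how the weights decay or how the signals are combined. Only the figure of 128 randomly selected demonstrations comes from the source.

```python
import math
import random


def select_few_shot(sft_dataset, k=128, seed=0):
    """Pick k demonstrations uniformly at random, without replacement.

    k=128 follows the abstract; the seed argument is a hypothetical
    convenience for reproducibility.
    """
    rng = random.Random(seed)
    return rng.sample(sft_dataset, k)


def sft_weight(step, total_steps, w0=1.0):
    """Hypothetical decaying weight for the supervised term.

    Assumes a cosine decay from w0 at step 0 to 0 at total_steps.
    The actual schedule used by FEST is not given in the abstract.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w0 * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """On-policy RL loss plus a supervised loss scaled by the decaying weight.

    The additive form is an assumption chosen for simplicity.
    """
    return rl_loss + sft_weight(step, total_steps) * sft_loss
```

Under these assumptions, the supervised term dominates early, when correct rollouts are rare, and fades as training proceeds, which is one plausible way to limit overfitting to a small demonstration set that is revisited over many epochs.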

Code & Models

Repositories

kaiyan289/FEST (GitHub)

Datasets

aoiandroid/papers (dataset, 28 downloads)