Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement

Yunjian Zhang; Sudong Wang; Yang Li; Peiran Xu; Conghao Zhou; Xiaoyue Ma; Jianing Li; Yao Zhu

arXiv:2602.00815·cs.AI·May 5, 2026


TL;DR

This paper introduces DoPR, a resource-efficient reinforcement learning method for large language models that cuts training cost by dynamically selecting informative samples while maintaining reasoning performance.

Contribution

The paper establishes a theoretical lower bound on the sample complexity of unlocking reasoning in LLMs and proposes DoPR, an uncertainty-aware RL strategy that reduces rollout overhead by nearly tenfold.

Findings

01. DoPR reduces rollout overhead nearly tenfold.

02. Strong reasoning performance is achieved with a small number of training instances.

03. Theoretical lower bounds inform efficient training strategies.

Abstract

Large language models (LLMs) have exhibited remarkable performance on complex reasoning tasks, with reinforcement learning under verifiable rewards (RLVR) emerging as a principled framework for aligning model behavior with reasoning chains. Despite its promise, RLVR remains prohibitively resource-intensive, requiring extensive reward signals and incurring substantial rollout costs during training. In this work, we revisit the fundamental question of data and compute efficiency in RLVR. We first establish a theoretical lower bound on the sample complexity required to unlock reasoning capabilities, and empirically validate that strong performance can be achieved with a surprisingly small number of training instances. To tackle the computational burden, we propose Dynamic One-Shot Policy Refinement (DoPR), an uncertainty-aware RL strategy that dynamically selects a single informative…
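
As a rough illustration of what "uncertainty-aware" single-sample selection could look like, the sketch below picks one prompt per update from a pool. It is not the paper's algorithm: the abstract excerpt does not state DoPR's selection criterion, so the scoring rule (Bernoulli variance of past binary rewards) and the names `bernoulli_variance`, `select_one_shot`, and `reward_history` are all assumptions made for this example.

```python
from typing import Dict, List


def bernoulli_variance(rewards: List[float]) -> float:
    """Variance p * (1 - p) of binary rewards.

    Assumption: used here as a stand-in uncertainty proxy; the paper's
    actual criterion is not given in the excerpt.
    """
    if not rewards:
        # Assumption: a prompt with no rollouts yet gets the maximum score.
        return 0.25
    p = sum(rewards) / len(rewards)
    return p * (1.0 - p)


def select_one_shot(reward_history: Dict[str, List[float]]) -> str:
    """Return the single prompt id whose recorded rewards are most uncertain."""
    if not reward_history:
        raise ValueError("reward_history must contain at least one prompt")
    return max(reward_history, key=lambda pid: bernoulli_variance(reward_history[pid]))


# Hypothetical reward logs: 1 = verifier accepted the answer, 0 = rejected.
history = {
    "easy": [1, 1, 1, 1],        # always solved, little left to learn
    "hard": [0, 0, 0, 0],        # never solved, no reward signal
    "borderline": [1, 0, 1, 0],  # mixed outcomes, highest variance
}
chosen = select_one_shot(history)
print(chosen)
```

Under this proxy, prompts the policy always or never solves score zero, so rollouts would be spent only on the prompt with mixed outcomes. Generating rollouts for one selected prompt instead of a full batch is one way a large reduction in rollout cost could arise, though the mechanism actually used in DoPR may differ.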
