Data-Efficient RLVR via Off-Policy Influence Guidance

Erle Zhu; Dazhi Jiang; Yuan Wang; Xujun Li; Jiale Cheng; Yuxian Gu; Yilin Niu; Aohan Zeng; Jie Tang; Minlie Huang; Hongning Wang

arXiv:2510.26491 · cs.LG · April 16, 2026

TL;DR

This paper introduces CROPI, a theoretically grounded, influence-based data selection method for RLVR that accelerates the training of large language models by efficiently identifying the most influential data points.

Contribution

The work presents a novel off-policy influence estimation technique combined with sparse random projection, enabling efficient data selection for RLVR with large language models.

Findings

01 CROPI accelerates training by 2.66x on a 1.5B model.

02 CROPI uses only 10% of the data per stage while reaching comparable performance.

03 CROPI significantly reduces training time while maintaining effectiveness.

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based and lack theoretical guarantees and generalizability. This work proposes a theoretically grounded approach that uses influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of the policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop…
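The two techniques named in the abstract, scoring data by gradient-based influence and compressing gradients with sparse random projection, can be illustrated with a small sketch. This is not the authors' implementation. It assumes that per-example gradients have already been computed from pre-collected trajectories and are passed in as plain lists of floats, that influence is scored as cosine similarity between a projected training gradient and a projected target gradient, and that the top fraction of examples is kept. All function names (make_sparse_projection, project, cosine, select_top_fraction) and the parameters density and fraction are hypothetical.

```python
import math
import random


def make_sparse_projection(dim, proj_dim, density=0.1, seed=0):
    """Build a sparse random projection as per-row lists of (column, value).

    Each entry is nonzero with probability `density`. Nonzero values are
    +/- 1/sqrt(density * proj_dim), so squared norms are preserved in
    expectation. (Assumed construction, not taken from the paper.)
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(density * proj_dim)
    rows = []
    for _ in range(proj_dim):
        row = [
            (j, scale if rng.random() < 0.5 else -scale)
            for j in range(dim)
            if rng.random() < density
        ]
        rows.append(row)
    return rows


def project(rows, vec):
    """Map a full-size gradient to proj_dim dimensions, touching only nonzeros."""
    return [sum(value * vec[j] for j, value in row) for row in rows]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_top_fraction(train_grads, target_grad, rows, fraction=0.1):
    """Score each training gradient against the target and keep the top fraction.

    Returns (indices of selected examples, list of all scores).
    """
    target = project(rows, target_grad)
    scores = [cosine(project(rows, g), target) for g in train_grads]
    k = max(1, int(len(train_grads) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores
```

In this sketch a single projection is built once and reused for every gradient, so only the low-dimensional projected vectors would need to be stored and compared. The default fraction=0.1 mirrors the 10% per-stage figure reported above, but how CROPI actually scores and schedules data is not specified in this summary.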
