Auditing Data Membership in Reinforcement Learning With Verifiable Rewards
Yule Liu, Heyi Zhang, Jinyi Zheng, Zhen Sun, Zifan Peng, Jiaheng Wei, Tianshuo Cong, Yilong Yang, Xinlei He

TL;DR
This paper introduces DIBA (Divergence-in-Behavior Auditing), a white-box auditing framework for RLVR that detects data exposure by analyzing behavioral shifts between a fine-tuned model and its pre-RLVR checkpoint, and that outperforms existing likelihood-based methods.
Contribution
The paper proposes DIBA, a query-level auditing method for RLVR that uses behavioral traces to detect data membership. It addresses a limitation of prior membership inference attacks, which detect fitting to a fixed target string and therefore do not apply to RLVR.
Findings
DIBA achieves around 0.8 AUC in white-box settings.
DIBA outperforms likelihood-based baselines significantly.
Auditing effectiveness varies with prompt-specific traces and model performance.
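To make the AUC figure above concrete, here is a minimal sketch of how an AUC is computed for any query-level membership audit: it is the probability that a randomly chosen training prompt receives a higher audit score than a randomly chosen held-out prompt. The function name `audit_auc` and the score values are hypothetical and chosen only for illustration; they are not DIBA scores or results from the paper.

```python
from itertools import product


def audit_auc(member_scores, nonmember_scores):
    """Return the AUC of an audit score: the fraction of (member, non-member)
    pairs in which the member scores higher, with ties counted as half."""
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m, n in product(member_scores, nonmember_scores):
        if m > n:
            wins += 1.0
        elif m == n:
            wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical audit scores, made up for illustration (not from the paper).
members = [0.9, 0.7, 0.6, 0.4]      # prompts assumed to be in the RLVR training set
nonmembers = [0.8, 0.5, 0.3, 0.2]   # prompts assumed to be held out

print(audit_auc(members, nonmembers))  # 12 of 16 pairs are ordered correctly -> 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value around 0.8 means roughly four out of five member/non-member pairs are ranked correctly.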
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a core training stage in recent large language models (LLMs). Its reliance on non-public, high-value prompt sets raises concerns about unauthorized data use, creating a need for exposure auditing. A natural tool is membership inference attacks (MIAs), but existing methods detect fitting to a fixed target string. This does not apply to RLVR, which generates responses from the model itself and reinforces successful ones, thus hindering the auditing of data exposure. We show that exposure remains detectable: RLVR reshapes the model's response distribution on training prompts, producing behavioral traces that can be surfaced through targeted auditing. We propose Divergence-in-Behavior Auditing (DIBA), a white-box query-level auditing framework for RLVR. DIBA compares a fine-tuned model against its pre-RLVR checkpoint along two…
