Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards

Zhen Wang; Zhifeng Gao; Guolin Ke

arXiv:2511.17473·cs.CL·November 24, 2025

TL;DR

This paper introduces MR-RLVR, a self-supervised reinforcement learning approach that uses process-level signals from intermediate reasoning steps to improve mathematical reasoning in language models, particularly when only final outcomes are verifiable.

Contribution

It proposes a process-aware self-supervised training method that combines masked-then-fill and step-reordering tasks to enhance RLVR for mathematical reasoning.

Findings

01. Achieves +9.86% Pass@1 improvement over original RLVR

02. Demonstrates effectiveness on multiple mathematical benchmarks

03. Enhances scalability and performance in outcome-verifiable settings

Abstract

Test-time scaling has been shown to substantially improve the mathematical reasoning of large language models (LLMs). However, for a large portion of mathematical corpora, especially theorem proving, the scalability of reinforcement learning from verifiable rewards (RLVR) is limited: intermediate reasoning is crucial, while final answers are difficult to verify directly and reliably. Meanwhile, token-level supervised fine-tuning (SFT) often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets…
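
To make the two self-supervised tasks named in the abstract concrete, here is a minimal Python sketch of how "masked-then-fill" and "step reordering" examples might be built from a list of reasoning steps, together with a simple verifiable reward for each. This is an illustration only, not the authors' implementation: the `<MASK>` placeholder, the function names, the exact-match fill reward, and the position-wise reordering reward are all assumptions, and the paper may mask, score, or format these tasks differently.

```python
import random
from typing import List, Tuple

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def make_mask_task(steps: List[str], idx: int) -> Tuple[List[str], str]:
    """Hide step `idx`; return the masked step list and the hidden target."""
    masked = list(steps)
    target = masked[idx]
    masked[idx] = MASK
    return masked, target


def fill_reward(prediction: str, target: str) -> float:
    """Assumed reward: 1.0 on a whitespace-normalized exact match, else 0.0."""
    def normalize(s: str) -> str:
        return " ".join(s.split())
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


def make_reorder_task(steps: List[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle the steps; return them with the ordering that restores the original.

    answer[i] is the index in `shuffled` of the i-th original step.
    """
    perm = list(range(len(steps)))
    rng.shuffle(perm)
    shuffled = [steps[p] for p in perm]
    answer = [perm.index(i) for i in range(len(steps))]
    return shuffled, answer


def reorder_reward(predicted: List[int], answer: List[int]) -> float:
    """Assumed reward: fraction of positions where the predicted order is correct."""
    if not answer or len(predicted) != len(answer):
        return 0.0
    return sum(p == a for p, a in zip(predicted, answer)) / len(answer)


if __name__ == "__main__":
    steps = ["Let x = 2.", "Then x^2 = 4.", "So x^2 + 1 = 5."]
    masked, target = make_mask_task(steps, 1)
    print(masked, "->", target)
    shuffled, answer = make_reorder_task(steps, random.Random(0))
    print(shuffled, "->", answer)
```

Both rewards can be computed automatically from the source text, which is what would let intermediate reasoning supply a training signal even when the final answer is hard to verify. A graded reordering reward is used here only to show one way of giving partial credit; an all-or-nothing score would be an equally plausible choice.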

Taxonomy

Topics: Topic Modeling · Model Reduction and Neural Networks · Machine Learning in Materials Science