Masked-and-Reordered Self-Supervision for Reinforcement Learning from Verifiable Rewards
Zhen Wang, Zhifeng Gao, Guolin Ke

TL;DR
This paper introduces MR-RLVR, a novel self-supervised reinforcement learning approach that leverages process-level signals to improve mathematical reasoning in language models, especially when only outcomes are verifiable.
Contribution
It proposes a process-aware self-supervised training method with masked-then-fill and step reordering, enhancing RLVR for mathematical reasoning tasks.
Findings
Achieves a +9.86% Pass@1 improvement over the original RLVR baseline
Demonstrates effectiveness on multiple mathematical benchmarks
Enhances scalability and performance in outcome-verifiable settings
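The findings above are reported in terms of Pass@1. The summary does not say how the metric was estimated or whether the +9.86% figure is a relative or an absolute gain, so the sketch below only illustrates one common convention: the unbiased pass@k estimator computed from n sampled answers of which c are correct, plus a relative-gain helper. The function names `pass_at_k` and `relative_gain` are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the paper's exact protocol is not
    given in this summary.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement in percent (assumes the gain is relative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0
```

With k = 1 the estimator reduces to the fraction of correct samples, c / n, which is why Pass@1 is often read as plain single-sample accuracy.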
Abstract
Test-time scaling has been shown to substantially improve large language models' (LLMs) mathematical reasoning. However, for a large portion of mathematical corpora, especially theorem proving, the scalability of reinforcement learning from verifiable rewards (RLVR) is limited: intermediate reasoning is crucial, while final answers are difficult to directly and reliably verify. Meanwhile, token-level supervised fine-tuning (SFT) often degenerates into rote memorization rather than inducing longer chains of thought. Inspired by BERT's self-supervised tasks, we propose MR-RLVR (Masked-and-Reordered RLVR), which constructs process-level self-supervised rewards via "masked-then-fill" and "step reordering" to extract learnable signals from intermediate reasoning. Our training pipeline comprises two stages: we first perform self-supervised training on sampled mathematical calculation and proof data; we then conduct RLVR fine-tuning on mathematical calculation datasets…
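The two self-supervised tasks named in the abstract can be illustrated with a minimal sketch. It assumes a solution is already split into a list of step strings, and it uses deliberately simple rewards (exact match for the filled step, per-position accuracy for the recovered order). The abstract does not specify the paper's actual masking strategy or reward definitions, so `make_masked_task`, `fill_reward`, `make_reorder_task`, and `reorder_reward` are hypothetical names and stand-in logic.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # placeholder token; an assumption, not the paper's format


def make_masked_task(steps: Sequence[str], mask_index: int) -> Tuple[List[str], str]:
    """Hide one reasoning step; return (steps with a mask, hidden step)."""
    masked = list(steps)
    target = masked[mask_index]
    masked[mask_index] = MASK
    return masked, target


def fill_reward(prediction: str, target: str) -> float:
    """Stand-in reward: 1.0 if the prediction matches the hidden step
    after whitespace normalization, else 0.0."""
    def normalize(text: str) -> str:
        return " ".join(text.split())
    return 1.0 if normalize(prediction) == normalize(target) else 0.0


def make_reorder_task(steps: Sequence[str], rng: random.Random) -> Tuple[List[str], List[int]]:
    """Shuffle the steps; perm[j] is the original index of shuffled[j]."""
    perm = list(range(len(steps)))
    rng.shuffle(perm)
    shuffled = [steps[i] for i in perm]
    return shuffled, perm


def reorder_reward(predicted: Sequence[int], true_perm: Sequence[int]) -> float:
    """Stand-in reward: fraction of shuffled positions whose original
    index is predicted correctly; 0.0 for an invalid permutation."""
    if not true_perm:
        return 0.0
    if sorted(predicted) != list(range(len(true_perm))):
        return 0.0
    hits = sum(p == t for p, t in zip(predicted, true_perm))
    return hits / len(true_perm)
```

Both rewards are checkable from the source text alone, with no final-answer verifier, which is the property the abstract relies on for proof-style data. A practical system would likely need a softer match than exact string equality for the fill task.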
Taxonomy
Topics: Topic Modeling · Model Reduction and Neural Networks · Machine Learning in Materials Science
