From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

Yuqiao Tan; Minzheng Wang; Bo Liu; Zichen Liu; Tian Liang; Shizhu He; Jun Zhao; Kang Liu

arXiv:2604.14142·cs.LG·April 16, 2026

TL;DR

This paper introduces PreRL, which applies reward-driven reinforcement learning updates in pre-train space, and DSRL, a dual space strategy built on it. Together they improve reasoning and exploration in large language models.

Contribution

The paper proposes an RL approach that operates in pre-train space, combining negative sample reinforcement with a dual space strategy to enhance reasoning and exploration in language models.

Findings

01. NSR-PreRL increases reasoning transitions and reflections by more than 14 times and 6 times, respectively.

02. DSRL outperforms strong baselines on reasoning tasks.

03. Pruning in pre-train space effectively guides the policy toward the correct reasoning subspace.

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample…
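To make the distinction between the two objectives concrete, the following is a minimal sketch, not the authors' implementation. It assumes a REINFORCE-style surrogate loss and a made-up toy bigram model; the names `LOGITS`, `sequence_logprob`, and `surrogate_loss`, the sample data, and the reward values are all hypothetical. The only difference between the two losses is whether the response y is scored with the prompt x in context (log P(y|x)) or without it (log P(y)).

```python
import math

# Hypothetical toy bigram "language model": LOGITS[prev][next] over a
# 3-token vocabulary. Token 0 doubles as a begin-of-sequence marker.
# All numbers are invented for illustration.
LOGITS = [
    [0.0, 1.0, -1.0],
    [0.5, 0.0, 0.5],
    [-1.0, 2.0, 0.0],
]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def sequence_logprob(response, context):
    """Sum of token log-probs of `response`, starting from `context`.

    An empty context means the response is scored on its own, which
    stands in for the marginal log P(y) in this toy setting.
    """
    prev = context[-1] if context else 0
    total = 0.0
    for tok in response:
        total += log_softmax(LOGITS[prev])[tok]
        prev = tok
    return total


def surrogate_loss(samples, use_prompt):
    """REINFORCE-style surrogate (an assumption): -mean(reward * log-prob)."""
    terms = []
    for prompt, response, reward in samples:
        ctx = prompt if use_prompt else []
        terms.append(reward * sequence_logprob(response, ctx))
    return -sum(terms) / len(terms)


# (prompt x, response y, verifiable reward); the negative reward marks an
# incorrect sample. Values are illustrative only.
samples = [([1, 2], [1, 0], 1.0), ([1], [2, 2], -1.0)]

post_train_loss = surrogate_loss(samples, use_prompt=True)   # uses log P(y|x)
pre_train_loss = surrogate_loss(samples, use_prompt=False)   # uses log P(y)
```

In a real LLM the two scores would come from two forward passes over the same sampled response, one with the prompt tokens prepended and one without. How the paper weights, clips, or normalizes these terms is not specified in the abstract and is not modeled here.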


Code & Models

Repositories

trae1oung/Pretrain_Space_RLVR (GitHub)
