Reinforcement Pre-Training
Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei

TL;DR
Reinforcement Pre-Training (RPT) is a scalable RL-based pre-training approach that frames next-token prediction as a reasoning task, improving next-token prediction accuracy and providing a strong foundation for further RL fine-tuning.
Contribution
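RPT proposes a scaling paradigm that applies RL to language model pre-training: the model reasons about the next token and receives a verifiable reward when its prediction is correct, so large text corpora can be used for RL without domain-specific annotated answers.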
Findings
RPT significantly improves next-token prediction accuracy.
Increasing training compute consistently improves next-token prediction accuracy.
RPT provides a robust foundation for reinforcement fine-tuning.
Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where it receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling…
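To make the idea of a verifiable reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible rule: a prediction earns a reward of 1.0 when it exactly matches the token that follows in the corpus, and 0.0 otherwise. The function names `next_token_reward` and `mean_reward` are hypothetical, and the paper's actual reward definition may differ.

```python
from typing import Sequence


def next_token_reward(predicted: str, ground_truth: str) -> float:
    """Assumed rule: 1.0 if the predicted token equals the corpus token, else 0.0."""
    return 1.0 if predicted == ground_truth else 0.0


def mean_reward(predictions: Sequence[str], ground_truth: str) -> float:
    """Average the reward over several sampled predictions for one context."""
    if not predictions:
        raise ValueError("need at least one prediction")
    total = sum(next_token_reward(p, ground_truth) for p in predictions)
    return total / len(predictions)


if __name__ == "__main__":
    # Hypothetical context: "The capital of France is" with corpus continuation " Paris".
    samples = [" Paris", " Lyon", " Paris", " the"]
    print(mean_reward(samples, " Paris"))
```

Because the reward is computed directly from the text itself, any corpus can supply training signal without extra annotation, which is the property the abstract highlights.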
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
