OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Zengzhi Wang; Fan Zhou; Xuefeng Li; Pengfei Liu

arXiv:2506.20512·cs.CL·June 26, 2025

Open Access · 1 Repo · 10 Models · 1 Dataset

TL;DR

This paper investigates how mid-training strategies and data quality influence reinforcement learning performance in language models, introducing OctoThinker, which achieves improved RL compatibility through a novel two-stage training approach.

Contribution

The paper presents a two-stage mid-training strategy, Stable-then-Decay, that improves RL performance, and introduces the OctoThinker models, which show better RL compatibility and reasoning capabilities.

Findings

01 High-quality math corpora improve RL performance.

02 Adding chain-of-thought data enhances reasoning and RL outcomes.

03 Scaling mid-training improves downstream RL performance.

Abstract

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect;…

Code & Models

Repositories

gair-nlp/octothinker
PyTorch · Official

Models

Datasets

OctoThinker/MegaMath-Web-Pro-Max
dataset · 4.4k downloads

Taxonomy

Topics: Reinforcement Learning in Robotics

Methods: LLaMA · Balanced Selection