OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu

TL;DR
This paper investigates how mid-training strategies and data quality influence reinforcement learning performance in language models, and introduces OctoThinker, a model family made more RL-compatible through a two-stage mid-training approach.
Contribution
The paper presents Stable-then-Decay, a two-stage mid-training strategy that improves RL performance, and introduces the OctoThinker models, which show better RL compatibility and reasoning capabilities.
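This summary does not spell out how the two stages differ. One common reading of a name like Stable-then-Decay is a learning-rate schedule that stays constant for a long first stage and then decays in a shorter second stage. The sketch below illustrates only that reading: the function name, the cosine decay shape, and every numeric value are assumptions made for illustration, not details taken from the paper.

```python
import math


def stable_then_decay_lr(step, stable_steps, decay_steps,
                         peak_lr=3e-4, final_lr=3e-5):
    """Hypothetical two-stage schedule (all defaults are illustrative).

    Stage 1 (stable): hold the learning rate at peak_lr.
    Stage 2 (decay): cosine-anneal from peak_lr down to final_lr.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if decay_steps <= 0:
        raise ValueError("decay_steps must be positive")

    # Stage 1: constant learning rate.
    if step < stable_steps:
        return peak_lr

    # Stage 2: fraction of the decay stage completed, clamped to 1.0
    # so steps past the end of the schedule stay at final_lr.
    progress = min((step - stable_steps) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


if __name__ == "__main__":
    # Print the schedule at a few points: 100 stable steps, 50 decay steps.
    for s in (0, 99, 100, 125, 150):
        print(s, stable_then_decay_lr(s, stable_steps=100, decay_steps=50))
```

In a setup like this, the stable stage would typically cover the bulk of the mid-training tokens, and the decay stage is where a different data mixture (for example, more chain-of-thought data) could be introduced. Whether the paper does exactly this is not stated in the summary above.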
Findings
High-quality math corpora improve RL performance.
Adding chain-of-thought data enhances reasoning and RL outcomes.
Scaling mid-training improves downstream RL performance.
Abstract
Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect;…
Code & Models
- 🤗 OctoThinker/OctoThinker-1B-Long-Base (model) · 3 dl
- 🤗 OctoThinker/OctoThinker-1B-Short-Base (model) · 24 dl
- 🤗 OctoThinker/OctoThinker-1B-Hybrid-Base (model) · 76 dl · ♡ 1
- 🤗 OctoThinker/OctoThinker-3B-Long-Base (model) · 11 dl · ♡ 1
- 🤗 OctoThinker/OctoThinker-3B-Hybrid-Base (model) · 5.3k dl · ♡ 1
- 🤗 OctoThinker/OctoThinker-3B-Short-Base (model) · 20 dl
- 🤗 OctoThinker/OctoThinker-3B-Long-Zero (model) · 10 dl
- 🤗 OctoThinker/OctoThinker-1B-Short-Zero (model) · 10 dl
- 🤗 OctoThinker/OctoThinker-1B-Hybrid-Zero (model) · 1 dl
- 🤗 OctoThinker/OctoThinker-1B-Long-Zero (model) · 5 dl
Taxonomy
Topics: Reinforcement Learning in Robotics
Methods: LLaMA · Balanced Selection
