Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao

TL;DR
This paper studies how to scale reasoning token budgets for competitive programming models through reinforcement learning at training time and parallel thinking at test time, with the combined system outperforming GPT-5-high on a set of hard problems.
Contribution
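It introduces a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement, along with RL training strategies (verification RL warmup, randomized clipping, and end-to-end training on the pipeline) for scaling reasoning tokens.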
Findings
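During RL training, validation accuracy follows an approximately log-linear relationship with the average number of generated reasoning tokens across successive checkpoints.
Verification RL warmup raises the starting point of the accuracy-versus-tokens trajectory, while randomized clipping produces a steeper trend in the observed regime.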
The full system surpasses GPT-5-high on 456 hard problems with fewer tokens.
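To make the log-linear finding concrete, the sketch below fits accuracy ≈ a + b · ln(tokens) by ordinary least squares. The function name `fit_log_linear` and the checkpoint values are hypothetical and synthetic; they are not taken from the paper, and the paper may fit or plot the trend differently.

```python
import math

def fit_log_linear(tokens, accuracy):
    """Least-squares fit of accuracy ~ a + b * ln(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    b = sxy / sxx            # slope: accuracy gained per unit of ln(tokens)
    a = mean_y - b * mean_x  # intercept: the "starting point" of the trend
    return a, b

# Synthetic, illustrative checkpoints (NOT data from the paper).
tokens = [8000, 16000, 32000, 64000]
accuracy = [0.10 + 0.05 * math.log(t) for t in tokens]
a, b = fit_log_linear(tokens, accuracy)
```

Under this reading, a method that raises the starting point would show up as a larger `a`, and a method that steepens the trend would show up as a larger `b`.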
Abstract
We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time…
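The control flow of a multi-round pipeline with threads and rounds of generation, verification, and refinement might look roughly like the sketch below. Everything here is an assumption for illustration: the function `parallel_thinking`, its parameters, and the toy `generate`, `verify`, and `refine` callables are hypothetical stand-ins, and the paper's actual pipeline, aggregation rule, and model calls are not specified in this summary.

```python
def parallel_thinking(problem, generate, verify, refine, num_threads, num_rounds):
    """Hypothetical sketch: several candidates are generated, then verified
    and refined over a fixed number of rounds; the best-scoring one is returned."""
    # Generation: one candidate per thread (run sequentially here for simplicity;
    # a real system would sample these in parallel).
    candidates = [generate(problem, i) for i in range(num_threads)]
    for _ in range(num_rounds):
        # Verification: score every candidate.
        scored = [(verify(problem, c), c) for c in candidates]
        # Refinement: revise each candidate using its verification score.
        candidates = [refine(problem, c, s) for s, c in scored]
    # Final selection by verification score (an assumed aggregation rule).
    return max(candidates, key=lambda c: verify(problem, c))

# Toy stand-ins: the "problem" is a target integer and candidates are guesses.
def toy_generate(problem, thread_id):
    return thread_id * 3

def toy_verify(problem, candidate):
    return -abs(problem - candidate)  # higher is better

def toy_refine(problem, candidate, score):
    if candidate < problem:
        return candidate + 1
    if candidate > problem:
        return candidate - 1
    return candidate

best = parallel_thinking(10, toy_generate, toy_verify, toy_refine,
                         num_threads=4, num_rounds=2)
```

In this toy setup the total budget is split across `num_threads` candidates and `num_rounds` refinement passes rather than spent on one long generation, which is the trade-off the abstract describes.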
