Zero Reinforcement Learning Towards General Domains
Yuyuan Zeng, Yufei Huang, Can Xu, Qingfeng Sun, Jianfeng Yan, Guanghui Xu, Tao Yang, Fengzong Lian

TL;DR
This paper introduces a zero reinforcement learning (zero-RL) paradigm that improves language models' reasoning in both verifiable and non-verifiable domains by combining verifiable rewards with a generative reward model in multi-task training.
Contribution
The paper proposes a new zero-RL approach that transfers reasoning skills across diverse domains using combined rewards and multi-task training, addressing the challenge of non-verifiable reasoning tasks.
Findings
Reports superior performance on both reasoning-intensive and general tasks.
Demonstrates effectiveness on Qwen3-8B-Base and Qwen3-14B-Base models.
Introduces a smooth length penalty to reduce reward hacking in generative reward models.
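The reward design summarized above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the summary does not give the functional form of the smooth length penalty, so the logistic curve, the names `smooth_length_penalty`, `verifiable_reward`, and `combined_reward`, and the parameters `soft_limit`, `width`, and `penalty_weight` are all assumptions. Applying the penalty only on the generative-reward branch is likewise an assumption based on the finding that the penalty targets reward hacking in generative reward models.

```python
import math


def smooth_length_penalty(num_tokens, soft_limit=1024, width=256.0):
    """Hypothetical smooth penalty in (0, 1).

    Assumed logistic form: close to 0 well below soft_limit, exactly 0.5 at
    soft_limit, and approaching 1 well above it. The paper's actual form is
    not stated in this summary.
    """
    return 1.0 / (1.0 + math.exp(-(num_tokens - soft_limit) / width))


def verifiable_reward(response, reference):
    """Rule-based check (assumed here to be exact match after stripping)."""
    return 1.0 if response.strip() == reference.strip() else 0.0


def combined_reward(sample, response, num_tokens, reward_model, penalty_weight=0.5):
    """Route each sample to the reward source that fits its domain.

    sample: dict with a "prompt" key and an optional "reference" key.
    reward_model: hypothetical callable (prompt, response) -> float score,
        standing in for a generative reward model.
    """
    if sample.get("reference") is not None:
        # Verifiable domain: use the rule-based reward.
        return verifiable_reward(response, sample["reference"])
    # Non-verifiable domain: model-based score minus the length penalty.
    score = reward_model(sample["prompt"], response)
    return score - penalty_weight * smooth_length_penalty(num_tokens)
```

In multi-task training, a single batch could mix both kinds of samples, with `combined_reward` choosing the reward source per sample. A smooth penalty, unlike a hard length cutoff, changes gradually with response length, which avoids a sharp jump in reward at a single length threshold.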
Abstract
Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the…
