Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui

TL;DR
This paper introduces second-order rollout in reinforcement learning to enhance training data utilization by jointly training generation and critique capabilities, leading to improved model performance.
Contribution
It proposes a unified framework for combining generation and critique training in RL, addressing limitations of first-order rollout methods.
Findings
Second-order rollout improves data efficiency over vanilla RL.
Critique training effectiveness depends on label balance.
Sampling techniques mitigate noise in outcome-based rewards.
Abstract
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple responses for a question), and we argue that this approach fails to fully exploit the potential of training data because of the neglect of critique capability training. To tackle this problem, we further introduce the concept of second-order rollout (generating multiple critiques for a response) and propose a unified framework for jointly training generation and critique capabilities. Extensive experiments across various models and datasets demonstrate that our approach can utilize training data more effectively than vanilla RL and achieve better performance under the same training data. Additionally, we uncover several insightful findings regarding…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Sentiment Analysis and Opinion Mining
