TL;DR
This paper introduces a metric based on reasoning trees, together with a scheduling algorithm, for reinforcement learning with verifiable rewards (RLVR), with the goal of improving large language model performance on math reasoning tasks.
Contribution
It proposes the Reasoning Score (r-score) and the Re-Schedule algorithm, which use the structure of each query's reasoning tree to schedule RLVR training data more effectively.
Findings
Re-Schedule improves average accuracy by up to 3.2% on six benchmarks.
Accounting for the structure of a query's reasoning tree improves RLVR data scheduling.
The approach demonstrates significant gains over path-based metrics.
Abstract
Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's "Reasoning Tree". This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to…
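To make the scheduling idea concrete, here is a minimal sketch of how a curriculum might be ordered once each query already has an r-score. It is not the paper's implementation: the abstract does not say how the r-score is computed or how Re-Schedule weights queries during training, so the `Query` class, the `build_curriculum` function, the `num_stages` parameter, and the score values are all hypothetical. The only detail taken from the abstract is that the curriculum starts with high r-score (structurally simple) queries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Query:
    """A training query with a precomputed r-score (hypothetical container)."""
    text: str
    r_score: float  # assumed to be computed elsewhere; not defined in this sketch


def build_curriculum(queries, num_stages):
    """Split queries into stages ordered from highest to lowest r-score.

    Assumption: a simple sort-and-chunk scheme stands in for the actual
    Re-Schedule algorithm, whose details are not given in the abstract.
    """
    if num_stages <= 0:
        raise ValueError("num_stages must be positive")
    # High r-score first, matching "structurally simple (high r-score)" first.
    ordered = sorted(queries, key=lambda q: q.r_score, reverse=True)
    if not ordered:
        return []
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Usage with made-up scores, purely for illustration.
example_queries = [
    Query("q1", 0.2),
    Query("q2", 0.9),
    Query("q3", 0.5),
    Query("q4", 0.7),
]
stages = build_curriculum(example_queries, num_stages=2)
```

With these illustrative scores, the first stage holds the two highest-scoring queries and the second stage holds the rest. A real scheduler would likely do more than hard partitioning, for example by adjusting sampling weights over the course of training.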