TL;DR
ROSE introduces semantically diverse exploration strategies to improve reasoning diversity and efficiency in reinforcement learning for large language models, validated on mathematical reasoning benchmarks.
Contribution
It proposes a semantic-entropy-based branching strategy and an epsilon-greedy exploration mechanism to enhance reasoning diversity and efficiency in RL-based LLM reasoning.
Findings
ROSE improves reasoning diversity and efficiency on mathematical benchmarks.
Semantic-entropy-based branching captures semantic uncertainty effectively.
A length-aware advantage estimator rewards concise, correct reasoning.
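
The two mechanisms named in the findings can be illustrated with a minimal sketch. This is not the paper's implementation. The first function computes the standard discrete entropy over semantic clusters of sampled rollouts, and assumes that cluster labels are already available (how ROSE groups rollouts by meaning is not stated here). The second function is a purely hypothetical length-aware advantage: the shaping rule, the `alpha` parameter, and the function names are assumptions chosen only to show how a shorter correct answer could receive a larger advantage than a longer correct one.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one label per sampled rollout; rollouts sharing a label are
    assumed to be semantically equivalent (the clustering step is not shown).
    """
    n = len(cluster_ids)
    if n == 0:
        return 0.0
    counts = Counter(cluster_ids)
    # H = -sum_c p_c * log(p_c), with p_c the fraction of rollouts in cluster c
    return 0.0 - sum((c / n) * math.log(c / n) for c in counts.values())


def length_aware_advantages(rewards, lengths, alpha=0.5):
    """Hypothetical length-aware advantage (an assumption, not ROSE's estimator).

    Correct rollouts (reward > 0) get a bonus that grows as they get shorter
    relative to the longest rollout in the group; advantages are then the
    shaped rewards centered on the group mean.
    """
    max_len = max(lengths)
    shaped = [
        r * (1.0 + alpha * (1.0 - length / max_len)) if r > 0 else r
        for r, length in zip(rewards, lengths)
    ]
    mean = sum(shaped) / len(shaped)
    return [s - mean for s in shaped]


if __name__ == "__main__":
    # Rollouts that all agree have zero semantic entropy; a split raises it.
    print(semantic_entropy(["a", "a", "a"]))
    print(semantic_entropy(["a", "b", "a", "c"]))
    # Two correct rollouts of different lengths and one incorrect rollout.
    print(length_aware_advantages([1.0, 1.0, 0.0], [100, 200, 150]))
```

In this sketch, a position where the sampled continuations fall into many different semantic clusters scores high entropy, which is the kind of point the findings describe as a good place to branch.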
Abstract
Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained, segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address these challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an ε-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive…
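
The abstract is cut off before it describes the ε-exploration mechanism, so the following is only a textbook epsilon-greedy rule applied to branching-point selection, offered as a rough illustration and not as the procedure ROSE uses. The function name `pick_branch_point`, the `entropy_by_position` mapping, and the default `epsilon` are all hypothetical.

```python
import random


def pick_branch_point(entropy_by_position, epsilon=0.1, rng=None):
    """Choose a position in a rollout from which to branch.

    entropy_by_position: dict mapping a candidate position to its semantic
    entropy score (assumed to be precomputed).
    With probability epsilon a candidate is chosen uniformly at random;
    otherwise the candidate with the highest entropy is chosen.
    """
    rng = rng or random.Random()
    positions = list(entropy_by_position)
    if not positions:
        raise ValueError("no candidate branching points")
    if rng.random() < epsilon:
        return rng.choice(positions)  # explore: any candidate
    return max(positions, key=lambda p: entropy_by_position[p])  # exploit


if __name__ == "__main__":
    scores = {3: 0.10, 7: 1.05, 12: 0.69}
    print(pick_branch_point(scores, epsilon=0.0))  # always the argmax
    print(pick_branch_point(scores, epsilon=0.3, rng=random.Random(0)))
```

Setting `epsilon` to 0 reduces the rule to pure greedy selection of the highest-entropy point, while larger values trade some of that focus for broader coverage of candidate positions.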
