TL;DR
Dare is a reinforcement learning framework that co-evolves difficulty estimation with the policy to improve training efficiency, produce more concise responses on easy tasks, and raise correctness on hard ones.
Contribution
It introduces a unified, difficulty-aware RL method that co-evolves difficulty estimation with the policy using self-normalized importance sampling and adaptive strategies.
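As a rough illustration of how self-normalized importance sampling can keep a difficulty estimate in step with a drifting policy, the sketch below re-weights rollouts collected under an older policy to estimate a prompt's pass rate under the current one. It is not the paper's implementation: the function name `snis_pass_rate`, the use of sequence-level log-probabilities, and the toy numbers are all assumptions made for this example.

```python
import math

def snis_pass_rate(rewards, logp_new, logp_old):
    """Hypothetical helper: self-normalized importance-sampling estimate of a
    prompt's pass rate under the current policy, using rollouts that were
    sampled from an older policy."""
    # Log importance ratios log(pi_new(y) / pi_old(y)), one per rollout.
    log_w = [n - o for n, o in zip(logp_new, logp_old)]
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels when the weights are normalized.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    # Weighted average of rewards, normalized by the sum of the weights.
    return sum(wi * r for wi, r in zip(w, rewards)) / sum(w)

# Toy data (assumed): binary correctness rewards and sequence log-probs.
rewards = [1.0, 0.0, 1.0, 0.0]
logp_old = [-12.0, -10.0, -11.0, -9.0]
logp_new = [-11.0, -11.0, -10.0, -10.0]

estimate = snis_pass_rate(rewards, logp_new, logp_old)
naive = sum(rewards) / len(rewards)
```

In this toy case the current policy assigns more probability to the correct rollouts than the old policy did, so the re-weighted estimate is higher than the naive mean of the stale rewards; the prompt looks easier to the current policy than the stale estimate suggests.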
Findings
Dare outperforms existing methods in both training efficiency and final performance.
It produces more concise responses on easy tasks and improves correctness on hard ones.
The approach is validated across multiple models and domains.
Abstract
Reinforcement learning improves the reasoning ability of large language models but remains costly and sample-inefficient, as many rollouts provide weak learning signals. Difficulty-aware data selection methods attempt to address this by prioritizing moderately difficult prompts, yet our analysis reveals three limitations: difficulty estimates become inaccurate under policy drift, data selection alone yields limited final-performance gains, and inference efficiency remains largely unchanged. These findings suggest that efficient and effective RL requires more than filtering by difficulty: the policy should learn to solve hard tasks while producing concise responses for easy ones. To this end, we propose **Dare**, a unified framework that co-evolves difficulty estimation with the policy via self-normalized importance sampling, maintains diverse difficulty coverage through a symmetric Beta…
