TL;DR
SUPERNOVA introduces a data curation framework for reinforcement learning with verifiable rewards, significantly enhancing large language models' general reasoning capabilities across diverse tasks.
Contribution
The paper presents a systematic approach to adapting expert-annotated instruction datasets for RLVR, improving reasoning performance and providing practical data curation insights.
Findings
Models trained on SUPERNOVA outperform baselines on reasoning benchmarks.
Task source selection significantly impacts reasoning performance.
Training on SUPERNOVA yields up to 52.8% improvement on BBEH.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advances, LLMs still struggle with general reasoning tasks that require capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance.…
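To make "verifiable" concrete, the following is a minimal sketch of how an instruction-tuning example with an expert-annotated ground-truth answer could be scored during RLVR. It is an illustration under stated assumptions, not SUPERNOVA's implementation: the `Answer:` output convention, the normalization rules, and the function names `extract_answer`, `normalize`, and `verifiable_reward` are all hypothetical.

```python
import re
from typing import Optional


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the ground truth."""
    answer = extract_answer(response)
    if answer is None:
        # No parseable final answer, so nothing can be verified.
        return 0.0
    return 1.0 if normalize(answer) == normalize(ground_truth) else 0.0
```

Because the reward depends only on a comparison with the annotated answer, any dataset whose examples carry a short, checkable ground truth can in principle be reused this way; free-form answers would need a different verifier.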
