TL;DR
QwenLong-L1 is a reinforcement learning framework that adapts short-context large reasoning models to long-context inputs, improving document question-answering performance through progressive context scaling and curriculum-guided training.
Contribution
The paper formalizes long-context reasoning RL and proposes QwenLong-L1, a novel framework that enhances short-context LRMs for long-context reasoning via progressive scaling and curriculum-guided RL.
Findings
Outperforms existing LRMs on seven long-context document question-answering benchmarks.
Achieves performance comparable to state-of-the-art models such as Claude-3.7-Sonnet-Thinking.
Demonstrates robust reasoning capabilities in information-intensive environments.
Abstract
Recent large reasoning models (LRMs) have demonstrated strong reasoning capabilities through reinforcement learning (RL). These improvements have primarily been observed in short-context reasoning tasks. In contrast, extending LRMs to effectively process and reason on long-context inputs via RL remains a critical unsolved challenge. To bridge this gap, we first formalize the paradigm of long-context reasoning RL and identify two key challenges: suboptimal training efficiency and an unstable optimization process. To address these issues, we propose QwenLong-L1, a framework that adapts short-context LRMs to long-context scenarios via progressive context scaling. Specifically, we utilize a warm-up supervised fine-tuning (SFT) stage to establish a robust initial policy, followed by a curriculum-guided phased RL technique to stabilize the policy evolution, and enhanced with a…
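The idea of progressive context scaling can be illustrated with a minimal sketch. This is not the authors' implementation: the `Example` type, the `phase_schedule` helper, the token limits, and the choice to bucket examples into disjoint length ranges are all assumptions made for illustration. It only shows how a training set might be split into phases whose maximum context length grows from one phase to the next.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical training example; field names are illustrative."""
    context_tokens: int  # length of the document context, in tokens
    question: str


def phase_schedule(examples, phase_limits):
    """Yield (limit, batch) pairs, one per phase, in order of growing limit.

    Each phase receives the examples whose context length falls between the
    previous phase's limit (exclusive) and its own limit (inclusive).
    Examples longer than the largest limit are left out.
    """
    previous = 0
    for limit in sorted(phase_limits):
        batch = [ex for ex in examples if previous < ex.context_tokens <= limit]
        yield limit, batch
        previous = limit


# Illustrative data and limits (assumed values, not taken from the paper).
examples = [
    Example(1_000, "q1"),
    Example(12_000, "q2"),
    Example(40_000, "q3"),
]
schedule = list(phase_schedule(examples, [8_000, 32_000, 64_000]))

for limit, batch in schedule:
    print(f"phase with limit {limit}: {[ex.question for ex in batch]}")
```

In a real pipeline each phase would run an RL update loop on its batch, starting from the policy produced by the previous phase; that loop is omitted here.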
