Incentivizing In-depth Reasoning over Long Contexts with Process Advantage Shaping
Miao Peng, Weizhou Shen, Nuo Chen, Chenliang Li, Ming Yan, Jia Li

TL;DR
This paper introduces DeepReasonQA, a framework for synthesizing challenging long-context QA data, and LongPAS, a method for fine-grained credit assignment. Together they improve long-context reasoning in LLMs and significantly outperform existing RLVR approaches.
Contribution
The paper presents a new framework for generating difficult multi-hop QA data and a process advantage shaping method that improves long-context reasoning in LLMs, addressing the 'almost-there' phenomenon.
Findings
Outperforms RLVR baselines on long-context reasoning benchmarks.
Matches frontier LLMs with fewer parameters.
Strengthens reasoning capabilities while maintaining stable training.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs' short-context reasoning, but its performance degrades in long-context scenarios that require both precise grounding and robust long-range reasoning. We identify the "almost-there" phenomenon in long-context reasoning, where trajectories are largely correct but fail at the final step, and attribute this failure to two factors: (1) the lack of high reasoning density in long-context QA data that push LLMs beyond mere grounding toward sophisticated multi-hop reasoning; and (2) the loss of valuable learning signals during long-context RL training due to the indiscriminate penalization of partially correct trajectories with incorrect outcomes. To overcome this bottleneck, we propose DeepReasonQA, a KG-driven synthesis framework that controllably constructs high-difficulty, multi-hop long-context QA…
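The second factor in the abstract, indiscriminate penalization under outcome-only rewards, can be illustrated with a small sketch. It is not the paper's LongPAS method, whose details are not given here. It assumes a group-normalized outcome advantage of the kind commonly used in RLVR training, and the trajectory records with their `correct_steps` field are hypothetical illustration data.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of sampled trajectories.

    Assumption: a group-normalized advantage; the exact form used in the
    paper's baselines is not stated in this summary.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four trajectories sampled for one question.
# "correct_steps" is an illustrative measure of how far each one got;
# the outcome reward only checks the final answer.
trajectories = [
    {"correct_steps": 5, "total_steps": 5, "reward": 1.0},
    {"correct_steps": 4, "total_steps": 5, "reward": 0.0},  # "almost there"
    {"correct_steps": 0, "total_steps": 5, "reward": 0.0},  # wrong throughout
    {"correct_steps": 5, "total_steps": 5, "reward": 1.0},
]

advantages = group_relative_advantages([t["reward"] for t in trajectories])

for t, a in zip(trajectories, advantages):
    print(f'{t["correct_steps"]}/{t["total_steps"]} steps correct -> advantage {a:+.3f}')
```

Under this scheme the trajectory with four of five correct steps receives exactly the same negative advantage as the one with none, so the signal carried by its correct intermediate steps is lost.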
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics · Machine Learning in Healthcare
