LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards
Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing

TL;DR
LongRLVR augments outcome-only reinforcement learning for large language models on long-context tasks with a verifiable context reward, improving performance on tasks that require grounding answers in the provided evidence.
Contribution
The paper proposes a dense, verifiable context reward to address the sparsity issue in outcome-only rewards, enabling effective learning in long-context reinforcement learning with LLMs.
Findings
LongRLVR outperforms standard RLVR on multiple benchmarks.
Boosts LLM reasoning accuracy in long-context tasks.
Addresses vanishing gradient issues in context grounding.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding, the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model in identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly…
Peer Reviews
Decision: ICLR 2026 Poster
1. **Timely and impactful problem.** The paper tackles a highly relevant and increasingly important issue: how to perform reinforcement learning effectively in long-context settings. As large-context reasoning becomes central to emerging LLM-based agents and search systems, addressing the credit-assignment and gradient-vanishing challenges identified here is both timely and of broad significance.
2. **Strong motivation, clear formulation, and well-executed methodology.** The study is well motiv
1. **Strong theoretical assumptions.** The analysis relies on several simplifying assumptions that may not fully hold in practice. In particular, it adopts an all-or-nothing reward assumption, where the answer reward increases only when the entire evidence set G is selected. In reality, LLMs often produce correct answers from partial or alternative evidence, making this assumption less realistic. Similarly, the independent Bernoulli selection assumption overlooks dependencies between evidence ch
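To make the vanishing-gradient concern concrete, here is a minimal numeric sketch under the two simplifying assumptions this review names: each evidence chunk in G is selected independently with a Bernoulli probability, and the answer reward is 1 only when all of G is selected. The sigmoid parameterization, the function names, and the per-chunk "dense" reward used for contrast are illustrative assumptions, not the paper's formulation.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def outcome_only_grad(logits, j):
    """Gradient w.r.t. logit j of E[r] = prod_i p_i.

    Assumption: all-or-nothing answer reward, independent Bernoulli
    selection with p_i = sigmoid(logit_i).
    d/d theta_j prod_i p_i = (prod_i p_i) * (1 - p_j).
    """
    probs = [sigmoid(t) for t in logits]
    return math.prod(probs) * (1.0 - probs[j])


def dense_context_grad(logits, j):
    """Gradient w.r.t. logit j of E[r] = mean_i p_i.

    Assumption: an illustrative dense reward that gives per-chunk credit
    (expected recall over the evidence set); not the paper's reward.
    """
    p_j = sigmoid(logits[j])
    return p_j * (1.0 - p_j) / len(logits)


if __name__ == "__main__":
    theta = math.log(0.1 / 0.9)  # logit for which each chunk has p = 0.1
    for k in (1, 2, 4, 8):  # k = number of evidence chunks in G
        logits = [theta] * k
        print(k, outcome_only_grad(logits, 0), dense_context_grad(logits, 0))
```

In this toy setting, with p = 0.1 per chunk, the outcome-only gradient shrinks by a factor of ten for every additional evidence chunk, while the per-chunk signal shrinks only as 1/k.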
1. The formal analysis of why the outcome-only reward is insufficient for the long-context retrieval-based task provides some transferable insights.
2. The modulated F-score reward, combining unconditional grounding reward and synergistic success reward, is thoughtfully designed.
3. The paper provides extensive analysis on both synthetic and real-world long-context tasks and includes thorough ablations examining reward components, data quality, hyperparameters, and chunk number robust
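One way a reward with this shape could be computed is sketched below. The weighting `eta`, the F-beta parameter `beta`, the additive combination with the answer reward, and all function names are assumptions made for illustration; the paper's exact definition is not reproduced here.

```python
def f_beta(selected: set, gold: set, beta: float = 1.0) -> float:
    """F-beta overlap between selected chunk ids and gold evidence chunk ids."""
    true_positives = len(selected & gold)
    if true_positives == 0:
        return 0.0  # also covers an empty selection or empty gold set
    precision = true_positives / len(selected)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def combined_reward(answer_correct, selected, gold, eta=0.5, beta=1.0):
    """Hypothetical total reward: answer reward plus a modulated F-score.

    grounding: paid for good evidence selection even if the answer is wrong.
    synergy:   paid only when the answer is also correct.
    """
    f = f_beta(set(selected), set(gold), beta)
    r_answer = 1.0 if answer_correct else 0.0
    grounding = eta * f
    synergy = (1.0 - eta) * f * r_answer
    return r_answer + grounding + synergy


if __name__ == "__main__":
    gold_chunks = [3, 7]
    print(combined_reward(True, [3, 7], gold_chunks))   # right answer, right evidence
    print(combined_reward(False, [3, 7], gold_chunks))  # wrong answer, right evidence
    print(combined_reward(False, [], gold_chunks))      # nothing selected
```

The unconditional term keeps a learning signal alive when the final answer is wrong, which is the case where an outcome-only reward gives no credit for finding the right evidence.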
1. The comparison is a bit weak, which hinders the overall soundness of the work. Interleaving reasoning and retrieval is now becoming more popular. I would suggest comparing with some RAG baselines (which do not require RLVR but fit the same scenario), as well as some recent works like [1].
2. Assumption 1 seems too strong for the analysis. In reality, the reward for retrieved evidence, if applied, should be more continuous than the 0 or 1 sparse reward. Also, the independence assumption for ch
The problem is significant, as RLVR may stimulate hallucinations and render the training process unstable, while its sparse rewards make effective exploration challenging in practice. The paper addresses the issue of vanishing gradients in RLVR under sparse outcome-reward settings, examining its causes and implications. The choice of the F1 score as a reward makes sense to me, since it balances precision and recall rather than encouraging the model to cover the evidence as much as possible. Th
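A small worked example illustrates the precision/recall trade-off mentioned here. The setting (20 chunks, 2 of them gold evidence) and the symbols S for the selected set and G for the gold set are assumptions chosen for illustration, not figures from the paper.

```latex
% Assumed toy setting: 20 chunks in the context, |G| = 2 gold evidence chunks.
\[
F_1 = \frac{2PR}{P + R}, \qquad
P = \frac{|S \cap G|}{|S|}, \qquad
R = \frac{|S \cap G|}{|G|}
\]
\begin{align*}
\text{select all 20 chunks:}\quad & P = \tfrac{2}{20} = 0.1,\; R = 1,\; F_1 = \tfrac{0.2}{1.1} \approx 0.18 \\
\text{select one gold chunk:}\quad & P = 1,\; R = 0.5,\; F_1 = \tfrac{1}{1.5} \approx 0.67 \\
\text{select exactly } G\text{:}\quad & P = 1,\; R = 1,\; F_1 = 1
\end{align*}
```

A recall-only reward would score the first and third selections equally, whereas F1 penalizes selecting everything.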
However, my concerns arise from the data generation pipeline and the usage of the verifier LLM.
1. It seems that the method is only applicable to grounded QA, where evidence can be cleanly chunked. However, in such a case, applying rule-based rewards for the evidence suggestion should be straightforward. The usage of the F1-score is also straightforward to me, since recall encourages the policy to cover as many chunks as possible.
2. A separate verifier LLM is used, which helps identify
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
