LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang

TL;DR
LoongRL introduces a reinforcement learning approach with KeyChain data synthesis to enhance large language models' reasoning over extremely long contexts, achieving significant accuracy improvements and emergent reasoning patterns.
Contribution
The paper presents LoongRL, a novel RL method with KeyChain data that enables models to perform advanced long-context reasoning, generalizing beyond training lengths.
Findings
Substantial accuracy improvements on long-context QA tasks.
Emergent plan-retrieve-reason-recheck reasoning pattern.
Models effectively handle 128K tasks without full-length RL costs.
Abstract
Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing "Aha" moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck…
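The KeyChain idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`make_chain`, `build_keychain_task`, `trace_chain`), the JSON-like line format, the prompt wording, and the default of three hops are all assumptions made for illustration. The sketch only shows the general mechanism of hiding the true question at the end of a UUID chain that is scattered among distracting documents and distractor chains.

```python
import random
import uuid


def make_chain(question, hops, rng):
    """Build one UUID chain: key_0 -> key_1 -> ... -> question (hypothetical helper)."""
    keys = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(hops)]
    pairs = {keys[i]: keys[i + 1] for i in range(hops - 1)}
    pairs[keys[-1]] = question  # the last key reveals the question
    return keys[0], pairs


def build_keychain_task(true_q, distractor_qs, documents, hops=3, seed=0):
    """Assemble a long-context prompt (assumed format, for illustration only)."""
    rng = random.Random(seed)
    start_key, pairs = make_chain(true_q, hops, rng)
    # Distractor chains end in distractor questions and are never referenced
    # by the starting key given to the model.
    for q in distractor_qs:
        _, extra = make_chain(q, hops, rng)
        pairs.update(extra)
    # Scatter key-value lines among the documents.
    segments = list(documents) + [f'{{"{k}": "{v}"}}' for k, v in pairs.items()]
    rng.shuffle(segments)
    context = "\n\n".join(segments)
    prompt = (
        f"{context}\n\n"
        f"Follow the chain starting at key {start_key} to find the question, "
        "then answer it using the documents above."
    )
    return prompt, start_key, pairs


def trace_chain(start_key, pairs):
    """Reference solver: follow keys until the value is no longer a key."""
    value = pairs[start_key]
    while value in pairs:
        value = pairs[value]
    return value
```

In this sketch, a model that ignores the chain and answers the first question it sees is likely to pick a distractor, which is the difficulty the abstract attributes to KeyChain tasks.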
Peer Reviews
Decision: ICLR 2026 Oral
- This work proposes an interesting approach to synthetic data construction for training a language model for long-context reasoning. The basic idea is to insert distractors for both contexts and questions, so that a model has to attend not only to the correct context but also to the correct question. It is interesting that a model trained only on "shorter" contexts, i.e., 16K, can scale to 128K contexts.
- The idea of inserting key-value pairs is quite interesting so that a model has to tra
- The motivation for using UUIDs as keys is not clear. Alternatives such as entity names or other random strings could also be possible.
- The detailed settings are missing, e.g., the number of distractor contexts and questions inserted to construct the synthetic dataset. It is also not clear whether the distractor questions are related to the irrelevant contexts already inserted in the long-context filling step.
The paper addresses an important open problem: reasoning over long contexts beyond basic retrieval. By enabling RL to target nontrivial but verifiable long-context reasoning problems, the KeyChain dataset overcomes a key bottleneck of long-context RL finetuning. The results indicate strong empirical performance gains on long-context benchmarks without regressing on the short-context reasoning benchmarks considered; however, the latter is expected given the inclusion of short-context reasoning tas
The KeyChain data construction is highly task-specific: synthetic multi-hop QA with UUID breadcrumbs. It is unclear from the current results if the learned reasoning behaviour generalises to other domains, such as open-ended dialogue, summarisation, or multi-document synthesis. There is no evaluation on long-context generation tasks, which has been artificially decoupled from reasoning over long, static input contexts. A major claim is the emergence of a general reasoning pattern for long-contex
- Clear, focused objective and thoughtful problem decomposition:
  - Tackles a real gap: moving beyond retrieval to robust long-context reasoning.
  - Designs data and training to be verifiable and compute-aware (shorter rollouts, longer generalization).
- Novel and pragmatic data construction:
  - KeyChain is a neat way to force chain tracing and disambiguation under heavy distractors, requiring both retrieval and reasoning.
  - Uses real QA seeds (HotpotQA, MuSiQue, 2Wiki) to ground tasks in n
- Synthetic structure risk:
  - The KeyChain format (UUID chains with a designated starting key) has a highly regular, explicit structure. There is a risk the model learns to exploit these patterns rather than developing general long-context reasoning skills. Although downstream improvements suggest transfer, additional tests against structurally varied chains would better establish robustness.
- Reward and evaluation concerns:
  - Two-way substring match is pragmatic but may still admit false p
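The two-way substring match mentioned in the review above can be sketched as follows. This is only an illustration of the general technique, not the paper's reward function: the normalization steps (lowercasing, stripping punctuation and English articles) and the names `normalize` and `two_way_match_reward` are assumptions.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def two_way_match_reward(prediction, gold):
    """Return 1.0 if either normalized string contains the other, else 0.0."""
    p, g = normalize(prediction), normalize(gold)
    if not p or not g:
        return 0.0  # an empty string would trivially be a substring
    return 1.0 if (p in g or g in p) else 0.0
```

Under these assumed rules, a very short prediction such as "no" would be rewarded against the gold answer "Norway", which shows one way a plain substring check can accept a wrong answer.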
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
