TL;DR
This paper introduces Knowledgeable-R1, a reinforcement learning framework that enhances retrieval-augmented generation models by enabling them to better resist irrelevant or conflicting external context through the use of parametric knowledge, improving robustness and reasoning accuracy.
Contribution
It presents a novel reinforcement learning approach with joint sampling and advantage transformation to improve model robustness against contextual interference in RAG tasks.
Findings
Significantly improves robustness in counterfactual scenarios (+22.89%)
Enhances reasoning accuracy under knowledge conflict
Maintains performance when retrieved context is accurate
Abstract
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors. We propose Knowledgeable-R1, a reinforcement-learning framework that explicitly trains large language models to use parametric knowledge (PK) to resist contextual interference while still exploiting external context when it is reliably helpful. Knowledgeable-R1 introduces a joint sampling scheme that generates paired responses with and without retrieval, and learns both local advantages (within each decoding regime) and global advantages under the same input to quantify when to ignore misleading context versus adopt it. We employ an asymmetric advantage transformation that amplifies exploratory behaviors toward parametric knowledge. Experiments show that…
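As a rough illustration of the joint sampling and advantage design described in the abstract, the sketch below computes group-relative advantages for two sets of responses to the same question, one sampled without retrieval and one with it. It is not the paper's implementation: the function names, the standard-score normalization, the plain sum of local and global terms, and the `damp` factor that shrinks negative advantages on the parametric-knowledge branch are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def normalize(rewards, pool, eps=1e-6):
    """Standard-score each reward against the mean/std of `pool` (assumed form)."""
    mu = mean(pool)
    sigma = pstdev(pool)
    return [(r - mu) / (sigma + eps) for r in rewards]


def joint_advantages(pk_rewards, ck_rewards, damp=0.5):
    """Hypothetical local + global advantages for paired samples of one question.

    pk_rewards: rewards of responses sampled WITHOUT retrieved context.
    ck_rewards: rewards of responses sampled WITH retrieved context.
    damp: assumed factor in (0, 1] that shrinks negative advantages on the
          parametric-knowledge branch, so exploration there is penalized less.
    """
    # Local advantages: each response is compared only with its own regime.
    local_pk = normalize(pk_rewards, pk_rewards)
    local_ck = normalize(ck_rewards, ck_rewards)

    # Global advantages: each response is compared with the pooled samples
    # from both regimes under the same input.
    pooled = list(pk_rewards) + list(ck_rewards)
    global_pk = normalize(pk_rewards, pooled)
    global_ck = normalize(ck_rewards, pooled)

    # Asymmetric transformation (assumption): keep positive advantages as-is,
    # scale negative ones by `damp` on the parametric-knowledge branch only.
    def asym(a):
        return a if a >= 0 else damp * a

    adv_pk = [asym(l + g) for l, g in zip(local_pk, global_pk)]
    adv_ck = [l + g for l, g in zip(local_ck, global_ck)]
    return adv_pk, adv_ck


if __name__ == "__main__":
    # Toy case: the retrieved context is misleading, so only one
    # no-retrieval sample answers correctly (reward 1.0).
    pk, ck = joint_advantages([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
    print("no-retrieval advantages:", pk)
    print("with-retrieval advantages:", ck)
```

In this toy case the with-retrieval samples have no local signal (all rewards are equal), so only the pooled comparison tells them apart from the correct no-retrieval sample. This is the role the abstract assigns to the global advantage.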
Peer Reviews
Decision·ICLR 2026 Poster
The paper tackles “context dominance” in RAG and designs RL signals that explicitly arbitrate between CK and PK at the same input state (global advantage), which is the right granularity for conflict resolution. The three‑policy setup is well‑motivated and neatly summarized. The combination of local vs. global advantages and an asymmetric transformation for RPK is a coherent way to reward using context when both sources agree or CK is right (timeliness) and to protect PK exploration when CK is wrong.
Retrieval details are not clear, like what retriever, indexing corpus, and query formulations were used. The headline gains mostly compare to RAG prompting (e.g., +30.47/+29.28/+18.09 on Qwen2.5-7B NC-MR/MC/QA; Table 2). Against GRPO w/ RAG, improvements are smaller (e.g., 43.94 vs 26.94 = +17.00). How the “23% over GRPO” figure is aggregated should be clarified. The paper relies on exact match accuracy but does not directly quantify context dependence like answer stability when contexts are
The paper squarely targets the challenge of contextual interference—conflicts between context prompts and parametric knowledge—and proposes Knowledgeable-R1, which improves robustness via joint sampling, local/global advantage design, and an asymmetric advantage transformation (reward shaping).
1. Lack of fine-tuning baselines. Please include fine-tuning methods cited in Related Work (e.g., Self-RAG, InFO-RAG) as baselines for a fair comparison. 2. β scheduling. Compare fixed β settings ({0.2, 0.5, 0.8, 1.0}) versus the adaptive β scheme to demonstrate the necessity of adaptivity. 3. Advantage composition. Validate whether combining local + global advantages is necessary (e.g., ablations varying their relative weights). 4. S1 performance gap. In settings with correct context (S1), performan
1. The paper addresses a highly practical and critical issue in Retrieval-Augmented Generation (RAG) systems: "context interference" or "Context Dominance". Specifically, when Large Language Models (LLMs) encounter retrieved context that is erroneous, irrelevant, or conflicting, they tend to over-rely on this context while ignoring their internal, more accurate Parametric Knowledge (PK). 2. It designs a multi-objective Reinforcement Learning (RL) framework. The most crucial design within this f
1. The method introduces significant training overhead. It requires maintaining and optimizing three distinct strategic objectives (PK, CK, RPK), calculating complex "local + global" advantages, and finally implementing asymmetric modulation. This is far more complex in both implementation and computation compared to standard Supervised Fine-Tuning (SFT) or GRPO with RAG. 2. As mentioned in Section 3.2 of the paper: "During inference, the model does not explicitly switch controllers; the learne
Taxonomy
Methods: Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · Byte Pair Encoding · Softmax · Linear Layer · Dropout · Dense Connections · Attention Is All You Need
