Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
Ziyang Huang, Xiaowei Yuan, Yiming Ju, Jun Zhao, Kang Liu

TL;DR
This paper presents IKEA, an adaptive search agent that intelligently balances internal and external knowledge retrieval using reinforcement learning, leading to more accurate, efficient, and robust reasoning in large language models.
Contribution
Introduces a novel reinforcement learning framework with a knowledge-boundary aware reward for synergistic internal-external knowledge reasoning in LLMs.
Findings
IKEA outperforms baseline methods in multiple reasoning tasks.
Reduces retrieval frequency significantly.
Demonstrates robust generalization capabilities.
Abstract
Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing approaches often underutilize the model's internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent is urgently needed, one that can discern the optimal retrieval timing and synergistically integrate parametric (internal) and retrieved (external) knowledge. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper addresses an important challenge in RAG systems: when to retrieve rather than how to retrieve.
2. The reward formulation is intuitive and well-motivated, in that it attempts to explicitly model the “knowledge boundary.”
3. Results demonstrate that the proposed framework can reduce retrieval calls while maintaining or improving answer correctness.
4. The paper is generally well-written and follows a clear structure.
1. The data construction pipeline (line 245) suggests a much simpler alternative: train a difficulty classifier to decide whether a question requires retrieval. This is directly in line with "Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs through Question Complexity." Such baselines are not included, leaving it unclear whether RL is actually necessary, or whether the same behavior can be achieved more simply and without retraining the model. In particular, for practical deployment (e.g.
- Directly optimizes “when to search.” The reward prefers “correct with fewer RT” over “correct with more RT,” and prefers “searched but wrong” over “did not search and wrong,” which encodes a clear ordering of behaviors (1>3>4>2 in the paper’s terms). This is a simple, targeted shaping for the retrieval timing problem rather than a classifier-gated or imitation-heavy approach.
- Balanced difficulty splits (Qeasy/Qhard) are used not just for evaluation but as a training prior to keep the poli
- Hyperparameter sensitivity. The balance between accuracy and search frequency depends on RTmax, rkb+, and rkb−. Without concrete values and sweeps, it is hard to know how portable the setting is across model sizes and retrievers.
- Labeling procedure. The Qeasy/Qhard split is defined by the base model’s own probe with N samples. This can drift with model choice and N; it would be helpful to report label distributions and sensitivity to N.
- Corpus constraint. Experiments fix Wikipedia201
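The behavior ordering and the three hyperparameters discussed in the review above can be illustrated with a small sketch. This is not the paper's reward: the functional form (a linear discount in the number of retrievals RT), the answer reward of 1.0, and the default values for `rt_max`, `r_kb_pos`, and `r_kb_neg` are all assumptions chosen only so that the ordering the reviewer describes holds.

```python
def kb_reward(correct: bool, num_retrievals: int, rt_max: int = 3,
              r_kb_pos: float = 0.6, r_kb_neg: float = 0.05) -> float:
    """Hypothetical knowledge-boundary bonus (form and defaults are assumptions)."""
    rt = min(max(num_retrievals, 0), rt_max)  # clamp RT into [0, rt_max]
    if correct:
        # Correct answers earn more when fewer retrievals were used.
        return r_kb_pos * (1.0 - rt / rt_max)
    # Wrong answers: a small bonus for having tried to search, none otherwise.
    return r_kb_neg if rt > 0 else 0.0


def total_reward(correct: bool, num_retrievals: int, **kwargs) -> float:
    """Answer reward (assumed 1.0 for a correct answer) plus the boundary bonus."""
    r_ans = 1.0 if correct else 0.0
    return r_ans + kb_reward(correct, num_retrievals, **kwargs)


if __name__ == "__main__":
    cases = {
        "correct, no search": total_reward(True, 0),
        "correct, one search": total_reward(True, 1),
        "wrong, one search": total_reward(False, 1),
        "wrong, no search": total_reward(False, 0),
    }
    for name, value in cases.items():
        print(f"{name}: {value:.2f}")
```

Under these placeholder values, a sweep over `rt_max`, `r_kb_pos`, and `r_kb_neg` of the kind the reviewer asks for would amount to re-running training with different keyword arguments and reporting accuracy against retrieval count.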
1. Building upon previous methods such as Search-R1, this paper introduces an exploration of the balance between internal and external knowledge paths. This not only empowers the model with the ability to actively explore retrieval but also enables it to learn the boundary between its internal parametric knowledge and external knowledge.
2. The baseline experimental comparison is thorough, and the writing is clear.
1. The novelty of this paper is limited. It appears to merely integrate the exploration path strategy for balancing internal and external knowledge from DeepRAG into the Search-R1 framework. In effect, it only designs a more complex reward function on top of Search-R1, incorporating four different key behaviors, and is an extension of DeepSearch-like works.
2. The paper lacks appropriate experimental design to support the claim that its method is indeed more effective in learning th
1. Proposes a reinforcement-learning-based framework for internal–external knowledge synergistic reasoning.
2. The reward function is well-designed, explicitly modeling the trade-off between retrieval cost and answer accuracy.
3. Achieves stable performance improvements across multiple datasets.
1. Limited generalization on Hard/OOD scenarios: Although IKEA aims to trigger retrieval when internal knowledge is insufficient, its improvements over Search-R1 on Hard/OOD datasets (e.g., PopQA and 2Wiki) are modest or even negative, suggesting that the model still tends to over-rely on internal knowledge in out-of-distribution settings.
2. Mismatch between model scale and training data: Models of different sizes have distinct distributions of internal parametric knowledge. If the same trainin
Addresses a practical issue in Retrieval-Augmented Generation (RAG) systems: over-reliance on retrieval even when internal knowledge is sufficient, which leads to increased latency and potential knowledge conflicts. The design of knowledge boundary-aware rewards is intuitive and reasonable, with clear hierarchical structure for behavioral preferences. Evaluated on multiple datasets, including both in-distribution and out-of-distribution tests.
1. Limited technical innovation. Its reward shaping strategy essentially only adopts the "Search-R1" framework from existing Reinforcement Learning (RL) methods and combines it with reward function construction.
2. The comparison scope for baseline model construction is limited and not fully expanded.
3. The 1:1 ratio of easy-to-hard questions lacks theoretical basis, and no ablation experiments have been conducted under different ratios.
4. Lacks a formal and rigorous definition of the "knowledge bound
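The Qeasy/Qhard probing and the 1:1 mixing that several reviewers question can be sketched as follows. The rule used here (a question counts as easy if at least one of N sampled answers is correct) and the helper names `label_questions`, `balanced_mix`, `sample_answer`, and `is_correct` are assumptions for illustration; the excerpts above state only that the split comes from the base model's own probe with N samples.

```python
import random
from typing import Callable, List, Tuple


def label_questions(
    qa_pairs: List[Tuple[str, str]],
    sample_answer: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    n: int = 4,
) -> Tuple[List[str], List[str]]:
    """Split questions into (easy, hard) by probing the model n times each."""
    easy, hard = [], []
    for question, gold in qa_pairs:
        # Assumed rule: easy if any of the n sampled answers matches the gold answer.
        if any(is_correct(sample_answer(question), gold) for _ in range(n)):
            easy.append(question)
        else:
            hard.append(question)
    return easy, hard


def balanced_mix(easy: List[str], hard: List[str], seed: int = 0) -> List[str]:
    """Build a 1:1 easy-to-hard training mix by downsampling the larger side."""
    k = min(len(easy), len(hard))
    rng = random.Random(seed)
    mix = rng.sample(easy, k) + rng.sample(hard, k)
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    # Stand-in for a model: it "knows" only one answer.
    known = {"capital of France?": "Paris"}
    data = [("capital of France?", "Paris"), ("obscure fact?", "X")]
    easy, hard = label_questions(
        data, lambda q: known.get(q, "unknown"), lambda p, g: p == g, n=3
    )
    print(easy, hard, balanced_mix(easy, hard))
```

Because the labels depend on both the probed model and `n`, rerunning `label_questions` with different values of `n` is one way to produce the label-distribution and sensitivity report the reviewers request, and replacing `balanced_mix` with other ratios would give the missing ratio ablation.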
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior
Methods: Attentive Walk-Aggregating Graph Neural Network
