Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

Ziyang Huang; Xiaowei Yuan; Yiming Ju; Jun Zhao; Kang Liu

arXiv:2505.07596·cs.CL·May 13, 2025

TL;DR

This paper presents IKEA, an adaptive search agent that intelligently balances internal and external knowledge retrieval using reinforcement learning, leading to more accurate, efficient, and robust reasoning in large language models.

Contribution

Introduces a novel reinforcement learning framework with a knowledge-boundary aware reward for synergistic internal-external knowledge reasoning in LLMs.

Findings

01 · IKEA outperforms baseline methods in multiple reasoning tasks.

02 · Reduces retrieval frequency significantly.

03 · Demonstrates robust generalization capabilities.

Abstract

Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing approaches often underutilize the model's internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is urgently needed. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper addresses an important challenge in RAG systems: when to retrieve rather than how to retrieve.
2. The reward formulation is intuitive and well-motivated, in that it attempts to explicitly model “knowledge boundary.”
3. Results demonstrate that the proposed framework can reduce retrieval calls while maintaining or improving answer correctness.
4. The paper is generally well-written and follows a clear structure.

Weaknesses

1. The data construction pipeline (line 245) suggests a much simpler alternative: Train a difficulty classifier to decide whether a question requires retrieval. This is directly in line with Adaptive-RAG: Learning to Adapt Retrieval-Augmented LLMs through Question Complexity. Such baselines are not included, leaving it unclear whether RL is actually necessary — or whether the same behavior can be achieved more simply and without retraining the model. In particular, for practical deployment (e.g.

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- Directly optimizes “when to search.” The reward prefers “correct with fewer RT” over “correct with more RT,” and prefers “searched but wrong” over “did not search and wrong,” which encodes a clear ordering of behaviors (1>3>4>2 in the paper’s terms). This is a simple, targeted shaping for the retrieval timing problem rather than a classifier-gated or imitation-heavy approach.
- Balanced difficulty splits (Qeasy/Qhard) are used not just for evaluation but as a training prior to keep the poli
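The behavior ordering described in this review can be illustrated with a minimal sketch. It is not taken from the paper: the functional form (a linearly decaying bonus for correct answers and a small constant for wrong answers that did search), the parameter names `rt_max`, `r_kb_pos`, `r_kb_neg`, and their default values are all assumptions chosen only to reproduce the two stated preferences.

```python
def kb_aware_reward(correct: bool, num_retrievals: int, rt_max: int = 3,
                    r_kb_pos: float = 0.6, r_kb_neg: float = 0.05) -> float:
    """Hypothetical knowledge-boundary-aware reward (illustrative only).

    Assumed shape, not confirmed by the source:
      - correct answer: base reward 1.0 plus a bonus that shrinks linearly
        as the number of retrievals (RT) approaches rt_max;
      - wrong answer: a small positive reward if the agent searched at all,
        zero if it answered wrongly without searching.
    """
    if num_retrievals < 0 or rt_max <= 0:
        raise ValueError("num_retrievals must be >= 0 and rt_max must be > 0")
    rt = min(num_retrievals, rt_max)  # cap the retrieval count at rt_max
    if correct:
        return 1.0 + r_kb_pos * (1.0 - rt / rt_max)
    return r_kb_neg if rt > 0 else 0.0


if __name__ == "__main__":
    # Print the reward for each behavior the review distinguishes.
    for correct, rt in [(True, 0), (True, 2), (False, 1), (False, 0)]:
        print(f"correct={correct}, RT={rt}: {kb_aware_reward(correct, rt)}")
```

Under these assumed values, a correct answer with fewer retrievals always scores higher than a correct answer with more, and a wrong answer after searching scores higher than a wrong answer without searching, which matches the two preferences the reviewer lists.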

Weaknesses

- Hyperparameter sensitivity. The balance between accuracy and search frequency depends on RTmax, rkb+, and rkb−. Without concrete values and sweeps, it is hard to know how portable the setting is across model sizes and retrievers.
- Labeling procedure. The Qeasy/Qhard split is defined by the base model’s own probe with N samples. This can drift with model choice and N; it would be helpful to report label distributions and sensitivity to N.
- Corpus constraint. Experiments fix Wikipedia201
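The labeling concern can be made concrete with a small sketch of one way such a probe might work. Everything here is an assumption for illustration: the rule "easy if any of N closed-book samples matches the gold answer", the exact-match normalization, the 1:1 downsampling, and the helper names `label_question`, `balanced_split`, and `sample_answer` (a stand-in for a call to the base model without retrieval) are hypothetical and not confirmed by the source.

```python
import random
from typing import Callable, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude exact-match check."""
    return " ".join(text.lower().split())


def label_question(question: str, gold: str,
                   sample_answer: Callable[[str], str], n: int = 5) -> str:
    """Assumed rule: 'easy' if any of n closed-book samples matches gold."""
    for _ in range(n):
        if normalize(sample_answer(question)) == normalize(gold):
            return "easy"
    return "hard"


def balanced_split(items: List[Tuple[str, str]],
                   sample_answer: Callable[[str], str],
                   n: int = 5, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Label every (question, gold) pair, then downsample to a 1:1 ratio."""
    easy: List[str] = []
    hard: List[str] = []
    for question, gold in items:
        bucket = easy if label_question(question, gold, sample_answer, n) == "easy" else hard
        bucket.append(question)
    k = min(len(easy), len(hard))
    rng = random.Random(seed)  # seeded so the split is reproducible
    return rng.sample(easy, k), rng.sample(hard, k)
```

Because the label depends on both the model behind `sample_answer` and the value of `n`, re-running this with a different model or a larger `n` can move questions between the two buckets, which is the drift the reviewer asks the authors to quantify.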

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Building upon previous methods such as Search-R1, this paper introduces an exploration of the balance between internal and external knowledge paths. This not only empowers the model with the ability to actively explore retrieval but also enables it to learn the boundary between its internal parametric knowledge and external knowledge.
2. The baseline experimental comparison is thorough, and the writing is clear.

Weaknesses

1. The novelty of this paper is limited. It appears to merely integrate the exploration path strategy for balancing internal and external knowledge from DeepRAG into the Search-R1 framework. Essentially, it only designs a more complex reward function based on Search-R1, incorporating four different key behaviors, and is essentially an extension of DeepSearch-like works.
2. The paper lacks appropriate experimental design to support the claim that its method is indeed more effective in learning th

Reviewer 04 · Rating 6 · Confidence 4

Strengths

1. Proposes a reinforcement-learning-based framework for internal–external knowledge synergistic reasoning.
2. The reward function is well-designed, explicitly modeling the trade-off between retrieval cost and answer accuracy.
3. Achieves stable performance improvements across multiple datasets.

Weaknesses

1. Limited generalization on Hard/OOD scenarios: Although IKEA aims to trigger retrieval when internal knowledge is insufficient, its improvements over Search-R1 on Hard/OOD datasets (e.g., PopQA and 2Wiki) are modest or even negative, suggesting that the model still tends to over-rely on internal knowledge in out-of-distribution settings.
2. Mismatch between model scale and training data: Models of different sizes have distinct distributions of internal parametric knowledge. If the same trainin

Reviewer 05 · Rating 2 · Confidence 4

Strengths

Addresses a practical issue in Retrieval-Augmented Generation (RAG) systems: over-reliance on retrieval even when internal knowledge is sufficient, which leads to increased latency and potential knowledge conflicts. The design of knowledge boundary-aware rewards is intuitive and reasonable, with clear hierarchical structure for behavioral preferences. Evaluated on multiple datasets, including both in-distribution and out-of-distribution tests.

Weaknesses

1. Limited technical innovation. Its reward shaping strategy for existing Reinforcement Learning (RL) methods essentially only adopts the "Search-R1" framework combined with reward function construction.
2. The comparison scope for baseline model construction is limited and not fully expanded.
3. The 1:1 ratio of easy-to-hard questions lacks theoretical basis, and no ablation experiments have been conducted under different ratios.
4. Lacks a formal and rigorous definition of the "knowledge bound

Code & Models

Repositories

hzy312/knowledge-r1
pytorch

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior

Methods: Attentive Walk-Aggregating Graph Neural Network