Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Yida Zhao; Kuan Li; Xixi Wu; Liwen Zhang; Dingchu Zhang; Baixuan Li; Maojia Song; Zhuo Chen; Chenxi Wang; Xinyu Wang; Kewei Tu; Pengjun Xie; Jingren Zhou; Yong Jiang

arXiv:2510.24694·cs.CL·February 25, 2026

TL;DR

This paper introduces E-GRPO, a novel training framework that leverages entity information in synthetic data to improve the learning and reasoning efficiency of search agents in complex QA tasks.

Contribution

The paper proposes E-GRPO, a new entity-aware reward method that enhances training by utilizing discarded entity information, leading to better performance and reasoning efficiency.

Findings

1. E-GRPO outperforms the baseline GRPO in diverse QA benchmarks.
2. E-GRPO enables learning from near-miss samples through partial rewards.
3. Models trained with E-GRPO require fewer tool calls, indicating more efficient reasoning.
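
The near-miss finding can be illustrated with a small sketch. It assumes the standard GRPO group-normalized advantage (reward minus the group mean, divided by the group standard deviation) and a hypothetical partial reward of `alpha` times an entity match rate for incorrect rollouts. The function names, `alpha = 0.3`, and the match rates are illustrative assumptions, not values taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps) within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def partial_rewards(correct, match_rates, alpha=0.3):
    """Hypothetical entity-aware reward: 1.0 if correct, else alpha * match rate."""
    return [1.0 if c else alpha * m for c, m in zip(correct, match_rates)]

# A group of four rollouts, all with wrong final answers.
correct = [False, False, False, False]
match_rates = [0.0, 0.25, 0.75, 1.0]  # assumed fraction of ground-truth entities found

binary = [1.0 if c else 0.0 for c in correct]
print(group_advantages(binary))  # every advantage is zero: no learning signal
print(group_advantages(partial_rewards(correct, match_rates)))  # ranked by match rate
```

Under the binary reward, a group in which every rollout fails yields identical rewards and therefore zero advantages. With the assumed partial reward, the same group produces negative advantages for rollouts that found few entities and positive ones for rollouts that found most of them.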

Abstract

LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework…
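
The quantity behind that correlation, the share of ground-truth entities an agent surfaces during its reasoning, could be approximated as follows. This is a minimal sketch assuming case-insensitive substring matching over the rollout text; the function name `entity_match_rate` and the example entities are hypothetical, and the paper's actual matching procedure may differ.

```python
def entity_match_rate(trajectory: str, entities: list[str]) -> float:
    """Fraction of ground-truth entities mentioned anywhere in the trajectory text."""
    if not entities:
        return 0.0
    text = trajectory.casefold()
    hits = sum(1 for e in entities if e.casefold() in text)
    return hits / len(entities)

# Hypothetical ground-truth entities attached to one synthetic question.
entities = ["Marie Curie", "Sorbonne", "polonium"]
rollout = "The search shows Marie Curie discovered polonium in 1898."
print(entity_match_rate(rollout, entities))  # two of the three entities are matched
```

Plain substring matching of this kind only detects that an entity was mentioned, not that it was used correctly in the reasoning.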

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Well-motivated improvement: Model intermediate rewards for partially correct rollouts
- Improvement over GRPO visible in empirical evaluation
- Stabilization of training process increases efficiency

Weaknesses

- Empirical gain rather low compared to GRPO

Reviewer 02 · Rating 6 · Confidence 5

Strengths

* **Well-motivated problem**. Although partial correctness has been studied in RL, introducing it in training search agents is a good problem to facilitate this community.
* **Technical soundness**. The solution using the entity-matching score is reasonable, and the analysis of the relationship between accuracy and the matching score supports the insights of the proposed method.

Weaknesses

1. **Handling of incorrect reasoning with correct entities**. The entity-matching reward may inadvertently credit erroneous reasoning paths that happen to mention correct entities without proper understanding. For instance, a model might generate factually incorrect statements while coincidentally including the right entity names. The paper does not address how to distinguish between genuine entity identification and spurious mentions.
2. **Overclaim**. While the entity-aware reward is effective

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The reward formulation originated from a good observation in the data and its design is clear and reasonable.
- The empirical performance appears to be strong and better than vanilla SFT & GRPO.

Weaknesses

- The research contribution appears to be incremental. While E-GRPO seems to outperform GRPO, I am not fully convinced E-GRPO is a fundamentally “novel framework” compared to the original GRPO given the fact that it only customizes the reward function. Partial entity matching in RL is also an established method, especially in NL2SQL field (e.g. Reasoning-SQL).
- It is nuanced whether the performance comparison with other baseline models in the main results is a fair comparison using same trainin
