OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
Erhan Zhang, Yiqun Chen, Zechun Niu, Wei Yang, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

TL;DR
OASES is an outcome-aligned supervision framework for agentic search. It scores intermediate search states by how well they support the final answer, which yields more reliable reward signals for training the search policy.
Contribution
The paper proposes a novel co-training framework that aligns process rewards with outcomes and adapts evaluators to evolving search policies, enhancing multi-hop QA performance.
Findings
OASES outperforms strong RL baselines on five multi-hop QA benchmarks.
Outcome-aligned process rewards improve the reliability of supervision.
Search-evaluation co-training improves the adaptability and effectiveness of agentic search.
Abstract
Agentic search enables language models to solve knowledge-intensive tasks by adaptively acquiring external evidence over multiple steps. Reinforcement learning with verifiable rewards (RLVR) has emerged as a widely adopted training paradigm for search agents, yet outcome-only rewards are sparse and provide limited credit assignment for intermediate search actions. Existing process-reward methods therefore seek to densify supervision through proxy signals, external evaluators, or likelihood-based information gain. However, proxy rewards can deviate from the final outcome objective, while fixed evaluators can become stale as the search policy evolves, leading to unreliable process supervision. To address these challenges, we propose OASES, an Outcome-Aligned Search-Evaluation Supervision framework for agentic search. OASES derives outcome-aligned process rewards by evaluating how well…
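To make the contrast between sparse outcome rewards and denser process rewards concrete, the following is a minimal illustrative sketch, not the OASES method itself, whose details are not given in the text above. It assumes a hypothetical evaluator `support(state)` that returns a score in [0, 1] for how well the evidence gathered so far supports the gold answer, and it rewards each step by the change in that score. The function names, the toy evaluator, and the example trajectory are all assumptions made for illustration.

```python
from typing import Callable, List


def outcome_only_rewards(num_steps: int, final_correct: bool) -> List[float]:
    """Sparse outcome-only signal: zero at every step except the last."""
    rewards = [0.0] * num_steps
    if num_steps:
        rewards[-1] = 1.0 if final_correct else 0.0
    return rewards


def support_gain_rewards(
    states: List[str], support: Callable[[str], float]
) -> List[float]:
    """Dense signal: reward each step by the change in a support score.

    `support` is a hypothetical evaluator (an assumption, not from the paper)
    scoring how well a state supports the final answer, in [0, 1].
    """
    rewards = []
    previous = 0.0
    for state in states:
        current = support(state)
        rewards.append(current - previous)
        previous = current
    return rewards


def toy_support(state: str) -> float:
    # Toy stand-in evaluator: fraction of the required facts that already
    # appear in the accumulated evidence string.
    required = ("born in Warsaw", "Warsaw is the capital of Poland")
    return sum(fact in state for fact in required) / len(required)


# Made-up three-step trajectory; each state is the evidence gathered so far.
states = [
    "query: Marie Curie birthplace",
    "query: Marie Curie birthplace | born in Warsaw",
    "query: Marie Curie birthplace | born in Warsaw | Warsaw is the capital of Poland",
]

print(outcome_only_rewards(len(states), final_correct=True))
print(support_gain_rewards(states, toy_support))
```

In this toy trajectory the outcome-only signal credits only the last step, while the support-gain signal spreads credit over the two steps that actually retrieved useful evidence. A fixed evaluator like `toy_support` is exactly the kind of component the abstract warns can become stale as the policy changes.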
