OThink-SRR1: Search, Refine and Reasoning with Reinforced Learning for Large Language Models
Haijian Liang, Zenghao Niu, Junjie Wu, Changwang Zhang, Wangchunshu Zhou, Jun Wang

TL;DR
OThink-SRR1 is a reinforcement learning framework that improves large language models' multi-hop reasoning by iteratively searching, refining, and reasoning with relevant, concise evidence, reducing noise and computational costs.
Contribution
It introduces an iterative Search-Refine-Reason process trained with reinforcement learning, together with a new algorithm, GRPO-IR, to improve accuracy and efficiency in multi-hop question answering.
Findings
Achieves superior accuracy on four multi-hop QA benchmarks.
Uses fewer retrieval steps and tokens compared to strong baselines.
Effectively balances evidence relevance and retrieval efficiency.
Abstract
Retrieval-Augmented Generation (RAG) expands the knowledge of Large Language Models (LLMs), yet current static retrieval methods struggle with complex, multi-hop problems. While recent dynamic retrieval strategies offer improvements, they face two key challenges: 1) irrelevant retrieved noise can misdirect the reasoning process, and 2) processing full documents incurs prohibitive computational and latency costs. To address these issues, we propose OThink-SRR1, a framework that enhances large models with an iterative Search-Refine-Reason process trained via reinforcement learning. Its core Refine stage distills retrieved documents into concise, relevant facts before reasoning. We introduce GRPO-IR, an end-to-end reinforcement learning algorithm that rewards accurate evidence identification while penalizing excessive retrievals, thus training the model to be both focused and efficient.…
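The loop and reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the GRPO-IR reward formula, so `shaped_reward`, its `retrieval_penalty` weight, and the exact-match evidence score are assumptions chosen only to show the idea of rewarding identified evidence while penalizing extra retrievals. The `search`, `refine`, and `reason` callables are hypothetical stand-ins for a retriever and the model, and the toy corpus is invented. The group-relative advantage step follows the usual GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Callable, List, Optional, Tuple


@dataclass
class Trace:
    """Record of one rollout: refined facts, retrieval count, final answer."""
    facts: List[str] = field(default_factory=list)
    searches: int = 0
    answer: Optional[str] = None


def search_refine_reason(
    question: str,
    search: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], List[str]],
    reason: Callable[[str, List[str]], Tuple[Optional[str], Optional[str]]],
    max_steps: int = 4,
) -> Trace:
    """Iterate Search -> Refine -> Reason until an answer is produced."""
    trace = Trace()
    query = question
    for _ in range(max_steps):
        docs = search(query)                          # Search: fetch raw documents
        trace.searches += 1
        trace.facts.extend(refine(question, docs))    # Refine: keep concise facts only
        answer, next_query = reason(question, trace.facts)  # Reason over refined facts
        if answer is not None:
            trace.answer = answer
            break
        query = next_query or question
    return trace


def shaped_reward(trace: Trace, gold_answer: str, gold_facts: List[str],
                  retrieval_penalty: float = 0.1) -> float:
    """Assumed reward: answer correctness + evidence recall - cost per retrieval."""
    correct = 1.0 if trace.answer == gold_answer else 0.0
    hits = len(set(trace.facts) & set(gold_facts))
    evidence = hits / len(gold_facts) if gold_facts else 0.0
    return correct + evidence - retrieval_penalty * trace.searches


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style normalization of rewards within one group of rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


# Toy two-hop example (invented data, for illustration only).
QUESTION = "What is the capital of Alice's birth country?"
TOY_CORPUS = {
    QUESTION: ["Alice was born in Freedonia.", "Alice likes chess."],
    "capital of Freedonia": ["The capital of Freedonia is Fredville.",
                             "Freedonia exports tin."],
}


def toy_search(query: str) -> List[str]:
    return TOY_CORPUS.get(query, [])


def toy_refine(question: str, docs: List[str]) -> List[str]:
    # Stand-in for the model's Refine stage: drop sentences irrelevant to the question.
    return [d for d in docs if "born in" in d or "capital of" in d]


def toy_reason(question: str, facts: List[str]) -> Tuple[Optional[str], Optional[str]]:
    if any("capital of Freedonia is" in f for f in facts):
        return "Fredville", None
    if any("born in Freedonia" in f for f in facts):
        return None, "capital of Freedonia"   # issue a follow-up query
    return None, question
```

Running `search_refine_reason(QUESTION, toy_search, toy_refine, toy_reason)` on the toy data takes two retrievals and keeps two refined facts. Under the assumed reward, a rollout that reached the same answer with more retrievals would score lower, which is the trade-off the abstract attributes to GRPO-IR.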
