Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents
Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang

TL;DR
This paper introduces David-GRPO, a reinforcement learning method that enhances multi-hop reasoning agents under resource constraints by combining expert trajectories and evidence-guided exploration, leading to deeper retrieval and better evidence coverage.
Contribution
The paper proposes David-GRPO, a novel RL approach that improves small-batch training for multi-hop reasoning agents by integrating expert data and evidence-based exploration strategies.
Findings
David-GRPO outperforms prior RL baselines on six multi-hop QA benchmarks.
Agents trained with David-GRPO increase retrieval depth and evidence coverage.
The method is effective on models up to 1.5B parameters using limited GPU resources.
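To make the two quantities in the findings concrete, here is a minimal sketch of how retrieval depth and evidence coverage could be measured for a single trajectory. The function names, the per-turn list-of-document-IDs representation, and both definitions are illustrative assumptions and are not taken from the paper, which may define these metrics differently.

```python
from typing import List, Set


def retrieval_depth(turns: List[List[str]]) -> int:
    """Count the turns in which a retrieval returned at least one document.

    Assumed definition for illustration only.
    """
    return sum(1 for docs in turns if docs)


def evidence_coverage(turns: List[List[str]], gold: Set[str]) -> float:
    """Fraction of gold evidence documents retrieved at any turn.

    Assumed definition for illustration only.
    """
    if not gold:
        return 0.0
    seen = {doc for docs in turns for doc in docs}
    return len(seen & gold) / len(gold)


# Hypothetical three-turn trajectory; the last turn retrieved nothing.
trajectory = [["doc_a", "doc_x"], ["doc_b"], []]
gold = {"doc_a", "doc_b", "doc_c"}

depth = retrieval_depth(trajectory)
coverage = evidence_coverage(trajectory, gold)
```

Under these assumed definitions, coverage is computed over the whole evidence set rather than per retrieved document, which is the distinction the abstract draws against retrieval-level rewards.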
Abstract
Multi-turn reasoning agents solve complex questions by decomposing them into intermediate retrieval or tool-use steps, accumulating supporting evidence across turns. Training these agents with reinforcement learning (RL), however, relies on many on-policy rollouts and large training batches. Under realistic resource constraints that make dense exploration infeasible, each RL batch contains only a few useful reasoning paths from the current policy. Existing approaches do not fully address this bottleneck: SFT-based initialization can overfit when annotated trajectories are scarce, retrieval-level rewards can assign credit to individual retrieved documents without directly optimizing coverage of the full evidence set, and expansion can waste rollouts from poorly chosen prefixes. We introduce David-GRPO, which improves small-batch learning by using information from both outside and…
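The small-batch bottleneck described in the abstract can be illustrated with the group-normalized advantage commonly used in GRPO-style training. The sketch below is a generic illustration, not David-GRPO itself: the function name, the `eps` value, and the use of the population standard deviation are assumptions, and implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group.

    Generic GRPO-style advantage: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A small group where every rollout fails: all advantages are zero,
# so the group contributes no learning signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# The same group with a single successful rollout: only now does the
# successful path receive a positive advantage.
one_success = group_relative_advantages([0.0, 0.0, 0.0, 1.0])
```

With few rollouts per question, groups like `all_fail` become common on hard multi-hop questions, which is one way to read the abstract's point that each batch contains only a few useful reasoning paths.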
