TL;DR
This paper introduces ActGuide-RL, a method that uses human action data as guidance to improve agentic reinforcement learning in large language models, reducing reliance on costly supervised fine-tuning.
Contribution
It injects action data as plan-style reference guidance, so the policy can reach reward states in reward-sparse tasks without costly iterative supervised fine-tuning.
Findings
ActGuide-RL significantly outperforms zero RL on search-agent benchmarks.
Without an SFT cold start, it matches the performance of SFT+RL pipelines.
The method effectively internalizes exploration gains through mixed-policy training.
Abstract
Agentic reinforcement learning (RL) for Large Language Models (LLMs) critically depends on the exploration capability of the base policy, as training signals emerge only within its in-capability region. For tasks where the base policy cannot reach reward states, additional training or external guidance is needed to recover effective learning signals. Rather than relying on costly iterative supervised fine-tuning (SFT), we exploit the abundant action data generated in everyday human interactions. We propose ActGuide-RL, which injects action data as plan-style reference guidance, enabling the agentic policy to overcome reachability barriers to reward states. Guided and unguided rollouts are then jointly optimized via mixed-policy training, internalizing the exploration gains back into the unguided policy. Motivated by a theoretical and empirical analysis of the benefit-risk…
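The guided/unguided mixing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`build_prompt`, `collect_mixed_group`), the "Reference plan" prompt layout, the stand-in `reward_fn`, and the group-normalized advantage (a GRPO-style choice) are all assumptions made for illustration, since the abstract does not specify the prompt format or the optimizer.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Optional


@dataclass
class Rollout:
    prompt: str
    guided: bool          # True if the prompt carried a reference plan
    reward: float
    advantage: float = 0.0


def build_prompt(task: str, actions: Optional[List[str]] = None) -> str:
    """Return the task prompt, optionally prefixed by a plan built from an action trace.

    Hypothetical format: the real guidance template is not given in the abstract.
    """
    if not actions:
        return task
    plan = "\n".join(f"{i}. {a}" for i, a in enumerate(actions, start=1))
    return f"Reference plan:\n{plan}\n\nTask: {task}"


def collect_mixed_group(
    task: str,
    actions: List[str],
    reward_fn: Callable[[str], float],
    n_guided: int,
    n_unguided: int,
) -> List[Rollout]:
    """Score guided and unguided rollouts of one task and normalize them as one group.

    `reward_fn` stands in for "run the policy on this prompt and score the result".
    Normalizing across the mixed group is an assumed, GRPO-style choice.
    """
    rollouts = []
    for guided in [True] * n_guided + [False] * n_unguided:
        prompt = build_prompt(task, actions if guided else None)
        rollouts.append(Rollout(prompt, guided, reward_fn(prompt)))

    rewards = [r.reward for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    for r in rollouts:
        # Small epsilon avoids division by zero when all rewards are equal.
        r.advantage = (r.reward - mu) / (sigma + 1e-6)
    return rollouts


if __name__ == "__main__":
    # Toy reward: pretend only guided prompts reach the reward state.
    def toy_reward(prompt: str) -> float:
        return 1.0 if prompt.startswith("Reference plan") else 0.0

    group = collect_mixed_group(
        "Find the launch year of the telescope.",
        ["search for the telescope name", "open the top result", "read the launch date"],
        toy_reward,
        n_guided=2,
        n_unguided=2,
    )
    for r in group:
        print(r.guided, r.reward, round(r.advantage, 3))
```

In this toy setup, an unguided-only group would have identical rewards and therefore zero advantages everywhere. Adding guided rollouts to the group restores a non-zero learning signal, which is the effect the abstract attributes to guidance in reward-sparse tasks.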
